UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-6: invalid data

Question

How does the Unicode thing work in Python 2? I just don't get it.

Here I download data from a server and parse it as JSON:

    Traceback (most recent call last):
      File "/usr/local/lib/python2.6/dist-packages/eventlet-0.9.12-py2.6.egg/eventlet/hubs/poll.py", line 92, in wait
        readers.get(fileno, noop).cb(fileno)
      File "/usr/local/lib/python2.6/dist-packages/eventlet-0.9.12-py2.6.egg/eventlet/greenthread.py", line 202, in main
        result = function(*args, **kwargs)
      File "android_suggest.py", line 60, in fetch
        suggestions = suggest(chars)
      File "android_suggest.py", line 28, in suggest
        return [i['s'] for i in json.loads(opener.open('https://market.android.com/suggest/SuggRequest?json=1&query='+s+'&hl=de&gl=DE').read())]
      File "/usr/lib/python2.6/json/__init__.py", line 307, in loads
        return _default_decoder.decode(s)
      File "/usr/lib/python2.6/json/decoder.py", line 319, in decode
        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
      File "/usr/lib/python2.6/json/decoder.py", line 336, in raw_decode
        obj, end = self._scanner.iterscan(s, **kw).next()
      File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
        rval, next_pos = action(m, context)
      File "/usr/lib/python2.6/json/decoder.py", line 217, in JSONArray
        value, end = iterscan(s, idx=end, context=context).next()
      File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
        rval, next_pos = action(m, context)
      File "/usr/lib/python2.6/json/decoder.py", line 183, in JSONObject
        value, end = iterscan(s, idx=end, context=context).next()
      File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
        rval, next_pos = action(m, context)
      File "/usr/lib/python2.6/json/decoder.py", line 155, in JSONString
        return scanstring(match.string, match.end(), encoding, strict)
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-6: invalid data

Thank you!

EDIT: the following string causes the error:

    '[{"t":"q","s":"abh\xf6ren"}]'

\xf6 should be decoded to ö (abhören).

User · Answer

Just in case someone has the same problem: I am using Vim with YouCompleteMe, and ycmd failed to start with this error message. What I did was run export LC_CTYPE="en_US.UTF-8", and the problem was gone.

User · Answer

The error you're seeing means the data you receive from the remote end isn't valid JSON. JSON, according to the specification, is normally UTF-8, but can also be UTF-16 or UTF-32 (in either big- or little-endian byte order). The exact error you're seeing means some part of the data was not valid UTF-8 (and also wasn't UTF-16 or UTF-32, as those would produce different errors).

Perhaps you should examine the actual response you receive from the remote end instead of blindly passing the data to json.loads(). Right now, you're reading all the data from the response into a string and assuming it's JSON. Instead, check the content type of the response. Make sure the web page is actually claiming to give you JSON and not, for example, an error message that isn't JSON.

Also, after checking the response, use json.load() by passing it the file-like object returned by opener.open(), instead of reading all the data into a string and passing that to json.loads().
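To make the "check the content type first" advice concrete, here is a minimal sketch. It is written in Python 3 syntax rather than the Python 2.6 of the question, and parse_json_response is a hypothetical helper, not part of any library. It assumes you already have the raw body bytes and the value of the Content-Type header in hand.

```python
import json
from email.message import Message


def parse_json_response(body, content_type, default_charset="utf-8"):
    """Decode and parse a response body only if it claims to be JSON.

    Hypothetical helper: `body` is the raw bytes of the response and
    `content_type` is the value of its Content-Type header.
    """
    # Let the standard library parse the header value for us.
    header = Message()
    header["Content-Type"] = content_type

    if header.get_content_type() != "application/json":
        raise ValueError("expected JSON, got %r" % content_type)

    # Use the declared charset if there is one, else fall back to UTF-8.
    charset = header.get_content_charset() or default_charset
    return json.loads(body.decode(charset))
```

With the string from the question and a header of "application/json; charset=ISO-8859-1", this should return the parsed list, while a "text/html" error page is rejected before it ever reaches the JSON parser.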

User · Answer

The string you're trying to parse as JSON is not encoded in UTF-8. Most likely it is encoded in ISO-8859-1. Try the following:

    json.loads(unicode(opener.open(...).read(), "ISO-8859-1"))

That will handle any umlauts that might appear in the JSON message.

You should read Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)". I hope it will clarify some of the issues you're having with Unicode.
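A small sketch of what is going on, using the byte string quoted in the question. This is Python 3 syntax (the thread itself is about Python 2), and the variable names are only for illustration. The byte 0xF6 is "ö" in ISO-8859-1, but on its own it is not a valid UTF-8 sequence.

```python
import json

# The response body from the question, as raw bytes.
raw = b'[{"t":"q","s":"abh\xf6ren"}]'

# Decoding as UTF-8 fails because 0xF6 is not valid UTF-8 here.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding as ISO-8859-1 maps 0xF6 to the character U+00F6 (o-umlaut).
suggestions = json.loads(raw.decode("iso-8859-1"))
```

Note that ISO-8859-1 accepts every possible byte, so a successful decode does not prove the guess was right. It only shows that the result looks sensible for this input.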

User · Answer

The solution of changing the encoding to Latin-1 / ISO-8859-1 solves an issue I observed with html2text.py when invoked on the output of tex4ht. I use that for an automated word count on LaTeX documents: tex4ht converts them to HTML, and then html2text.py strips them down to pure text for further counting through wc -w.

Now, if, for example, a German umlaut comes in through a literature database entry, that process would fail, as html2text.py would complain, e.g.: UnicodeDecodeError: 'utf8' codec can't decode bytes in position 32243-32245: invalid data. These errors are then particularly hard to track down, and essentially you want to have the umlaut in your references section.

A simple change inside html2text.py from data = data.decode(encoding) to data = data.decode("ISO-8859-1") solves that issue. If you're calling the script with the HTML file as the first parameter, you can also pass the encoding as the second parameter and spare yourself the modification.
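If you would rather not hard-code a single encoding, one possible variation is to try UTF-8 first and fall back to ISO-8859-1. The sketch below is in Python 3 syntax; decode_with_fallback is a hypothetical helper and is not something html2text.py provides.

```python
def decode_with_fallback(data, encodings=("utf-8", "iso-8859-1")):
    """Return `data` decoded with the first encoding that accepts it.

    Hypothetical helper. Put the strictest encoding first: ISO-8859-1
    accepts any byte sequence, so it only makes sense as the last resort.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of %r could decode the data" % (encodings,))
```

The order matters here: valid UTF-8 input is decoded as UTF-8, and only input that fails that check is treated as Latin-1.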

User · Answer

Paste this on your command line: export LC_CTYPE="en_US.UTF-8"
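To see whether the locale setting actually changed anything, you can ask Python which encodings it picked up from the environment. This is only a sketch in Python 3 syntax, and describe_encodings is a hypothetical name; the values it returns depend entirely on your system.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python derived from the current environment."""
    return {
        # Encoding implied by the locale settings (LC_ALL, LC_CTYPE, LANG).
        "preferred": locale.getpreferredencoding(False),
        # Encoding used to convert file names.
        "filesystem": sys.getfilesystemencoding(),
        # Encoding of standard output, if it has one.
        "stdout": getattr(sys.stdout, "encoding", None),
    }
```

Printing the result of describe_encodings() before and after the export should show whether the locale change reached the Python process.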

User · Answer

My solution is a bit funny. I never thought it would be as easy as "Save as" with the UTF-8 codec. I'm using Notepad++ v5.6.8, and I didn't notice that I had initially saved the file with the ANSI codec. I'm using a separate file to hold my localized dictionary. I found my solution under the "Encoding" tab in Notepad++: I selected "Encoding in UTF-8 without BOM" and saved the file. It works brilliantly.
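The same conversion can be done without an editor. The sketch below (Python 3 syntax, hypothetical helper name) assumes that "ANSI" means Windows-1252, which is typical on Western European Windows systems but is an assumption you should verify for your own file.

```python
import io


def reencode_file(path, source_encoding="cp1252", target_encoding="utf-8"):
    """Rewrite a text file in place using a different encoding.

    Hypothetical helper. The default source encoding is an assumption:
    "ANSI" in Notepad++ depends on the Windows code page in use.
    """
    # newline="" keeps the original line endings untouched.
    with io.open(path, "r", encoding=source_encoding, newline="") as handle:
        text = handle.read()
    with io.open(path, "w", encoding=target_encoding, newline="") as handle:
        handle.write(text)
```

Python's "utf-8" codec does not write a byte order mark, which matches the "UTF-8 without BOM" option mentioned above. Keep a backup of the file, since the rewrite happens in place.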

User · Answer

Temporary workaround: unicode(urllib2.urlopen(url).read(), 'utf8'). This should work if what is returned is UTF-8.

urlopen().read() returns bytes, and you have to decode them to Unicode strings. It would also be helpful to check the patch from http://bugs.python.org/issue4733
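For comparison, a rough Python 3 equivalent of that workaround might look like the sketch below. In Python 3, urllib2 became urllib.request and unicode() became bytes.decode(); fetch_text is a hypothetical helper name, and the encoding still has to be the one the server really used.

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Fetch a URL and decode the response body to a text string.

    Hypothetical helper: `encoding` must match what the server sent,
    otherwise decoding raises UnicodeDecodeError.
    """
    response = urlopen(url)
    try:
        raw = response.read()  # bytes, not text
    finally:
        response.close()
    return raw.decode(encoding)
```

Passing encoding="iso-8859-1" instead of the default would cover the case from the question, where the body turned out not to be UTF-8.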

User · Answer

In your android_suggest.py, break up that monstrous one-liner return statement into one-step-at-a-time pieces. Log repr(string_passed_to_json.loads) somewhere so that it can be checked after an exception happens. Eyeball the results. If the problem is not evident, edit your question to show the repr.
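As a sketch of what "one step at a time" could look like, here is the parsing half of that one-liner split into separate steps, with the raw data logged when decoding fails. It uses Python 3 syntax; parse_suggestions and the logger name are made up for illustration, and the "s" key is taken from the sample string in the question.

```python
import json
import logging

log = logging.getLogger("android_suggest")


def parse_suggestions(raw):
    """Turn the raw response bytes into a list of suggestion strings.

    Hypothetical helper: each step is separate so a failure points at
    one operation, and the offending bytes are logged with repr-style
    formatting for later inspection.
    """
    # Step 1: bytes to text. Log the raw data if this goes wrong.
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        log.error("could not decode response: %r", raw)
        raise

    # Step 2: text to Python objects.
    entries = json.loads(text)

    # Step 3: pull out the field we care about.
    return [entry["s"] for entry in entries]
```

The fetching step (opener.open(...).read()) would sit in front of this as its own statement, so the traceback line number tells you whether the download or the parsing failed.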
