UnicodeDecodeError, invalid continuation byte

Question

Why is the below item failing? Why does it succeed with the "latin-1" codec?

    o = "a test of \xe9 char"  # I want this to remain a string as this is what I am receiving
    v = o.decode("utf-8")

Which results in:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 10: invalid continuation byte

User · Answer

This happened to me too, while I was reading text containing Hebrew from a .txt file. I clicked File -> Save As and saved the file with UTF-8 encoding.
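The same re-save can be done in code. This is a minimal sketch, assuming the original encoding of the file is known. The function name `resave_as_utf8`, the file names, and the `cp1255` (Windows Hebrew) default are illustrative assumptions and are not stated in the answer.

```python
import os
import tempfile


def resave_as_utf8(src_path, dst_path, src_encoding="cp1255"):
    """Read src_path using src_encoding and write the same text as UTF-8.

    src_encoding is an assumption: it must match how the file was really saved.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Demo with a throwaway file holding a Hebrew word encoded as cp1255.
hebrew_word = "\u05e9\u05dc\u05d5\u05dd"
tmp_dir = tempfile.mkdtemp()
src_file = os.path.join(tmp_dir, "hebrew_cp1255.txt")
dst_file = os.path.join(tmp_dir, "hebrew_utf8.txt")
with open(src_file, "wb") as f:
    f.write(hebrew_word.encode("cp1255"))

resave_as_utf8(src_file, dst_file)

# The converted file now decodes cleanly as UTF-8.
with open(dst_file, "r", encoding="utf-8") as f:
    converted = f.read()
```

If the source encoding is guessed wrongly, the conversion may still run without error and produce the wrong characters, so it is worth checking the output by eye.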

User · Answer

This type of error comes up when you read a particular file or data set in pandas, for example:

    data = pd.read_csv('/kaggle/input/fertilizers-by-product-fao/FertilizersProduct.csv')

The error is then displayed like this:

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf4 in position 1: invalid continuation byte

The error can be avoided by adding an encoding argument:

    data = pd.read_csv('/kaggle/input/fertilizers-by-product-fao/FertilizersProduct.csv', encoding='ISO-8859-1')

User · Answer

Because UTF-8 is multibyte and there is no char corresponding to your combination of \xe9 plus following space.

Why should it succeed in both utf-8 and latin-1?

Here is how the same sentence should look in utf-8:

    >>> o.decode('latin-1').encode("utf-8")
    'a test of \xc3\xa9 char'
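For readers on Python 3, where `str` has no `decode` method, the same conversion can be sketched starting from a bytes object. The variable names here are illustrative.

```python
# The bytes as received: 0xE9 is "e acute" in Latin-1.
raw = b"a test of \xe9 char"

# Decode with the encoding the bytes were actually written in...
text = raw.decode("latin-1")

# ...then encode to UTF-8, where the same character takes two bytes.
utf8_bytes = text.encode("utf-8")
```

After this, `utf8_bytes.decode("utf-8")` succeeds and gives back the same text.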

User · Answer

In binary, 0xE9 looks like 1110 1001. If you read about UTF-8 on Wikipedia, you'll see that such a byte must be followed by two of the form 10xx xxxx. So, for example:

    >>> b'\xe9\x80\x80'.decode('utf-8')
    u'\u9000'

But that's just the mechanical cause of the exception. In this case, you have a string that is almost certainly encoded in latin 1. You can see how UTF-8 and latin 1 look different:

    >>> u'\xe9'.encode('utf-8')
    b'\xc3\xa9'
    >>> u'\xe9'.encode('latin-1')
    b'\xe9'

(Note: I'm using a mix of Python 2 and 3 representation here. The input is valid in any version of Python, but your Python interpreter is unlikely to actually show both unicode and byte strings in this way.)
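The bit patterns can also be checked directly. This is a rough sketch in Python 3: the helper names `expected_continuations` and `is_continuation` are made up for illustration, and they only look at the leading bits, so they do not reject every byte that a real UTF-8 decoder would.

```python
def expected_continuations(lead_byte):
    """How many 10xxxxxx bytes the lead byte announces, or None if it is not a lead byte."""
    if lead_byte & 0b10000000 == 0b00000000:
        return 0  # 0xxxxxxx: plain ASCII, stands alone
    if lead_byte & 0b11100000 == 0b11000000:
        return 1  # 110xxxxx
    if lead_byte & 0b11110000 == 0b11100000:
        return 2  # 1110xxxx, the pattern of 0xE9
    if lead_byte & 0b11111000 == 0b11110000:
        return 3  # 11110xxx
    return None  # 10xxxxxx (a continuation byte) or 11111xxx


def is_continuation(byte):
    """True if the byte has the form 10xxxxxx."""
    return byte & 0b11000000 == 0b10000000


raw = b"a test of \xe9 char"
try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc  # exc.start is the offset of the offending byte

needed = expected_continuations(raw[10])   # 0xE9 announces two continuation bytes
next_is_ok = is_continuation(raw[11])      # but the next byte is a space, 0x20
```

Here the lead byte promises two continuation bytes and the very next byte is an ordinary space, which is the situation the error message describes.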

User · Answer

If this error arises when manipulating a file that was just opened, check to see if you opened it in 'rb' mode.
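To show what the mode changes, here is a small Python 3 sketch using a temporary file. The file name and contents are made up for the demonstration. In 'rb' mode you get raw bytes and nothing is decoded, while in text mode the file is decoded with whatever encoding is passed to `open` (or the platform default if none is given).

```python
import os
import tempfile

# Write a file whose content is Latin-1, not UTF-8.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"a test of \xe9 char")

# Binary mode: no decoding happens, so no UnicodeDecodeError is possible here.
with open(path, "rb") as f:
    raw = f.read()

# Text mode with the wrong encoding: decoding fails while reading.
try:
    with open(path, "r", encoding="utf-8") as f:
        f.read()
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Text mode with the right encoding: decoding succeeds.
with open(path, "r", encoding="latin-1") as f:
    text = f.read()
```

If the file was opened in 'rb' mode, the error comes from a later explicit `.decode(...)` call on the bytes, and that call is the place to fix the encoding name.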

User · Answer

Use this if it shows the UTF-8 error:

    pd.read_csv('File_name.csv', encoding='latin-1')

User · Answer

I had the same error when I tried to open a CSV file with the pandas read_csv method. The solution was to change the encoding to latin-1:

    pd.read_csv('ml-100k/u.item', sep='|', names=m_cols, encoding='latin-1')

User · Answer

It is invalid UTF-8. That character is the e-acute character in ISO-Latin1, which is why it succeeds with that codeset.

If you don't know the codeset you're receiving strings in, you're in a bit of trouble. It would be best if a single codeset (hopefully UTF-8) would be chosen for your protocol/application and then you'd just reject ones that didn't decode.

If you can't do that, you'll need heuristics.
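Both policies can be sketched in a few lines of Python 3. The function names are hypothetical, and the Latin-1 fallback is only one crude heuristic, chosen here because it matches the question. It is not a general way to detect encodings.

```python
def decode_utf8_or_reject(raw):
    """Strict policy: return the text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


def decode_with_fallback(raw, fallback="latin-1"):
    """Heuristic policy: try UTF-8 first, then assume the fallback encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never raises,
        # but the result is wrong if the data was in some other encoding.
        return raw.decode(fallback)


rejected = decode_utf8_or_reject(b"a test of \xe9 char")
accepted = decode_utf8_or_reject(b"a test of \xc3\xa9 char")
guessed = decode_with_fallback(b"a test of \xe9 char")
```

Trying UTF-8 first is a reasonable order because text in other 8-bit encodings rarely happens to be valid UTF-8, whereas the Latin-1 step accepts anything.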

User · Answer

In this case, I tried to execute a .py which runs a path/file.sql. My solution was to change the encoding of file.sql to "UTF-8 without BOM", and it works. You can do it with Notepad++.

I will leave a part of my code.

Code:

    import sys
    import psycopg2

    # Connection parameters are taken from the command line
    con = psycopg2.connect(host=sys.argv[1], port=sys.argv[2], dbname=sys.argv[3], user=sys.argv[4], password=sys.argv[5])
    cursor = con.cursor()
    sqlfile = open(path, 'r')
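As a side note on the BOM part only, Python 3 has a `utf-8-sig` codec that drops a leading byte order mark when decoding. The sketch below uses made-up file content and shows how a BOM-prefixed byte string can be turned into "UTF-8 without BOM" in code. It does not reproduce the answer's setup, and a BOM by itself is still valid UTF-8, so the Notepad++ conversion probably also changed the underlying encoding.

```python
import codecs

# Hypothetical file content: a UTF-8 BOM followed by an SQL statement.
with_bom = codecs.BOM_UTF8 + "SELECT 1;".encode("utf-8")

# Plain utf-8 keeps the BOM as an invisible first character (U+FEFF).
plain = with_bom.decode("utf-8")

# utf-8-sig strips the BOM if it is present.
stripped = with_bom.decode("utf-8-sig")

# Re-encoding the stripped text gives UTF-8 without BOM.
without_bom = stripped.encode("utf-8")
```

Passing `encoding="utf-8-sig"` to `open` has the same effect when reading a file in text mode.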

User · Answer

This UTF-8 decode error usually comes up when the data contains bytes outside the range 0 to 127 that do not form a valid UTF-8 multibyte sequence. The rules of the ASCII encoding show why that range matters:

1. If the code point is < 128, each byte is the same as the value of the code point.
2. If the code point is 128 or greater, the Unicode string can't be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)

To overcome this we have a set of other encodings; one of the most widely used is Latin-1, also known as ISO-8859-1. Unicode code points 0-255 are identical to the Latin-1 values, so converting to this encoding simply requires converting code points to byte values; if a code point larger than 255 is encountered, the string can't be encoded into Latin-1. In the other direction, every byte value decodes to some character in Latin-1, which is why reading with it never raises UnicodeDecodeError.

When this exception occurs while you are trying to load a data set, try this format:

    df = pd.read_csv("top50.csv", encoding='ISO-8859-1')

Adding the encoding argument at the end of the call lets the data set load.
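The one-to-one mapping between Latin-1 bytes and code points 0-255 can be checked in a few lines of Python 3. This is only an illustration of that property, with made-up variable names.

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# Latin-1 decodes each byte to the code point with the same number,
# so decoding can never fail.
decoded = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in decoded]

# Encoding back gives the original bytes.
round_trip = decoded.encode("latin-1")

# A code point above 255 (here the euro sign, U+20AC) cannot be encoded.
try:
    "\u20ac".encode("latin-1")
    euro_encodable = True
except UnicodeEncodeError:
    euro_encodable = False
```

Because decoding always succeeds, a clean load with `encoding='ISO-8859-1'` does not prove the file really is Latin-1. If the characters look wrong afterwards, the file is probably in a different encoding.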
