UnicodeDecodeError when reading CSV file in Pandas with Python

Question

I'm running a program which is processing 30,000 similar files. A random number of them are stopping and producing this error:

    File "C:\Importer\src\dfman\importer.py", line 26, in import_chr
        data = pd.read_csv(filepath, names=fields)
    File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
        return _read(filepath_or_buffer, kwds)
    File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
        return parser.read()
    File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
        ret = self._engine.read(nrows)
    File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
        data = self._reader.read(nrows)
    File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
    File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)
    File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)
    File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)
    File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)
    File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)
    File "parser.pyx", line 1046, in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)
    File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas\parser.c:15657)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 6: invalid continuation byte

These files all come from the same source. What's the best way to correct this so the import can proceed?

User · Answer

In my case, a file has UCS-2 LE BOM encoding, according to Notepad++. For Python, that is encoding="utf_16_le". Hope it helps someone find an answer a bit faster.
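When an editor reports a BOM, the first bytes of the file can be checked directly. The following is a minimal sketch, not a full detector: `sniff_bom` and `_BOMS` are hypothetical names, and it only recognises files that actually start with a byte order mark. It returns "utf-16" rather than "utf_16_le" because the generic codec consumes the BOM, whereas the explicit little-endian codec leaves it in the text as a U+FEFF character.

```python
import codecs

# BOM signatures, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw: bytes):
    """Return a codec name if raw starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None


# Made-up sample: a UTF-16 LE header line preceded by its BOM.
sample = codecs.BOM_UTF16_LE + "id,name\n".encode("utf-16-le")
encoding = sniff_bom(sample)
print(encoding, repr(sample.decode(encoding)))
```

For a real file, the first four bytes read in "rb" mode are enough to pass to `sniff_bom`, and the result can then be given to `pd.read_csv(..., encoding=...)`.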

User · Answer

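Try this:

    import pandas as pd

    with open('filename.csv') as f:
        data = pd.read_csv(f)

It looks like this takes care of the encoding without passing it explicitly as an argument. Note that open() then decodes with the platform's default encoding, so this only helps when the file happens to match that default.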

User · Answer

I am posting an update to this old thread. I found one solution that worked, but it requires opening each file. I opened my CSV file in LibreOffice and chose Save As > Edit filter settings. In the drop-down menu I chose UTF8 encoding. Then I added encoding="utf-8-sig" to the read_csv call:

    data = pd.read_csv(r'C:\fullpathtofile\filename.csv', sep=',', encoding="utf-8-sig")

Hope this helps someone.

User · Answer

read_csv takes an encoding option to deal with files in different formats. I mostly use read_csv('file', encoding="ISO-8859-1"), or alternatively encoding="utf-8" for reading, and generally utf-8 for to_csv.

You can also use one of several alias options like 'latin' instead of 'ISO-8859-1' (see the Python docs, also for numerous other encodings you may encounter).

See the relevant Pandas documentation, the Python docs examples on CSV files, and plenty of related questions here on SO. A good background resource is "What every developer should know about unicode and character sets".

To detect the encoding (assuming the file contains non-ASCII characters), you can use enca (see man page), or file -i (Linux) or file -I (OS X) (see man page).
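A small illustration of the encoding option, using made-up in-memory data instead of a real file. The sample contains the byte 0xda ("Ú" in ISO-8859-1), the same byte named in the question's error message; the column names and values are assumptions for the demo.

```python
import io

import pandas as pd

# Hypothetical sample: one row containing "Ú" (byte 0xda in ISO-8859-1).
raw = "name,city\nÚrsula,Madrid\n".encode("iso-8859-1")

try:
    pd.read_csv(io.BytesIO(raw))  # no encoding given, so UTF-8 is assumed
except UnicodeDecodeError as exc:
    print("utf-8 failed:", exc.reason)

# Naming the real encoding lets the same bytes load cleanly.
df = pd.read_csv(io.BytesIO(raw), encoding="iso-8859-1")
print(df.loc[0, "name"])
```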

User · Answer

Pandas lets you specify the encoding, but read_csv historically offered no way to ignore errors or to automatically replace the offending bytes (pandas 1.3 and later add an encoding_errors parameter for this). So there is no one-size-fits-all method, but different ways depending on the actual use case.

You know the encoding, and there is no encoding error in the file. Great: you just have to specify the encoding:

    file_encoding = 'cp1252'  # set file_encoding to the file encoding (utf8, latin1, etc.)
    pd.read_csv(input_file_and_path, ..., encoding=file_encoding)

You do not want to be bothered with encoding questions, and only want that damn file to load, no matter if some text fields contain garbage. OK, you only have to use Latin1 encoding, because it accepts any possible byte as input (and converts it to the Unicode character with the same code):

    pd.read_csv(input_file_and_path, ..., encoding='latin1')

You know that most of the file is written with a specific encoding, but it also contains encoding errors. A real-world example is a UTF8 file that has been edited with a non-UTF8 editor and which contains some lines with a different encoding. Pandas has no provision for special error processing, but Python's open function has (assuming Python 3), and read_csv accepts a file-like object. Typical errors parameters to use here are 'ignore', which just suppresses the offending bytes, or (IMHO better) 'backslashreplace', which replaces the offending bytes with their Python backslashed escape sequence:

    file_encoding = 'utf8'  # set file_encoding to the file encoding (utf8, latin1, etc.)
    input_fd = open(input_file_and_path, encoding=file_encoding, errors='backslashreplace')
    pd.read_csv(input_fd, ...)
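To make the 'backslashreplace' behaviour concrete, here is a sketch on made-up bytes: a UTF-8 file in which one row carries a stray Latin-1 byte. It wraps an in-memory buffer with io.TextIOWrapper, which takes the same encoding and errors arguments as open(); with a real file, open(path, encoding=..., errors=...) would play the same role.

```python
import io

import pandas as pd

# Mostly UTF-8, but the second row has a stray Latin-1 byte (0xf1, "ñ").
raw = "city\nMálaga\n".encode("utf-8") + b"La Ca\xf1ada\n"

# Decode as UTF-8, escaping undecodable bytes instead of raising.
text = io.TextIOWrapper(
    io.BytesIO(raw), encoding="utf-8", errors="backslashreplace"
)
df = pd.read_csv(text)
print(df["city"].tolist())
```

The damaged value stays visible as a literal \xf1 escape in the loaded column, which makes the affected rows easy to find and repair afterwards.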

User · Answer

I had trouble opening a CSV file in simplified Chinese downloaded from an online bank. I tried latin1, I tried iso-8859-1, I tried cp1252, all to no avail. But pd.read_csv(..., encoding='gbk') simply does the work.

User · Answer

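Check the encoding before you pass to pandas. It will slow you down, but...

    with open(path, 'r') as f:
        encoding = f.encoding

    df = pd.read_csv(path, sep=sep, encoding=encoding)

In Python 3.7. Note that f.encoding is the encoding Python chose when opening the file (normally the platform default), not one detected from the file's contents.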
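A quick check of what f.encoding actually reports. The sketch below writes two temporary files with different byte content and opens each without an encoding argument; `default_open_encoding` is a hypothetical helper name. Nothing is read from the files, so no decoding error can occur here.

```python
import os
import tempfile


def default_open_encoding(payload: bytes) -> str:
    """Write payload to a temp file and return f.encoding for a plain open()."""
    with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
        tmp.write(payload)
    try:
        with open(tmp.name, "r") as f:
            return f.encoding
    finally:
        os.remove(tmp.name)


utf8_file = default_open_encoding("ñ".encode("utf-8"))
gbk_file = default_open_encoding("中文".encode("gbk"))
print(utf8_file, gbk_file)
```

Both calls report the same name, because the value comes from the interpreter's default rather than from the bytes on disk.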

User · Answer

Try specifying engine='python'. It worked for me, but I'm still trying to figure out why.

    df = pd.read_csv(input_file_path, ..., engine='python')

User · Answer

Another important issue that I faced, which resulted in the same error, was:

    values = pd.read_csv(r"C:\Users\Mujeeb\Desktop\file.xlsx")

This line resulted in the same error because I was reading an Excel file using the read_csv() method. Use read_excel() for reading .xlsx files.

User · Answer

You can try this:

    import csv
    import pandas as pd

    df = pd.read_csv(filepath, encoding='unicode_escape')

User · Answer

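    with open('filename.csv') as f:
        print(f)

After executing this code you will see the encoding reported for 'filename.csv' (this is the encoding Python used to open the file, usually the platform default). Then execute the following:

    data = pd.read_csv('filename.csv', encoding="encoding as you found earlier")

There you go.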

User · Answer

This is a more general script approach for the stated question:

    import pandas as pd

    encoding_list = ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437', 'cp500', 'cp720', 'cp737',
                     'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp858', 'cp860', 'cp861', 'cp862',
                     'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950',
                     'cp1006', 'cp1026', 'cp1125', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254',
                     'cp1255', 'cp1256', 'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr',
                     'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2',
                     'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2',
                     'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9',
                     'iso8859_10', 'iso8859_11', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab',
                     'koi8_r', 'koi8_t', 'koi8_u', 'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2',
                     'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_32',
                     'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig']

    for encoding in encoding_list:
        worked = True
        try:
            # only parse a small chunk of the file with each candidate
            df = pd.read_csv(path, encoding=encoding, nrows=5)
        except Exception:
            worked = False
        if worked:
            print(encoding, ':\n', df.head())

One starts with all the standard encodings available for the Python version (in this case 3.7; see the Python 3.7 standard encodings). A usable Python list of the standard encodings for the different Python versions is provided in a helpful Stack Overflow answer.

The script tries each encoding on a small chunk of the data and prints only the encodings that work, so the output is directly obvious. This output also addresses the problem that an encoding like 'latin1', which runs through without any error, does not necessarily produce the wanted outcome.

In the case of the question, I would try this approach specifically on a problematic CSV file and then maybe try to use the working encoding it finds for all the others.
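The same idea can be sketched without pandas by testing whether raw bytes decode at all under each candidate. This is an illustration under assumptions: `candidate_encodings` is a hypothetical helper, the short default list is arbitrary, and a clean decode only shows that an encoding is possible, not that it is the right one.

```python
def candidate_encodings(
    raw: bytes,
    encodings=("utf-8", "utf-16", "cp1252", "gbk", "latin-1"),
):
    """Return the encodings from the list that decode raw without error."""
    ok = []
    for name in encodings:
        try:
            raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # LookupError covers codec names unknown to this Python build.
            continue
        ok.append(name)
    return ok


# Made-up sample: "ñ" stored as the single byte 0xf1 (cp1252 / Latin-1).
result = candidate_encodings("La Cañada".encode("cp1252"))
print(result)
```

Several names usually survive, which is why the decoded text still has to be inspected by eye before an encoding is applied to a whole batch of files.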

User · Answer

Struggled with this a while and thought I'd post on this question as it's the first search result. Adding the encoding="iso-8859-1" argument to pandas read_csv didn't work, nor did any other encoding; it kept giving a UnicodeDecodeError.

If you're passing a file handle to pd.read_csv(), you need to put the encoding attribute on the file open, not in read_csv. Obvious in hindsight, but a subtle error to track down.
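A minimal sketch of the file-handle case, using a temporary file with made-up contents. The point it illustrates is that a handle opened in text mode already yields decoded strings, so the encoding has to be given to open().

```python
import os
import tempfile

import pandas as pd

# Write a small ISO-8859-1 file to a temporary path (sample data is made up).
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("name\nJosé\n".encode("iso-8859-1"))
    path = tmp.name

try:
    # The encoding belongs on open(); read_csv just consumes the decoded text.
    with open(path, encoding="iso-8859-1", newline="") as handle:
        df = pd.read_csv(handle)
finally:
    os.remove(path)

print(df.loc[0, "name"])
```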

User · Answer

Sometimes the problem is with the .csv file only. The file may be corrupted. When faced with this issue, "Save As" the file as CSV again:

0. Open the xls/csv file
1. Go to -> Files
2. Click -> Save As
3. Write the file name
4. Choose 'file type' as -> CSV (very important)
5. Click -> OK

User · Answer

Try changing the encoding. In my case, encoding="utf-16" worked.

    df = pd.read_csv("file.csv", encoding='utf-16')

User · Answer

Please try adding encoding='unicode_escape'. This will help; it worked for me. Also, make sure you're using the correct delimiter and column names.

You can start by loading just 1000 rows so the file loads quickly.

User · Answer

Simplest of all solutions:

    import pandas as pd
    df = pd.read_csv('file_name.csv', engine='python')

Alternate solution:

Open the CSV file in Sublime Text or VS Code and save the file in UTF-8 format. In Sublime, click File -> Save with encoding -> UTF-8.

Then you can read your file as usual:

    import pandas as pd
    data = pd.read_csv('file_name.csv', encoding='utf-8')

Other encodings to try are:

    encoding = "cp1252"
    encoding = "ISO-8859-1"

User · Answer

In my case this worked for Python 2.7:

    data = read_csv(filename, encoding="ISO-8859-1", dtype={'name_of_column': unicode}, low_memory=False)

And for Python 3, only:

    data = read_csv(filename, encoding="ISO-8859-1", low_memory=False)

User · Answer

I am using Jupyter Notebook, and in my case it was showing the file in the wrong format. The 'encoding' option was not working, so I saved the CSV in UTF-8 format, and it works.

User · Answer

I am posting an answer to provide an updated solution and explanation as to why this problem can occur. Say you are getting this data from a database or Excel workbook. If you have special characters, like La Cañada Flintridge city, then unless you are exporting the data using UTF-8 encoding, you're going to introduce errors. La Cañada Flintridge city will become La Ca\xf1ada Flintridge city. If you are using pandas.read_csv without any adjustments to the default parameters, you'll hit the following error:

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 5: invalid continuation byte

Fortunately, there are a few solutions.

Option 1: fix the exporting. Be sure to use UTF-8 encoding.

Option 2: if fixing the exporting problem is not available to you, and you need to use pandas.read_csv, be sure to include the following parameter: engine='python'. By default, pandas uses engine='C', which is great for reading large clean files, but will crash if anything unexpected comes up. In my experience, setting encoding='utf-8' has never fixed this UnicodeDecodeError. Also, you do not need to use error_bad_lines; however, that is still an option if you REALLY need it.

    pd.read_csv(<your file>, engine='python')

Option 3: this is my preferred solution personally. Read the file using vanilla Python.

    import pandas as pd

    data = []

    with open(<your file>, "rb") as myfile:
        # read the header separately:
        # decode it as 'utf-8', remove any special characters, and split it on the comma (or delimiter)
        header = myfile.readline().decode('utf-8').replace('\r\n', '').split(',')
        # read the rest of the data
        for line in myfile:
            row = line.decode('utf-8', errors='ignore').replace('\r\n', '').split(',')
            data.append(row)

    # save the data as a dataframe
    df = pd.DataFrame(data=data, columns=header)

Hope this helps people encountering this issue for the first time.
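Splitting each line on the comma breaks as soon as a quoted field contains a comma. A possible variant of Option 3, sketched here on made-up bytes, lets the standard csv module do the splitting while still ignoring undecodable bytes; this swaps in csv.reader and io.TextIOWrapper, which the answer itself does not use.

```python
import csv
import io

# Made-up sample: a stray 0xf1 byte and a quoted field containing a comma.
raw = b'city,note\r\n"La Ca\xf1ada Flintridge","pop. 20,000"\r\n'

# Decode as UTF-8 and silently drop bytes that cannot be decoded.
text = io.TextIOWrapper(
    io.BytesIO(raw), encoding="utf-8", errors="ignore", newline=""
)
reader = csv.reader(text)
header = next(reader)
rows = list(reader)
print(header, rows)
```

The resulting `rows` and `header` can be handed to `pd.DataFrame(data=rows, columns=header)` exactly as in the original snippet. As with errors='ignore' in general, the offending character is lost rather than repaired.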

User · Answer

You can try with:

    df = pd.read_csv('file_name.csv', encoding='gbk')

User · Answer

This answer seems to be the catch-all for CSV encoding issues. If you are getting a strange encoding problem with your header like this:

    >>> f = open(filename, "r")
    >>> reader = DictReader(f)
    >>> next(reader)
    OrderedDict([('\ufeffid', '1'), ... ])

Then you have a byte order mark (BOM) character at the beginning of your CSV file. This answer addresses the issue: "Python read csv - BOM embedded into the first key".

The solution is to load the CSV with encoding="utf-8-sig":

    >>> f = open(filename, "r", encoding="utf-8-sig")
    >>> reader = DictReader(f)
    >>> next(reader)
    OrderedDict([('id', '1'), ... ])

Hopefully this helps someone.
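The BOM effect can be reproduced on a few in-memory bytes. In this sketch the sample data and the `first_row` helper are made up for the demonstration; the three bytes EF BB BF are the UTF-8 byte order mark.

```python
import csv
import io

# Made-up sample: a UTF-8 CSV that starts with a BOM (EF BB BF).
raw = b"\xef\xbb\xbfid,name\r\n1,Ana\r\n"


def first_row(encoding):
    """Decode raw with the given encoding and return the first CSV row."""
    text = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, newline="")
    return dict(next(csv.DictReader(text)))


with_bom = first_row("utf-8")         # BOM survives as part of the first key
without_bom = first_row("utf-8-sig")  # BOM is stripped by the codec
print(list(with_bom), list(without_bom))
```

The same encoding name can be passed to `pd.read_csv(..., encoding="utf-8-sig")` when the first column name comes back with an invisible leading character.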
