Python Pandas Error tokenizing data

Question

I'm trying to use pandas to manipulate a .csv file, but I get this error:

    pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 12

I have tried to read the pandas docs, but found nothing. My code is simple:

    path = 'GOOG Key Ratios.csv'
    print(open(path).read())
    data = pd.read_csv(path)

How can I resolve this? Should I use the csv module or another language? The file is from Morningstar.

User · Answer

This is definitely a delimiter issue. Many CSV files are created with sep='\t', so try reading the file with the tab character (\t) as the separator: data = pd.read_csv("File_path", sep='\t')
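A minimal sketch of the tab-separator fix, using an in-memory string in place of a real file. The sample rows and column names are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical tab-separated sample standing in for the real file.
raw = "name\tprice\tvolume\nGOOG\t1100.5\t1200\nMSFT\t250.1\t3400\n"

# With the default sep=",", each line is read as one single field.
as_comma = pd.read_csv(StringIO(raw))

# Telling pandas that the separator is a tab splits the fields correctly.
as_tab = pd.read_csv(StringIO(raw), sep="\t")

print(as_comma.shape)  # one column only
print(as_tab.shape)    # three columns
```

If the comma-separated read yields a single wide column, that is usually a sign that the real delimiter is something else.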

User · Answer

The issue could be with the file itself. In my case, the issue was solved after renaming the file. I have yet to figure out the reason.

User · Answer

Pass the delimiter as a parameter: pd.read_csv(filename, delimiter=<delimiter>, encoding='utf-8'). It will then read.

User · Answer

I've had this problem a few times myself. Almost every time, the reason is that the file I was attempting to open was not a properly saved CSV to begin with. By "properly", I mean each row had the same number of separators or columns.

Typically it happened because I had opened the CSV in Excel and then improperly saved it. Even though the file extension was still .csv, the pure CSV format had been altered.

Any file saved with pandas to_csv will be properly formatted and shouldn't have that issue. But if you open it with another program, it may change the structure.

Hope that helps.

User · Answer

The issue for me was that a new column was appended to my CSV intraday. The accepted answer's solution would not work, as every future row would be discarded if I used error_bad_lines=False.

The solution in this case was to use the usecols parameter of pd.read_csv(). This way I can specify only the columns that I need to read from the CSV, and my Python code will remain resilient to future CSV changes so long as a header row exists (and the column names do not change).

The documentation describes the parameter as follows:

usecols : list-like or callable, optional. Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). For example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame from data with element order preserved use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.

Example:

    my_columns = ['foo', 'bar', 'bob']
    df = pd.read_csv(file_path, usecols=my_columns)

Another benefit of this is that I can load far less data into memory if I am only using 3-4 columns of a CSV that has 18-20 columns.
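The usecols approach can be sketched as follows. This is an illustration on made-up data, not the answerer's actual file: the sample assumes that the last row picked up one extra field, and the exact behaviour may differ between pandas versions.

```python
from io import StringIO

import pandas as pd

# Hypothetical feed: the last data row gained an extra trailing field.
raw = "foo,bar,bob\n1,2,3\n4,5,6\n7,8,9,10\n"

# A plain read is expected to fail with a tokenizing error on the last row.
try:
    pd.read_csv(StringIO(raw))
    plain_read_failed = False
except pd.errors.ParserError:
    plain_read_failed = True

# Restricting the read to the known columns ignores the extra field.
my_columns = ["foo", "bar", "bob"]
df = pd.read_csv(StringIO(raw), usecols=my_columns)
print(df)
```

Note that the extra value is dropped silently, so this suits cases where the additional columns are genuinely not needed.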

User · Answer

I had a similar error, and the issue was that I had some escaped quotes in my CSV file and needed to set the escapechar parameter appropriately.

User · Answer

For those who are having a similar issue with Python 3 on Linux:

    pandas.errors.ParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.

Try:

    df = pd.read_csv('file.csv', encoding='utf8', engine='python')

User · Answer

I have encountered this error with a stray quotation mark. I use mapping software which puts quotation marks around text items when exporting comma-delimited files. Text which uses quote marks (e.g. ' = feet and " = inches) can be problematic. Consider this example, which notes that a 5-inch well log print is poor:

    UWI_key,Latitude,Longitude,Remark
    US42051316890000,30.4386484,-96.4330734,"poor 5""

Using 5" as shorthand for 5 inch ends up throwing a wrench in the works. Excel will simply strip off the extra quote mark, but pandas breaks down without the error_bad_lines=False argument mentioned above.

Once you know the nature of your error, it may be easiest to do a Find-Replace in a text editor (e.g. Sublime Text 3 or Notepad++) prior to import.

User · Answer

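You could also try: data = pd.read_csv('file1.csv', error_bad_lines=False). Do note that this will cause the offending lines to be skipped. In pandas 1.3 and later the equivalent argument is on_bad_lines='skip', and error_bad_lines was removed in pandas 2.0.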
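A small sketch of skipping malformed rows, assuming pandas 1.3 or later (where on_bad_lines is available). The sample data is invented, and its third line carries one field too many.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: line 3 has three fields where two are expected.
raw = "a,b\n1,2\n3,4,5\n6,7\n"

# Rows with too many fields are dropped instead of raising ParserError.
df = pd.read_csv(StringIO(raw), on_bad_lines="skip")
print(df)
```

Because the rows are dropped without further notice, it is worth comparing the resulting row count with the line count of the file.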

User · Answer

You can try: data = pd.read_csv('file1.csv', sep='\t')

User · Answer

Try: pandas.read_csv(path, sep=<separator>, header=None)

User · Answer

I had a similar case, and setting train = pd.read_csv('input.csv', encoding='latin1', engine='python') worked.

User · Answer

This is what I did. Setting sep to the separator actually used in the file solved my issue: data = pd.read_csv('C:\\Users\\HP\\Downloads\\NPL ASSINGMENT 2 imdb_labelled\\imdb_labelled.txt', engine='python', header=None, sep=<separator>)

User · Answer

Sometimes the problem is not how to use Python, but the raw data. I got this error message: "Error tokenizing data. C error: Expected 18 fields in line 72, saw 19." It turned out that the description column sometimes contained commas. This means that the CSV file needs to be cleaned up or another separator used.
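One way to locate such rows before cleaning them is to count the fields on each line with the standard csv module. The helper name find_odd_lines and the sample text are hypothetical. This is only a diagnostic sketch, not part of pandas.

```python
import csv
from collections import Counter
from io import StringIO


def find_odd_lines(handle, delimiter=","):
    """Return (typical_field_count, [(line_number, field_count), ...])."""
    reader = csv.reader(handle, delimiter=delimiter)
    counts = []
    for row in reader:
        # reader.line_num is the physical line the row ended on.
        counts.append((reader.line_num, len(row)))
    if not counts:
        return 0, []
    # The most frequent field count is taken as the expected width.
    typical = Counter(n for _, n in counts).most_common(1)[0][0]
    odd = [(line, n) for line, n in counts if n != typical]
    return typical, odd


# Hypothetical sample: the description on line 3 contains an unquoted comma.
sample = StringIO(
    "id,description,price\n"
    "1,plain text,10\n"
    "2,text, with a comma,20\n"
    "3,more text,30\n"
)
typical, odd = find_odd_lines(sample)
print(typical, odd)
```

The reported line numbers can then be inspected by hand, or the offending fields can be quoted before the file is passed to read_csv.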

User · Answer

Although not the case for this question, this error may also appear with compressed data. Explicitly setting the value of the compression kwarg resolved my problem: result = pandas.read_csv(data_source, compression='gzip')
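A sketch of the compression case, under the assumption that the file is gzip-compressed but its name does not end in .gz, so pandas cannot infer the compression from the extension. The file name data.bin is made up.

```python
import gzip
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical gzip file whose name gives no hint about the compression.
    path = os.path.join(tmp, "data.bin")
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        fh.write("a,b\n1,2\n3,4\n")

    # Stating the compression explicitly lets pandas decompress before parsing.
    result = pd.read_csv(path, compression="gzip")

print(result)
```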

User · Answer

You can avoid the problem by adding header=None: train = pd.read_csv('/home/Project/output.csv', header=None). Hope this helps.

User · Answer

The following worked for me (I posted this answer because I specifically had this problem in a Google Colaboratory notebook): df = pd.read_csv("/path/foo.csv", delimiter=<delimiter>, skiprows=0, low_memory=False)

User · Answer

I had received a .csv from a coworker, and when I tried to read it using pd.read_csv(), I received a similar error. It was apparently attempting to use the first row to generate the columns for the dataframe, but there were many rows which contained more columns than the first row would imply. I ended up fixing this problem by simply opening and re-saving the file as .csv and using pd.read_csv() again.

User · Answer

I had a dataset with pre-existing row numbers, so I used index_col: pd.read_csv('train.csv', index_col=0)

User · Answer

In my case the separator was not the default "," but Tab: pd.read_csv('file_name.csv', sep='\\t', lineterminator='\\r', engine='python', header='infer'). Note: "\t" did not work as suggested by some sources. "\\t" was required.

User · Answer

As far as I can tell, and after taking a look at your file, the problem is that the CSV file you're trying to load has multiple tables. There are empty lines, or lines that contain table titles. Try to have a look at this Stack Overflow answer. It shows how to achieve that programmatically.

Another dynamic approach would be to use the csv module, read one row at a time and make sanity checks (for example with regular expressions) to infer whether the row is a title, header, values or blank. This approach has one more advantage: you can split, append or collect your data in Python objects as desired.

The easiest of all would be to use the pandas function pd.read_clipboard() after manually selecting and copying the table to the clipboard, in case you can open the CSV in Excel or something similar.

Additionally, and irrelevant to your problem, but because no one has mentioned it: I had this same issue when loading some datasets such as seeds_dataset.txt from UCI. In my case, the error was occurring because some separators had more whitespace than a true tab \t. See line 3 in the following, for instance:

    14.38   14.21   0.8951  5.386   3.312   2.462   4.956   1
    14.69   14.49   0.8799  5.563   3.259   3.586   5.219   1
    14.11   14.1    0.8911  5.42    3.302   2.7     5       1

Therefore, use \t+ in the separator pattern instead of \t:

    data = pd.read_csv(path, sep='\t+', header=None)
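The repeated-tab case can be sketched as below. The three rows are shortened, invented stand-ins for the seeds data, with a doubled tab in the last row. Since a multi-character separator is treated as a regular expression, the sketch passes engine="python" explicitly.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: the third row has two tabs between its last fields.
raw = "14.38\t14.21\t0.8951\n14.69\t14.49\t0.8799\n14.11\t14.1\t\t0.8911\n"

# A single-tab separator sees an extra empty field on the third row.
try:
    pd.read_csv(StringIO(raw), sep="\t", header=None)
    single_tab_failed = False
except pd.errors.ParserError:
    single_tab_failed = True

# A regex separator treats any run of tabs as one delimiter.
data = pd.read_csv(StringIO(raw), sep=r"\t+", header=None, engine="python")
print(data)
```

For files that mix tabs and spaces, sep=r"\s+" is a common alternative, at the cost of also splitting on spaces inside values.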

User · Answer

It might be an issue with the delimiters in your data, or with the first row, as @TomAugspurger noted.

To solve it, try specifying the sep and/or header arguments when calling read_csv. For instance:

    df = pandas.read_csv(fileName, sep='delimiter', header=None)

In the code above, sep defines your delimiter and header=None tells pandas that your source data has no row for headers / column titles. Thus saith the docs: "If file contains no header row, then you should explicitly pass header=None". In this instance, pandas automatically creates whole-number indices for each field {0, 1, 2, ...}.

According to the docs, the delimiter thing should not be an issue. The docs say that "if sep is None [not specified], will try to automatically determine this." I however have not had good luck with this, including instances with obvious delimiters. Note that in current pandas the default is sep=',', and the separator is only detected automatically when sep=None is passed explicitly, which uses the Python parsing engine.

User · Answer

I had the same problem with read_csv: "ParserError: Error tokenizing data". I just saved the old CSV file as a new CSV file, and the problem was solved.

User · Answer

In my case, it was because the format of the first and last two lines of the CSV file differs from the middle content of the file.

So what I do is open the CSV file as a string, parse the content of the string, then use read_csv to get a dataframe.

    import io
    import pandas as pd

    # read the whole file into a string (the file is closed automatically)
    with open(f'{file_path}/{file_name}', 'r') as file:
        content = file.read()

    # change new line character from '\r\n' to '\n'
    lines = content.replace('\r', '').split('\n')

    # Remove the first and last 2 lines of the file
    # StringIO can be considered as a file stored in memory
    df = pd.read_csv(io.StringIO("\n".join(lines[2:-2])), header=None)
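As an alternative sketch of the same idea, read_csv can drop leading and trailing lines itself through skiprows and skipfooter. This is a swapped-in technique rather than the method used above; skipfooter requires engine="python", and the report text below is invented.

```python
from io import StringIO

import pandas as pd

# Hypothetical file: two title lines, three data rows, two footer lines.
raw = (
    "Report title\n"
    "Generated 2020-01-01\n"
    "1,2,3\n"
    "4,5,6\n"
    "7,8,9\n"
    "Total rows: 3\n"
    "End of report"
)

# Skip the first two and the last two lines while parsing.
df = pd.read_csv(
    StringIO(raw), skiprows=2, skipfooter=2, header=None, engine="python"
)
print(df)
```

This avoids holding a second, edited copy of the text in memory, but it only works when the number of extra lines is known in advance.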

User · Answer

I had this problem when I was trying to read in a CSV without passing in column names:

    df = pd.read_csv(filename, header=None)

I specified the column names in a list beforehand and then passed them into names, and it solved it immediately. If you don't have set column names, you could just create as many placeholder names as the maximum number of columns that might be in your data.

    col_names = ["col1", "col2", "col3", ...]
    df = pd.read_csv(filename, names=col_names)

User · Answer

I had this problem as well, but perhaps for a different reason. I had some trailing commas in my CSV that were adding an additional column that pandas was attempting to read. Using the following works, but it simply ignores the bad lines:

    data = pd.read_csv('file1.csv', error_bad_lines=False)

If you want to keep the lines, an ugly kind of hack for handling the errors is to do something like the following:

    line = []      # row numbers to skip on the next attempt
    expected = []  # field count pandas expected on each bad row
    saw = []       # field count pandas actually found
    cont = True

    while cont:
        try:
            data = pd.read_csv('file1.csv', skiprows=line)
            cont = False
        except Exception as e:
            message = str(e)  # e.message does not exist in Python 3
            errortype = message.split('.')[0].strip()
            if errortype == 'Error tokenizing data':
                cerror = message.split(':')[1].strip().replace(',', '')
                nums = [n for n in cerror.split(' ') if str.isdigit(n)]
                expected.append(int(nums[0]))
                saw.append(int(nums[2]))
                line.append(int(nums[1]) - 1)
            else:
                print('Unknown Error - 222')
                raise  # re-raise so an unexpected error cannot loop forever

    if line:
        # Handle the errors however you want
        pass

I proceeded to write a script to reinsert the lines into the DataFrame, since the bad lines will be given by the variable 'line' in the above code. This can all be avoided by simply using the csv reader. Hopefully the pandas developers can make it easier to deal with this situation in the future.
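Newer pandas versions offer a less fragile way to collect the bad rows. The sketch below assumes pandas 1.4 or later, where on_bad_lines accepts a callable when engine="python" is used. The helper name remember_bad_line and the sample data are hypothetical.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: the second data row carries one field too many.
raw = "a,b\n1,2\n3,4,extra\n5,6\n"

bad_rows = []


def remember_bad_line(fields):
    """Record the split fields of a bad row and drop it from the result."""
    bad_rows.append(fields)
    return None  # returning a list of fields instead would keep a repaired row


data = pd.read_csv(StringIO(raw), on_bad_lines=remember_bad_line, engine="python")
print(data)
print(bad_rows)
```

The collected rows can afterwards be trimmed or padded and appended to the DataFrame, which replaces the retry loop and the parsing of error messages.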

User · Answer

The following sequence of commands works (I lose the first line of the data, since no header=None is present, but at least it loads):

    df = pd.read_csv(filename,
                     usecols=range(0, 42))
    df.columns = ['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND',
                  'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS',
                  'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2',
                  'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6',
                  'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10',
                  'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14']

The following does NOT work:

    df = pd.read_csv(filename,
                     names=['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND',
                            'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS',
                            'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2',
                            'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6',
                            'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10',
                            'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14'],
                     usecols=range(0, 42))

    CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

The following does NOT work either:

    df = pd.read_csv(filename,
                     header=None)

    CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

Hence, in your problem you have to pass usecols=range(0, 2).

User · Answer

I came across the same issue. Using pd.read_table() on the same source file seemed to work. I could not trace the reason for this, but it was a useful workaround for my case. Perhaps someone more knowledgeable can shed more light on why it worked.

Edit: I found that this error creeps up when you have some text in your file that does not have the same format as the actual data. This is usually header or footer information (longer than one line, so skip_header doesn't work) which will not be separated by the same number of commas as your actual data (when using read_csv). read_table uses a tab as the delimiter, which could circumvent the user's current error but introduce others.

I usually get around this by reading the extra data into a file and then using the read_csv() method.

The exact solution might differ depending on your actual file, but this approach has worked for me in several cases.

User · Answer

Your CSV file might have a variable number of columns, and read_csv inferred the number of columns from the first few rows. Two ways to solve it in this case:

1) Change the CSV file to have a dummy first line with the maximum number of columns (and specify header=[0]).

2) Or use names=list(range(0, N)), where N is the maximum number of columns.

User · Answer

Error tokenizing data. C error: Expected 2 fields in line 3, saw 12

The error gives a clue to solve the problem: "Expected 2 fields in line 3, saw 12". It means that line 3 has 12 fields, while pandas expected 2 based on the first row.

When you have data like the one shown below, if you skip rows then most of the data will be skipped:

    data = """1,2,3
    1,2,3,4
    1,2,3,4,5
    1,2
    1,2,3,4"""

If you don't want to skip any rows, do the following:

    # First, find the maximum number of columns over all rows
    with open("file_name.csv", 'r') as temp_f:
        # get the number of columns in each line
        col_count = [len(l.split(",")) for l in temp_f.readlines()]

    # Generate column names (names will be 0, 1, 2, ..., maximum columns - 1)
    column_names = [i for i in range(max(col_count))]

    import pandas as pd
    # inside range, set the maximum value you can see in "Expected 4 fields in line 2, saw 8"
    # here it will be 8
    data = pd.read_csv("file_name.csv", header=None, names=column_names)

Use range instead of manually setting names, as that would be cumbersome when you have many columns.

Additionally, you can fill the NaN values with 0 if you need rows of even length, e.g. for clustering (k-means):

    new_data = data.fillna(0)
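The two-pass approach above can be tried end to end on the sample rows, using an in-memory string instead of file_name.csv. As a small variation, this sketch counts fields with csv.reader rather than str.split, so that quoted commas would not inflate the count.

```python
import csv
from io import StringIO

import pandas as pd

# The ragged sample from the answer, held in memory for illustration.
raw = "1,2,3\n1,2,3,4\n1,2,3,4,5\n1,2\n1,2,3,4\n"

# First pass: find the widest row.
max_cols = max(len(row) for row in csv.reader(StringIO(raw)))

# Second pass: give pandas one column name per possible field.
data = pd.read_csv(StringIO(raw), header=None, names=list(range(max_cols)))

# Optional: replace the padding NaN values with 0.
new_data = data.fillna(0)
print(new_data)
```

Every row is kept, and shorter rows are padded on the right with NaN before the fill.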

User · Answer

Use pandas.read_csv('CSVFILENAME', header=None, sep=', ') when trying to read CSV data from the link http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

I copied the data from the site into my CSV file. It had extra spaces, so I used sep=', ' and it worked.

User · Answer

I've had a similar problem while trying to read a tab-delimited table with spaces, commas and quotes:

    1115794 4218    "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", ""
    1144102 3180    "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", "g__Bacillus", ""
    368444  2328    "k__Bacteria", "p__Bacteroidetes", "c__Bacteroidia", "o__Bacteroidales", "f__Bacteroidaceae", "g__Bacteroides", ""

    import pandas as pd
    # Same error for read_table
    counts = pd.read_csv(path_counts, sep='\t', index_col=2, header=None, engine='c')

    pandas.io.common.CParserError: Error tokenizing data. C error: out of memory

This says it has something to do with the C parsing engine (which is the default one). Maybe changing to the Python one will change something:

    counts = pd.read_table(path_counts, sep='\t', index_col=2, header=None, engine='python')

    Segmentation fault (core dumped)

Now that is a different error. If we go ahead and try to remove spaces from the table, the error from the Python engine changes once again:

    1115794 4218    "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae",""
    1144102 3180    "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae","g__Bacillus",""
    368444  2328    "k__Bacteria","p__Bacteroidetes","c__Bacteroidia","o__Bacteroidales","f__Bacteroidaceae","g__Bacteroides",""

    _csv.Error: '\t' expected after '"'

And it gets clear that pandas was having problems parsing our rows. To parse a table with the Python engine I needed to remove all spaces and quotes from the table beforehand. Meanwhile the C engine kept crashing even with commas in rows.

To avoid creating a new file with replacements I did this, as my tables are small:

    from io import StringIO
    with open(path_counts) as f:
        cleaned = StringIO(f.read().replace('", ""', '').replace('"', '').replace(', ', ',').replace('\0', ''))
        counts = pd.read_table(cleaned, sep='\t', index_col=2, header=None, engine='python')

tl;dr: Change the parsing engine, and try to avoid any non-delimiting quotes, commas or spaces in your data.

User · Answer

The parser is getting confused by the header of the file. It reads the first row and infers the number of columns from that row. But the first two rows aren't representative of the actual data in the file.

Try it with data = pd.read_csv(path, skiprows=2)

User · Answer

An alternative that I have found to be useful in dealing with similar parsing errors uses the csv module to re-route data into a pandas df. For example:

    import csv
    import pandas as pd

    path = 'C:/FileLocation/'
    file = 'filename.csv'

    # once contents are available, I put them in a list
    csv_list = []
    with open(path + file, 'rt') as f:  # the file is closed automatically
        reader = csv.reader(f)
        for l in reader:
            csv_list.append(l)

    # now pandas has no problem getting into a df
    df = pd.DataFrame(csv_list)

I find the csv module to be a bit more robust to poorly formatted comma-separated files, and so have had success with this route to address issues like these.

User · Answer

I have encountered this error with a stray quotation mark. I use mapping software which puts quotation marks around text items when exporting comma-delimited files. Text which uses quote marks (e.g. ' = feet and " = inches) can be problematic when they then induce delimiter collisions. Consider this example, which notes that a 5-inch well log print is poor:

    UWI_key,Latitude,Longitude,Remark
    US42051316890000,30.4386484,-96.4330734,"poor 5""

Using 5" as shorthand for 5 inch ends up throwing a wrench in the works. Excel will simply strip off the extra quote mark, but pandas breaks down without the error_bad_lines=False argument mentioned above.

User · Answer

Most of the useful answers have already been given. However, I suggest saving pandas dataframes as Parquet files. Parquet files don't have this problem, and they are memory efficient at the same time.

User · Answer

Simple resolution: open the CSV file in Excel and save it under a different name in CSV format. Then try importing it again in Spyder. Your problem will be resolved.

User · Answer

I believe the solutions

    , engine='python'
    , error_bad_lines = False

will be good if the extra columns are dummy columns and you want to delete them. In my case, the second row really had more columns, and I wanted those columns to be integrated and to have the number of columns = MAX(columns).

Please refer to the solution below, which I could not find anywhere:

    try:
        df_data = pd.read_csv(PATH, header = bl_header, sep = str_sep)
    except pd.errors.ParserError as err:
        # the error message ends with "saw N", where N is the field count found
        str_find = 'saw '
        int_position = int(str(err).find(str_find)) + len(str_find)
        str_nbCol = str(err)[int_position:]
        l_col = range(int(str_nbCol))
        df_data = pd.read_csv(PATH, header = bl_header, sep = str_sep, names = l_col)

User · Answer

The dataset that I used had a lot of quote marks (") that were extraneous to the formatting. I was able to fix the error by including this parameter for read_csv(): quoting=3 (3 corresponds to csv.QUOTE_NONE).
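A short sketch of the quoting parameter on invented data, where a remark uses a quote mark as the inch sign. With csv.QUOTE_NONE, quote characters are treated as ordinary text and only the delimiter splits fields.

```python
import csv
from io import StringIO

import pandas as pd

# Hypothetical export: the remark on the first row contains a stray quote.
raw = 'UWI_key,Remark\nUS42051316890000,"poor 5""\nUS42051316900000,good\n'

# csv.QUOTE_NONE has the integer value 3, so quoting=3 is equivalent.
df = pd.read_csv(StringIO(raw), quoting=csv.QUOTE_NONE)
print(df)
```

The quote characters stay in the values, so they may need to be stripped afterwards, and fields that legitimately contain the delimiter inside quotes will be split.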
