Pandas read_csv low_memory and dtype options

Question

When calling

    df = pd.read_csv('somefile.csv')

I get:

    /Users/josh/anaconda/envs/py27/lib/python2.7/site-packages/pandas/io/parsers.py:1130: DtypeWarning: Columns (4,5,7,16) have mixed types. Specify dtype option on import or set low_memory=False.

Why is the dtype option related to low_memory, and why would making it False help with this problem?

User · Answer

Try:

    dashboard_df = pd.read_csv(p_file, sep=',', error_bad_lines=False, index_col=False, dtype='unicode')

According to the pandas documentation:

    dtype : Type name or dict of column -> type

As for low_memory, it's True by default and isn't yet documented. I don't think it's relevant though. The error message is generic, so you shouldn't need to mess with low_memory anyway. Hope this helps, and let me know if you have further problems.

User · Answer

As mentioned earlier by firelynx, if dtype is explicitly specified and there is mixed data that is not compatible with that dtype, then loading will crash. As a workaround, I used a converter like this to replace values of an incompatible type, so that the data could still be loaded:

    import numpy as np
    import pandas as pd

    def conv(val):
        # Empty cells become 0.
        if not val:
            return 0
        # Anything that cannot be parsed as a float also becomes 0.
        try:
            return np.float64(val)
        except (TypeError, ValueError):
            return np.float64(0)

    df = pd.read_csv(csv_file, converters={'COL_A': conv, 'COL_B': conv})
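As a hedged alternative to a per-value converter, the same clean-up can be sketched by reading the problem column as strings and converting it afterwards with pd.to_numeric. This is not what the answer above does; the sample data and the column name COL_A below are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data: COL_A contains one non-numeric value and one blank.
csv_text = "COL_A,COL_B\n1.5,x\nfoobar,y\n,z\n"

# Read the problem column as plain strings first, so nothing can fail to parse.
df = pd.read_csv(StringIO(csv_text), dtype={"COL_A": str})

# Convert afterwards: values that cannot be parsed become NaN, then 0.0.
df["COL_A"] = pd.to_numeric(df["COL_A"], errors="coerce").fillna(0.0)

print(df["COL_A"].tolist())
```

The conversion runs on the whole column at once rather than calling a Python function per cell, which is usually cheaper on large files.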

User · Answer

I was facing a similar issue when processing a huge csv file (6 million rows). I had three issues:

1. The file contained strange characters (fixed using encoding).
2. The datatype was not specified (fixed using the dtype property).
3. Even with the above, I still faced an issue with the file format, which could not always be derived from the filename (fixed using try .. except ..).

    from pathlib import Path

    import pandas as pd

    df = pd.read_csv(csv_file, sep=';', encoding='ISO-8859-1',
                     names=['permission', 'owner_name', 'group_name', 'size', 'ctime',
                            'mtime', 'atime', 'filename', 'full_filename'],
                     dtype={'permission': str, 'owner_name': str, 'group_name': str,
                            'size': str, 'ctime': object, 'mtime': object, 'atime': object,
                            'filename': str, 'full_filename': str,
                            'first_date': object, 'last_date': object})

    # Derive the file extension (without the leading dot) from each filename.
    try:
        df['file_format'] = [Path(f).suffix[1:] for f in df.filename.tolist()]
    except Exception:
        df['file_format'] = ''

User · Answer

According to the pandas documentation, specifying low_memory=False as long as engine='c' (which is the default) is a reasonable solution to this problem.

If low_memory=False, then whole columns will be read in first, and then the proper types determined. For example, the column will be kept as objects (strings) as needed to preserve information.

If low_memory=True (the default), then pandas reads in the data in chunks of rows, then appends them together. Some of the columns might then look like chunks of integers and strings mixed up, depending on whether pandas encountered anything in a given chunk that couldn't be cast to integer (say). This could cause problems later. The warning is telling you that this happened at least once in the read-in, so you should be careful. Setting low_memory=False will use more memory but will avoid the problem.

Personally, I think low_memory=True is a bad default, but I work in an area that uses many more small datasets than large ones, so convenience is more important than efficiency.

The following code illustrates an example where low_memory=True is set and a column comes in with mixed types. It builds off the answer by @firelynx.

    import pandas as pd
    try:
        from StringIO import StringIO
    except ImportError:
        from io import StringIO

    # make a big csv data file, following earlier approach by @firelynx
    csvdata = """1,Alice
    2,Bob
    3,Caesar
    """

    # we have to replicate the "integer column" user_id many many times to get
    # pd.read_csv to actually chunk read. otherwise it just reads
    # the whole thing in one chunk, because it's faster, and we don't get any
    # "mixed dtype" issue. the 100000 below was chosen by experimentation.
    csvdatafull = ""
    for i in range(100000):
        csvdatafull = csvdatafull + csvdata
    csvdatafull = csvdatafull + "foobar,Cthlulu\n"
    csvdatafull = "user_id,username\n" + csvdatafull

    sio = StringIO(csvdatafull)
    # the following line gives me the warning:
    #   C:\Users\rdisa\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3072: DtypeWarning: Columns (0) have mixed types.Specify dtype option on import or set low_memory=False.
    #     interactivity=interactivity, compiler=compiler, result=result)
    # but it does not always give me the warning, so i guess the internal workings of read_csv depend on background factors
    x = pd.read_csv(sio, low_memory=True)  # , dtype={"user_id": int, "username": "string"})

    x.dtypes
    # this gives:
    # Out[69]:
    # user_id     object
    # username    object
    # dtype: object

    type(x['user_id'].iloc[0])       # int
    type(x['user_id'].iloc[1])       # int
    type(x['user_id'].iloc[2])       # int
    type(x['user_id'].iloc[10000])   # int
    type(x['user_id'].iloc[299999])  # str !!!! (even though it's a number! so this chunk must have been read in as strings)
    type(x['user_id'].iloc[300000])  # str !!!!!

Aside: To give an example where this is a problem (and where I first encountered this as a serious issue), imagine you ran pd.read_csv() on a file and then wanted to drop duplicates based on an identifier. Say the identifier is sometimes numeric, sometimes string. One row might be "81287", another might be "97324-32". Still, they are unique identifiers.

With low_memory=True, pandas might read in the identifier column like this:

    81287
    81287
    81287
    81287
    81287
    "81287"
    "81287"
    "81287"
    "81287"
    "97324-32"
    "97324-32"
    "97324-32"
    "97324-32"
    "97324-32"

Just because it chunks things, the identifier 81287 is sometimes a number and sometimes a string. When I try to drop duplicates based on this, well:

    81287 == "81287"
    Out[98]: False
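The duplicate problem from the aside can be sketched without relying on chunking. The first frame below builds the mixed column by hand to mimic what a chunked read might produce; the second reads a made-up CSV with dtype=str for the identifier. The column name ident and the data are assumptions for illustration only.

```python
from io import StringIO

import pandas as pd

# Mimic what a chunked read can produce: the same ID stored as int and as str.
mixed = pd.DataFrame({"ident": [81287, 81287, "81287", "97324-32", "97324-32"]})
n_mixed = len(mixed.drop_duplicates("ident"))

# Hypothetical file contents; dtype=str keeps every identifier as text.
csv_text = "ident\n81287\n81287\n81287\n97324-32\n97324-32\n"
clean = pd.read_csv(StringIO(csv_text), dtype={"ident": str})
n_clean = len(clean.drop_duplicates("ident"))

print(n_mixed, n_clean)
```

In the mixed frame the integer 81287 and the string "81287" count as different values, so one extra row survives. Reading the column as str makes the two files' worth of identifiers comparable.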

User · Answer

I had a similar issue with a ~400MB file. Setting low_memory=False did the trick for me. Do the simple things first: I would check that your dataframe isn't bigger than your system memory, reboot, and clear the RAM before proceeding. If you're still running into errors, it's worth making sure your .csv file is OK. Take a quick look in Excel and make sure there's no obvious corruption. Broken original data can wreak havoc.
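One way to check how much memory a loaded DataFrame actually takes is DataFrame.memory_usage. The frame below is a small made-up stand-in for real data, so the numbers only illustrate the call, not any particular file.

```python
import pandas as pd

# Hypothetical stand-in for a DataFrame loaded from a CSV file.
df = pd.DataFrame({"user_id": range(1000), "username": ["user"] * 1000})

# deep=True also counts the Python string objects held in object columns.
per_column = df.memory_usage(deep=True)
total_mib = per_column.sum() / 1024 ** 2

print(per_column)
print(f"total: {total_mib:.2f} MiB")
```

Comparing the total with the machine's free memory gives a rough idea of whether the whole file can be held at once.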

User · Answer

    df = pd.read_csv('somefile.csv', low_memory=False)

This should solve the issue. I got exactly the same error when reading 1.8M rows from a CSV.

User · Answer

Sometimes, when all else fails, you just want to tell pandas to shut up about it:

    import warnings

    # Ignore DtypeWarnings from pandas' read_csv
    warnings.filterwarnings('ignore', message="^Columns.*")
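A narrower variant, sketched here as an assumption rather than as part of the answer above, silences only pandas' DtypeWarning and only for a single call. The helper name read_csv_quietly is made up; pd.errors.DtypeWarning is the warning class pandas raises for mixed types.

```python
import warnings
from io import StringIO

import pandas as pd


def read_csv_quietly(source, **kwargs):
    """Call pd.read_csv with DtypeWarning suppressed for this call only."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=pd.errors.DtypeWarning)
        return pd.read_csv(source, **kwargs)


# Hypothetical in-memory data standing in for a real file.
df = read_csv_quietly(StringIO("user_id,username\n1,Alice\nfoobar,Bob\n"))
print(df)
```

Because the filter lives inside catch_warnings, other warnings and other calls elsewhere in the program are unaffected. The mixed types are still in the data; only the message is hidden.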

User · Answer

It worked for me with low_memory=False while importing a DataFrame. That is the only change I needed:

    df = pd.read_csv('export4_16.csv', low_memory=False)

User · Answer

As the error says, you should specify the datatypes when using the read_csv() method. So, you should write:

    file = pd.read_csv('example.csv', dtype='unicode')

User · Answer

The deprecated low_memory option

The low_memory option is not properly deprecated, but it should be, since it does not actually do anything differently.

The reason you get this low_memory warning is that guessing dtypes for each column is very memory demanding. Pandas tries to determine what dtype to set by analyzing the data in each column.

Dtype guessing (very bad)

Pandas can only determine what dtype a column should have once the whole file is read. This means nothing can really be parsed before the whole file is read, unless you risk having to change the dtype of that column when you read the last value.

Consider the example of one file which has a column called user_id. It contains 10 million rows where the user_id is always numbers. Since pandas cannot know it is only numbers, it will probably keep it as the original strings until it has read the whole file.

Specifying dtypes (should always be done)

Adding

    dtype={'user_id': int}

to the pd.read_csv() call will let pandas know, when it starts reading the file, that this column contains only integers.

Also worth noting is that if the last line in the file had "foobar" written in the user_id column, the loading would crash if the above dtype was specified.

Example of broken data that breaks when dtypes are defined

    import pandas as pd
    try:
        from StringIO import StringIO
    except ImportError:
        from io import StringIO

    csvdata = """user_id,username
    1,Alice
    3,Bob
    foobar,Caesar"""
    sio = StringIO(csvdata)
    pd.read_csv(sio, dtype={"user_id": int, "username": "string"})

    ValueError: invalid literal for long() with base 10: 'foobar'

dtypes are typically a numpy thing; read more about them here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html

What dtypes exist?

We have access to numpy dtypes: float, int, bool, timedelta64[ns] and datetime64[ns]. Note that the numpy date/time dtypes are not time zone aware.

Pandas extends this set of dtypes with its own:

- 'datetime64[ns, <tz>]', which is a time zone aware timestamp.
- 'category', which is essentially an enum (strings represented by integer keys to save space).
- 'period[]'. Not to be confused with a timedelta, these objects are actually anchored to specific time periods.
- 'Sparse', 'Sparse[int]', 'Sparse[float]' are for sparse data, or "data that has a lot of holes in it". Instead of saving the NaN or None in the dataframe it omits the objects, saving space.
- 'Interval' is a topic of its own, but its main use is for indexing.
- 'Int8', 'Int16', 'Int32', 'Int64', 'UInt8', 'UInt16', 'UInt32', 'UInt64' are all pandas-specific integers that are nullable, unlike the numpy variant.
- 'string' is a specific dtype for working with string data and gives access to the .str attribute on the series.
- 'boolean' is like the numpy 'bool' but it also supports missing data.

For the complete list, see the pandas dtype reference.

Gotchas, caveats, notes

Setting dtype=object will silence the above warning, but will not make it more memory efficient, only process efficient if anything.

Setting dtype=unicode will not do anything, since to numpy, a unicode is represented as object.

Usage of converters

@sparrow correctly points out the usage of converters to avoid pandas blowing up when encountering 'foobar' in a column specified as int. I would like to add that converters are really heavy and inefficient to use in pandas and should be used as a last resort. This is because the read_csv process is a single process.

CSV files can be processed line by line, and thus can be processed by multiple converters in parallel more efficiently by simply cutting the file into segments and running multiple processes, something that pandas does not support. But this is a different story.
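A minimal sketch of passing some of the pandas-specific dtypes listed above to read_csv. The column names and data are made up for illustration, and the nullable 'Int64' dtype is assumed to be available in the installed pandas version.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: one user_id is missing, and country repeats a lot.
csv_text = "user_id,username,country\n1,Alice,SE\n,Bob,SE\n3,Caesar,IT\n"

df = pd.read_csv(
    StringIO(csv_text),
    dtype={"user_id": "Int64", "username": "string", "country": "category"},
)

print(df.dtypes)
print(df["user_id"].isna().tolist())
```

With the numpy int dtype the blank user_id would make the read fail, whereas the nullable 'Int64' keeps the column as integers and marks the gap as missing.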
