Reading a huge csv file

Question

I m currently trying to read data from  csv files in Python 2 7 with up to 1 million rows  and 200 columns  files range from 100mb to 1 6gb   I can do this  very slowly  for the files with under 300 000 rows  but once I go above that I get memory errors  My code looks like this   def getdata filename  criteria       data        for criterion in criteria          data append getstuff filename  criteron       return data  def getstuff filename  criterion       import csv     data        with open filename   rb   as csvfile          datareader csv reader csvfile          for row in datareader               if row 3    column header                   data append row              elif len data  lt 2 and row 3   criterion                  pass             elif row 3   criterion                  data append row              else                  return data   The reason for the else clause in the getstuff function is that all the elements which fit the criterion will be listed together in the csv file  so I leave the loop when I get past them to save time   My questions are    How can I manage to get this to work with the bigger files  Is there any way I can make it faster    My computer has 8gb RAM  running 64bit Windows 7  and the processor is 3 40 GHz  not certain what information you need

User · Answer

here s another solution for Python3   import csv with open filename   r   as csvfile      datareader   csv reader csvfile      count   0     for row in datareader          if row 3  in   column header   criterion               doSomething row              count    1         elif count  gt  2              break   here datareader is a generator function

User · Answer

If you are using pandas and have lots of RAM  enough to read the whole file into memory  try using pd read csv with low memory False  e g    import pandas as pd data   pd read csv  file csv   low memory False

User · Answer

I do a fair amount of vibration analysis and look at large data sets  tens and hundreds of millions of points    My testing showed the pandas read csv   function to be 20 times faster than numpy genfromtxt     And the genfromtxt   function is 3 times faster than the numpy loadtxt     It seems that you need pandas for large data sets   I posted the code and data sets I used in this testing on a blog discussing MATLAB vs Python for vibration analysis

User · Answer

For someone who lands to this question  Using pandas with    chunksize    and    usecols    helped me to read a huge zip file faster than the other proposed options   import pandas as pd  sample cols to keep    col 1    col 2    col 3    col 4   col 5      First setup dataframe iterator     usecols    parameter filters the columns  and  chunksize  sets the number of rows per chunk in the csv   you can change these parameters as you wish  df iter   pd read csv     data huge csv file csv gz   compression  gzip   chunksize 20000  usecols sample cols to keep      this list will store the filtered dataframes for later concatenation  df lst          Iterate over the file based on the criteria and append to the list for df  in df iter           tmp df    df  rename columns  col  col lower   for col in df  columns     filter eg  rows where  col 1  value grater than one                                    pipe lambda x   x x col 1  gt  0             df lst     tmp df copy        And finally combine filtered df lst into the final lareger output say  df final  dataframe  df final   pd concat df lst

User · Answer

Although Martijin s answer is prob best  Here is a more intuitive way to process large csv files for beginners  This allows you to process groups of rows  or chunks  at a time    import pandas as pd chunksize   10    8 for chunk in pd read csv filename  chunksize chunksize       process chunk

User · Answer

what worked for me was and is superfast is  import pandas as pd import dask dataframe as dd import time t time clock   df train   dd read csv     data train csv   usecols  col1  col2   df train df train compute   print  load train      time clock  -t    Another working solution is   import pandas as pd  from tqdm import tqdm  PATH       data train csv  chunksize   500000  traintypes      col1   category    col2   str    cols   list traintypes keys     df list        list to hold the batch dataframe  for df chunk in tqdm pd read csv PATH  usecols cols  dtype traintypes  chunksize chunksize          Can process each chunk of dataframe here       clean data    feature engineer   fit          Alternatively  append the chunk to list and merge all     df list append df chunk      Merge all dataframes into one dataframe X   pd concat df list     Delete the dataframe list to release memory del df list del df chunk

User · Answer

You are reading all rows into a list  then processing that list  Don t do that   Process your rows as you produce them  If you need to filter the data first  use a generator function   import csv  def getstuff filename  criterion       with open filename   rb   as csvfile          datareader   csv reader csvfile          yield next datareader     yield the header row         count   0         for row in datareader              if row 3     criterion                  yield row                 count    1             elif count                    done when having read a consecutive series of rows                  return   I also simplified your filter test  the logic is the same but more concise   Because you are only matching a single sequence of rows matching the criterion  you could also use   import csv from itertools import dropwhile  takewhile  def getstuff filename  criterion       with open filename   rb   as csvfile          datareader   csv reader csvfile          yield next datareader     yield the header row           first row  plus any subsequent rows that match  then stop           reading altogether           Python 2  use  for row in takewhile       yield row  instead           instead of  yield from takewhile                yield from takewhile              lambda r  r 3     criterion              dropwhile lambda r  r 3     criterion  datareader           return   You can now loop over getstuff   directly  Do the same in getdata     def getdata filename  criteria       for criterion in criteria          for row in getstuff filename  criterion               yield row   Now loop directly over getdata   in your code   for row in getdata somefilename  sequence of criteria         process row   You now only hold one row in memory  instead of your thousands of lines per criterion   yield makes a function a generator function  which means it won t do any work until you start looping over it

[python] Reading a huge .csv file

Examples related to python

Examples related to python-2.7

Examples related to file

Examples related to csv