[python] Reading a huge .csv file

I'm currently trying to read data from .csv files in Python 2.7 with up to 1 million rows, and 200 columns (files range from 100mb to 1.6gb). I can do this (very slowly) for the files with under 300,000 rows, but once I go above that I get memory errors. My code looks like this:

def getdata(filename, criteria):
    data=[]
    for criterion in criteria:
        data.append(getstuff(filename, criteron))
    return data

def getstuff(filename, criterion):
    import csv
    data=[]
    with open(filename, "rb") as csvfile:
        datareader=csv.reader(csvfile)
        for row in datareader: 
            if row[3]=="column header":
                data.append(row)
            elif len(data)<2 and row[3]!=criterion:
                pass
            elif row[3]==criterion:
                data.append(row)
            else:
                return data

The reason for the else clause in the getstuff function is that all the elements which fit the criterion will be listed together in the csv file, so I leave the loop when I get past them to save time.

My questions are:

  1. How can I manage to get this to work with the bigger files?

  2. Is there any way I can make it faster?

My computer has 8gb RAM, running 64bit Windows 7, and the processor is 3.40 GHz (not certain what information you need).

This question is related to python python-2.7 file csv

The answer is


here's another solution for Python3:

import csv
with open(filename, "r") as csvfile:
    datareader = csv.reader(csvfile)
    count = 0
    for row in datareader:
        if row[3] in ("column header", criterion):
            doSomething(row)
            count += 1
        elif count > 2:
            break

here datareader is a generator function.


Although Martijin's answer is prob best. Here is a more intuitive way to process large csv files for beginners. This allows you to process groups of rows, or chunks, at a time.

import pandas as pd
chunksize = 10 ** 8
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)

I do a fair amount of vibration analysis and look at large data sets (tens and hundreds of millions of points). My testing showed the pandas.read_csv() function to be 20 times faster than numpy.genfromtxt(). And the genfromtxt() function is 3 times faster than the numpy.loadtxt(). It seems that you need pandas for large data sets.

I posted the code and data sets I used in this testing on a blog discussing MATLAB vs Python for vibration analysis.


For someone who lands to this question. Using pandas with ‘chunksize’ and ‘usecols’ helped me to read a huge zip file faster than the other proposed options.

import pandas as pd

sample_cols_to_keep =['col_1', 'col_2', 'col_3', 'col_4','col_5']

# First setup dataframe iterator, ‘usecols’ parameter filters the columns, and 'chunksize' sets the number of rows per chunk in the csv. (you can change these parameters as you wish)
df_iter = pd.read_csv('../data/huge_csv_file.csv.gz', compression='gzip', chunksize=20000, usecols=sample_cols_to_keep) 

# this list will store the filtered dataframes for later concatenation 
df_lst = [] 

# Iterate over the file based on the criteria and append to the list
for df_ in df_iter: 
        tmp_df = (df_.rename(columns={col: col.lower() for col in df_.columns}) # filter eg. rows where 'col_1' value grater than one
                                  .pipe(lambda x:  x[x.col_1 > 0] ))
        df_lst += [tmp_df.copy()] 

# And finally combine filtered df_lst into the final lareger output say 'df_final' dataframe 
df_final = pd.concat(df_lst)

If you are using pandas and have lots of RAM (enough to read the whole file into memory) try using pd.read_csv with low_memory=False, e.g.:

import pandas as pd
data = pd.read_csv('file.csv', low_memory=False)

what worked for me was and is superfast is

import pandas as pd
import dask.dataframe as dd
import time
t=time.clock()
df_train = dd.read_csv('../data/train.csv', usecols=[col1, col2])
df_train=df_train.compute()
print("load train: " , time.clock()-t)

Another working solution is:

import pandas as pd 
from tqdm import tqdm

PATH = '../data/train.csv'
chunksize = 500000 
traintypes = {
'col1':'category',
'col2':'str'}

cols = list(traintypes.keys())

df_list = [] # list to hold the batch dataframe

for df_chunk in tqdm(pd.read_csv(PATH, usecols=cols, dtype=traintypes, chunksize=chunksize)):
    # Can process each chunk of dataframe here
    # clean_data(), feature_engineer(),fit()

    # Alternatively, append the chunk to list and merge all
    df_list.append(df_chunk) 

# Merge all dataframes into one dataframe
X = pd.concat(df_list)

# Delete the dataframe list to release memory
del df_list
del df_chunk

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to python-2.7

Numpy, multiply array with scalar Not able to install Python packages [SSL: TLSV1_ALERT_PROTOCOL_VERSION] How to create a new text file using Python Could not find a version that satisfies the requirement tensorflow Python: Pandas pd.read_excel giving ImportError: Install xlrd >= 0.9.0 for Excel support Display/Print one column from a DataFrame of Series in Pandas How to calculate 1st and 3rd quartiles? How can I read pdf in python? How to completely uninstall python 2.7.13 on Ubuntu 16.04 Check key exist in python dict

Examples related to file

Gradle - Move a folder from ABC to XYZ Difference between opening a file in binary vs text Angular: How to download a file from HttpClient? Python error message io.UnsupportedOperation: not readable java.io.FileNotFoundException: class path resource cannot be opened because it does not exist Writing JSON object to a JSON file with fs.writeFileSync How to read/write files in .Net Core? How to write to a CSV line by line? Writing a dictionary to a text file? What are the pros and cons of parquet format compared to other formats?

Examples related to csv

Pandas: ValueError: cannot convert float NaN to integer Export result set on Dbeaver to CSV Convert txt to csv python script How to import an Excel file into SQL Server? "CSV file does not exist" for a filename with embedded quotes Save Dataframe to csv directly to s3 Python Data-frame Object has no Attribute (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape How to write to a CSV line by line? How to check encoding of a CSV file