[python] How do I read CSV data into a record array in NumPy?

I wonder if there is a direct way to import the contents of a CSV file into a record array, much in the way that R's read.table(), read.delim(), and read.csv() family imports data to R's data frame?

Or is the best way to use csv.reader() and then apply something like numpy.core.records.fromrecords()?

This question is related to python numpy scipy genfromtxt

The answer is


In [329]: %time my_data = genfromtxt('one.csv', delimiter=',')
CPU times: user 19.8 s, sys: 4.58 s, total: 24.4 s
Wall time: 24.4 s

In [330]: %time df = pd.read_csv("one.csv", skiprows=20)
CPU times: user 1.06 s, sys: 312 ms, total: 1.38 s
Wall time: 1.38 s

This work as a charm...

import csv
with open("data.csv", 'r') as f:
    data = list(csv.reader(f, delimiter=";"))

import numpy as np
data = np.array(data, dtype=np.float)

This is the easiest way:

import csv
with open('testfile.csv', newline='') as csvfile:
    data = list(csv.reader(csvfile))

Now each entry in data is a record, represented as an array. So you have a 2D array. It saved me so much time.


As I tried both ways using NumPy and Pandas, using pandas has a lot of advantages:

  • Faster
  • Less CPU usage
  • 1/3 RAM usage compared to NumPy genfromtxt

This is my test code:

$ for f in test_pandas.py test_numpy_csv.py ; do  /usr/bin/time python $f; done
2.94user 0.41system 0:03.05elapsed 109%CPU (0avgtext+0avgdata 502068maxresident)k
0inputs+24outputs (0major+107147minor)pagefaults 0swaps

23.29user 0.72system 0:23.72elapsed 101%CPU (0avgtext+0avgdata 1680888maxresident)k
0inputs+0outputs (0major+416145minor)pagefaults 0swaps

test_numpy_csv.py

from numpy import genfromtxt
train = genfromtxt('/home/hvn/me/notebook/train.csv', delimiter=',')

test_pandas.py

from pandas import read_csv
df = read_csv('/home/hvn/me/notebook/train.csv')

Data file:

du -h ~/me/notebook/train.csv
 59M    /home/hvn/me/notebook/train.csv

With NumPy and pandas at versions:

$ pip freeze | egrep -i 'pandas|numpy'
numpy==1.13.3
pandas==0.20.2

You can also try recfromcsv() which can guess data types and return a properly formatted record array.


Using numpy.loadtxt

A quite simple method. But it requires all the elements being float (int and so on)

import numpy as np 
data = np.loadtxt('c:\\1.csv',delimiter=',',skiprows=0)  

I would suggest using tables (pip3 install tables). You can save your .csv file to .h5 using pandas (pip3 install pandas),

import pandas as pd
data = pd.read_csv("dataset.csv")
store = pd.HDFStore('dataset.h5')
store['mydata'] = data
store.close()

You can then easily, and with less time even for huge amount of data, load your data in a NumPy array.

import pandas as pd
store = pd.HDFStore('dataset.h5')
data = store['mydata']
store.close()

# Data in NumPy format
data = data.values

I timed the

from numpy import genfromtxt
genfromtxt(fname = dest_file, dtype = (<whatever options>))

versus

import csv
import numpy as np
with open(dest_file,'r') as dest_f:
    data_iter = csv.reader(dest_f,
                           delimiter = delimiter,
                           quotechar = '"')
    data = [data for data in data_iter]
data_array = np.asarray(data, dtype = <whatever options>)

on 4.6 million rows with about 70 columns and found that the NumPy path took 2 min 16 secs and the csv-list comprehension method took 13 seconds.

I would recommend the csv-list comprehension method as it is most likely relies on pre-compiled libraries and not the interpreter as much as NumPy. I suspect the pandas method would have similar interpreter overhead.


I tried this:

import pandas as p
import numpy as n

closingValue = p.read_csv("<FILENAME>", usecols=[4], dtype=float)
print(closingValue)

You can use this code to send CSV file data into an array:

import numpy as np
csv = np.genfromtxt('test.csv', delimiter=",")
print(csv)

I would recommend the read_csv function from the pandas library:

import pandas as pd
df=pd.read_csv('myfile.csv', sep=',',header=None)
df.values
array([[ 1. ,  2. ,  3. ],
       [ 4. ,  5.5,  6. ]])

This gives a pandas DataFrame - allowing many useful data manipulation functions which are not directly available with numpy record arrays.

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table...


I would also recommend genfromtxt. However, since the question asks for a record array, as opposed to a normal array, the dtype=None parameter needs to be added to the genfromtxt call:

Given an input file, myfile.csv:

1.0, 2, 3
4, 5.5, 6

import numpy as np
np.genfromtxt('myfile.csv',delimiter=',')

gives an array:

array([[ 1. ,  2. ,  3. ],
       [ 4. ,  5.5,  6. ]])

and

np.genfromtxt('myfile.csv',delimiter=',',dtype=None)

gives a record array:

array([(1.0, 2.0, 3), (4.0, 5.5, 6)], 
      dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<i4')])

This has the advantage that file with multiple data types (including strings) can be easily imported.


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to numpy

Unable to allocate array with shape and data type How to fix 'Object arrays cannot be loaded when allow_pickle=False' for imdb.load_data() function? Numpy, multiply array with scalar TypeError: only integer scalar arrays can be converted to a scalar index with 1D numpy indices array Could not install packages due to a "Environment error :[error 13]: permission denied : 'usr/local/bin/f2py'" Pytorch tensor to numpy array Numpy Resize/Rescale Image what does numpy ndarray shape do? How to round a numpy array? numpy array TypeError: only integer scalar arrays can be converted to a scalar index

Examples related to scipy

Reading images in python Numpy Resize/Rescale Image How to get the indices list of all NaN value in numpy array? ImportError: cannot import name NUMPY_MKL numpy.where() detailed, step-by-step explanation / examples Scikit-learn train_test_split with indices Matplotlib: Specify format of floats for tick labels Installing NumPy and SciPy on 64-bit Windows (with Pip) Can't install Scipy through pip Plotting a fast Fourier transform in Python

Examples related to genfromtxt

How do I read CSV data into a record array in NumPy?