How to store a dataframe using Pandas

Question

Right now I'm importing a fairly large CSV as a dataframe every time I run the script. Is there a good solution for keeping that dataframe constantly available in between runs so I don't have to spend all that time waiting for the script to run?

User · Answer

Pandas DataFrames have the to_pickle function, which is useful for saving a DataFrame:

    import pandas as pd

    a = pd.DataFrame({'A': [0, 1, 0, 1, 0], 'B': [True, True, False, False, False]})
    print(a)
    #    A      B
    # 0  0   True
    # 1  1   True
    # 2  0  False
    # 3  1  False
    # 4  0  False

    a.to_pickle('my_file.pkl')

    b = pd.read_pickle('my_file.pkl')
    print(b)
    #    A      B
    # 0  0   True
    # 1  1   True
    # 2  0  False
    # 3  1  False
    # 4  0  False

User · Answer

If I understand correctly, you're already using pandas.read_csv() but would like to speed up the development process so that you don't have to load the file in every time you edit your script, is that right? I have a few recommendations:

- you could load in only part of the CSV file using pandas.read_csv(..., nrows=1000) to only load the top bit of the table while you're doing the development
- use ipython for an interactive session, such that you keep the pandas table in memory as you edit and reload your script
- convert the csv to an HDF5 table
- updated: use DataFrame.to_feather() and pd.read_feather() to store data in the R-compatible feather binary format that is super fast (in my hands, slightly faster than pandas.to_pickle() on numeric data and much faster on string data)

You might also be interested in this answer on stackoverflow.
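The first recommendation and the general "don't re-parse the CSV on every run" idea can be sketched roughly as follows. This is only a minimal illustration, not part of the answer above: the helper name `load_cached`, the file names, and the sample data are all hypothetical, and the cache is simply a pickle file that is reused as long as it is at least as new as the CSV.

```python
import os
import tempfile

import pandas as pd


def load_cached(csv_path, cache_path, **read_csv_kwargs):
    """Return the DataFrame from cache_path if it is fresh, else parse the CSV.

    Hypothetical helper: the cache counts as fresh when it exists and its
    modification time is not older than that of the CSV file.
    """
    cache_is_fresh = (
        os.path.exists(cache_path)
        and os.path.getmtime(cache_path) >= os.path.getmtime(csv_path)
    )
    if cache_is_fresh:
        return pd.read_pickle(cache_path)
    df = pd.read_csv(csv_path, **read_csv_kwargs)
    df.to_pickle(cache_path)
    return df


# Demo on a small throwaway CSV (made-up data standing in for the large file).
workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "big.csv")
cache_path = os.path.join(workdir, "big.pkl")
pd.DataFrame({"A": range(5000), "B": ["x"] * 5000}).to_csv(csv_path, index=False)

preview = pd.read_csv(csv_path, nrows=1000)  # only the first 1000 rows
first = load_cached(csv_path, cache_path)    # parses the CSV and writes the cache
second = load_cached(csv_path, cache_path)   # served from the pickle cache
```

Comparing modification times is just one possible freshness rule; a script whose CSV never changes could skip the check and test only for the existence of the cache file.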

User · Answer

Pickle works well. Given an existing dataframe df:

    import pandas as pd

    df.to_pickle('123.pkl')           # to save the dataframe, df, to 123.pkl
    df1 = pd.read_pickle('123.pkl')   # to load 123.pkl back to the dataframe df

User · Answer

pyarrow compatibility across versions

The overall move has been to pyarrow/feather (deprecation warnings from pandas/msgpack). However, I have a challenge with pyarrow because its specification is transient: data serialized with pyarrow 0.15.1 cannot be deserialized with 0.16.0 (ARROW-7961). I'm using serialization with redis, so I have to use a binary encoding.

I've retested various options (using jupyter notebook):

    import sys, pickle, zlib, warnings, io
    import pyarrow as pa

    class foocls:
        def pyarrow(out): return pa.serialize(out).to_buffer().to_pybytes()
        def msgpack(out): return out.to_msgpack()
        def pickle(out): return pickle.dumps(out)
        def feather(out): return out.to_feather(io.BytesIO())
        def parquet(out): return out.to_parquet(io.BytesIO())

    warnings.filterwarnings("ignore")
    for c in foocls.__dict__.values():
        sbreak = True
        try:
            c(out)
            print(c.__name__, "before serialization", sys.getsizeof(out))
            print(c.__name__, sys.getsizeof(c(out)))
            %timeit -n 50 c(out)
            print(c.__name__, "zlib", sys.getsizeof(zlib.compress(c(out))))
            %timeit -n 50 zlib.compress(c(out))
        except TypeError as e:
            # non-callable class attributes (e.g. __module__) end up here
            if "not callable" in str(e): sbreak = False
            else: raise
        except (ValueError) as e:
            print(c.__name__, "ERROR", e)
        finally:
            if sbreak: print("=+=" * 30)
    warnings.filterwarnings("default")

With the following results for my data frame (in the out jupyter variable):

    pyarrow before serialization 533366
    pyarrow 120805
    1.03 ms ± 43.9 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
    pyarrow zlib 20517
    2.78 ms ± 81.8 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
    =+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
    msgpack before serialization 533366
    msgpack 109039
    1.74 ms ± 72.8 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
    msgpack zlib 16639
    3.05 ms ± 71.7 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
    =+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
    pickle before serialization 533366
    pickle 142121
    733 µs ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
    pickle zlib 29477
    3.81 ms ± 60.4 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
    =+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
    feather ERROR feather does not support serializing a non-default index for the index; you can .reset_index() to make the index into column(s)
    =+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
    parquet ERROR Nested column branch had multiple children: struct<x: double, y: double>
    =+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=

feather and parquet do not work for my data frame. I'm going to continue using pyarrow. However, I will supplement it with pickle (no compression): when writing to the cache, store both the pyarrow and pickle serialised forms; when reading from the cache, fall back to pickle if pyarrow deserialisation fails.
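The write-both, read-with-fallback strategy described at the end of this answer might look roughly like the sketch below. It is an illustration under stated assumptions, not the answerer's code: a plain dict stands in for redis, the key suffixes and the function names `cache_put`, `cache_get` and `broken_loads` are hypothetical, and the primary codec is passed in as a parameter (here `pickle.dumps` is used purely as a placeholder where the pyarrow serializer would go).

```python
import pickle

import pandas as pd


def cache_put(store, key, df, primary_dumps):
    """Write both serialised forms of df under two keys (hypothetical layout)."""
    store[key + ":primary"] = primary_dumps(df)
    store[key + ":pickle"] = pickle.dumps(df)  # uncompressed pickle as backup


def cache_get(store, key, primary_loads):
    """Read the primary form, falling back to pickle if decoding fails."""
    try:
        return primary_loads(store[key + ":primary"])
    except Exception:
        # e.g. the payload was written by an incompatible library version
        return pickle.loads(store[key + ":pickle"])


def broken_loads(payload):
    """Hypothetical decoder that simulates a version mismatch."""
    raise ValueError("cannot deserialise payload")


store = {}  # stand-in for redis in this sketch
df = pd.DataFrame({"A": [0, 1, 0], "B": [True, True, False]})

# pickle.dumps is only a placeholder for the primary (pyarrow) serializer.
cache_put(store, "frame", df, primary_dumps=pickle.dumps)
restored = cache_get(store, "frame", primary_loads=broken_loads)
```

Storing two forms doubles the cache footprint, so this trade-off mainly makes sense when a failed read is more expensive than the extra memory.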

User · Answer

Numpy file formats are pretty fast for numerical data.

I prefer to use numpy files since they're fast and easy to work with. Here's a simple benchmark for saving and loading a dataframe with 1 column of 1 million points.

    import numpy as np
    import pandas as pd

    num_dict = {'voltage': np.random.rand(1000000)}
    num_df = pd.DataFrame(num_dict)

Using ipython's %%timeit magic function:

    %%timeit
    with open('num.npy', 'wb') as np_file:
        np.save(np_file, num_df)

the output is

    100 loops, best of 3: 5.97 ms per loop

To load the data back into a dataframe:

    %%timeit
    with open('num.npy', 'rb') as np_file:
        data = np.load(np_file)

    data_df = pd.DataFrame(data)

the output is

    100 loops, best of 3: 5.12 ms per loop

NOT BAD!

CONS: There's a problem if you save the numpy file using python 2 and then try opening it using python 3 (or vice versa).
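One thing to keep in mind with this approach is that np.save stores only the array of values, so column names and the index are not written to the file. A possible way around that, sketched here as an assumption rather than as part of the benchmark above, is to write the values and the column labels together with np.savez. The second column and the file name are made up for the example.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical two-column frame so that the column labels matter.
num_df = pd.DataFrame({
    "voltage": np.random.rand(1000),
    "current": np.random.rand(1000),
})

path = os.path.join(tempfile.mkdtemp(), "num.npz")

# Store the values and the column labels side by side in one .npz archive.
np.savez(
    path,
    values=num_df.to_numpy(),
    columns=np.array(num_df.columns, dtype=str),
)

# Rebuild the frame; the default RangeIndex is recreated automatically.
with np.load(path) as archive:
    data_df = pd.DataFrame(archive["values"], columns=archive["columns"])
```

This only suits frames whose columns share one numeric dtype and whose index carries no information; mixed-type frames are better served by the other formats discussed in this thread.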

User · Answer

The easiest way is to pickle it using to_pickle:

    df.to_pickle(file_name)  # where to save it, usually as a .pkl

Then you can load it back using:

    df = pd.read_pickle(file_name)

Note: before 0.11.1 save and load were the only way to do this (they are now deprecated in favor of to_pickle and read_pickle respectively).

Another popular choice is to use HDF5 (pytables), which offers very fast access times for large datasets:

    import pandas as pd

    store = pd.HDFStore('store.h5')

    store['df'] = df  # save it
    store['df']  # load it

More advanced strategies are discussed in the cookbook.

Since 0.13 there's also msgpack, which may be better for interoperability, as a faster alternative to JSON, or if you have python object/text-heavy data (see this question).

User · Answer

Although there are already some answers, I found a nice comparison in which they tried several ways to serialize Pandas DataFrames: Efficiently Store Pandas DataFrames.

They compare:

- pickle: original ASCII data format
- cPickle, a C library
- pickle-p2: uses the newer binary format
- json: standardlib json library
- json-no-index: like json, but without index
- msgpack: binary JSON alternative
- CSV
- hdfstore: HDF5 storage format

In their experiment, they serialize a DataFrame of 1,000,000 rows with the two columns tested separately: one with text data, the other with numbers. Their disclaimer says:

    You should not trust that what follows generalizes to your data. You should look at your own data and run benchmarks yourself.

The source code for the test which they refer to is available online. Since this code did not work directly, I made some minor changes, which you can get here: serialize.py. I got the following results:

[graph of the benchmark results]

They also mention that converting text data to categorical data makes the serialization much faster, about 10 times as fast in their test (also see the test code).

Edit: The higher times for pickle than for CSV can be explained by the data format used. By default pickle uses a printable ASCII representation, which generates larger data sets. As can be seen from the graph, however, pickle using the newer binary data format (version 2, pickle-p2) has much lower load times.

Some other references:

- In the question Fastest Python library to read a CSV file there is a very detailed answer which compares different libraries for reading csv files, with a benchmark. The result is that for reading csv files numpy.fromfile is the fastest.
- Another serialization test shows msgpack, ujson, and cPickle to be the quickest in serializing.
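Following the disclaimer's advice to benchmark on your own data, a very small timing harness could look like the sketch below. It is not the serialize.py script mentioned above: the helper `time_call`, the sample frame, and the choice of only two formats (pickle and CSV) are assumptions made for illustration, and single-shot timings like these are noisy.

```python
import os
import tempfile
import time

import pandas as pd


def time_call(func, *args, **kwargs):
    """Return (seconds elapsed, result) for a single call of func."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return time.perf_counter() - start, result


# Hypothetical sample data: one numeric and one low-cardinality text column.
n = 100_000
df = pd.DataFrame({
    "number": range(n),
    "text": ["label_%d" % (i % 10) for i in range(n)],
})

workdir = tempfile.mkdtemp()
pkl_path = os.path.join(workdir, "df.pkl")
csv_path = os.path.join(workdir, "df.csv")

timings = {}
timings["pickle write"], _ = time_call(df.to_pickle, pkl_path)
timings["pickle read"], from_pickle = time_call(pd.read_pickle, pkl_path)
timings["csv write"], _ = time_call(df.to_csv, csv_path, index=False)
timings["csv read"], from_csv = time_call(pd.read_csv, csv_path)

# Text with few distinct values can be converted to categorical before saving.
df_cat = df.assign(text=df["text"].astype("category"))

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.1f} ms")
```

To compare the categorical variant, the same `time_call` lines can be repeated with `df_cat` in place of `df`; repeating each measurement several times gives more trustworthy numbers.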

User · Answer

Arctic is a high performance datastore for Pandas, numpy and other numeric data. It sits on top of MongoDB. Perhaps overkill for the OP, but worth mentioning for other folks stumbling across this post.

User · Answer

As already mentioned, there are different options and file formats (HDF5, JSON, CSV, parquet, SQL) to store a data frame. However, pickle is not a first-class citizen (depending on your setup), because:

- pickle is a potential security risk. From the Python documentation for pickle:

    Warning: The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

- pickle is slow. Find here and here benchmarks.

Depending on your setup/usage both limitations may not apply, but I would not recommend pickle as the default persistence for pandas data frames.

User · Answer

You can use a feather format file. It is extremely fast.

    df.to_feather('filename.ft')

User · Answer

https://docs.python.org/3/library/pickle.html

The pickle protocol formats:

- Protocol version 0 is the original "human-readable" protocol and is backwards compatible with earlier versions of Python.
- Protocol version 1 is an old binary format which is also compatible with earlier versions of Python.
- Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2.
- Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This is the default protocol, and the recommended protocol when compatibility with other Python 3 versions is required.
- Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. Refer to PEP 3154 for information about improvements brought by protocol 4.
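To see how the protocol choice plays out for a DataFrame, something like the following sketch can be used. The sample frame and the file name are made up, protocols 0 and 1 are left out only to keep the example short, and protocol 2 is picked for the file purely as an illustration of the `protocol` argument of `to_pickle`.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical sample frame with one numeric and one text column.
df = pd.DataFrame({"A": range(1000), "B": ["text"] * 1000})

# Size of the pickled bytes for each binary protocol from 2 up to the
# highest one this interpreter supports.
sizes = {
    protocol: len(pickle.dumps(df, protocol=protocol))
    for protocol in range(2, pickle.HIGHEST_PROTOCOL + 1)
}
print(sizes)

# to_pickle accepts a protocol argument; 2 is used here only as an example.
path = os.path.join(tempfile.mkdtemp(), "df_p2.pkl")
df.to_pickle(path, protocol=2)
restored = pd.read_pickle(path)
```

The default protocol depends on the Python version in use, which `pickle.DEFAULT_PROTOCOL` reports, so it is worth checking that value when files have to be shared between different interpreters.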

User · Answer

Another quite fresh test with to_pickle().

I have 25 .csv files in total to process, and the final dataframe consists of roughly 2M items.

(Note: Besides loading the .csv files, I also manipulate some data and extend the data frame by new columns.)

Going through all 25 .csv files and creating the dataframe takes around 14 sec. Loading the whole dataframe from a pkl file takes less than 1 sec.
