best way to preserve numpy arrays on disk

Question

I am looking for a fast way to preserve large numpy arrays. I want to save them to disk in a binary format, then read them back into memory relatively quickly. cPickle is not fast enough, unfortunately.

I found numpy.savez and numpy.load. But the weird thing is, numpy.load loads a npy file into "memory-map". That means regular manipulation of the arrays is really slow. For example, something like this would be really slow:

    #!/usr/bin/python
    import numpy as np
    import time
    from tempfile import TemporaryFile

    n = 10000000

    a = np.arange(n)
    b = np.arange(n) * 10
    c = np.arange(n) * -0.5

    file = TemporaryFile()
    np.savez(file, a=a, b=b, c=c)

    file.seek(0)
    t = time.time()
    z = np.load(file)
    print("loading time = ", time.time() - t)

    t = time.time()
    aa = z['a']
    bb = z['b']
    cc = z['c']
    print("assigning time = ", time.time() - t)

More precisely, the np.load call is really fast, but the remaining lines that assign the arrays to objects are ridiculously slow:

    loading time =  0.000220775604248
    assigning time =  2.72940087318

Is there any better way of preserving numpy arrays? Ideally, I want to be able to store multiple arrays in one file.

User · Answer

The lookup is slow because, when mmap is used, the load method does not read the contents of the array into memory. The data is loaded lazily, when a particular piece of it is needed, and in your case that happens at lookup time. A second lookup of the same data will not be as slow.

This is a nice feature of mmap: when you have a big array, you do not have to load all of the data into memory.
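The lazy loading described here can be seen with a plain .npy file. The following is a minimal sketch, not taken from the answer: the file name `big.npy` and the temporary directory are assumptions made for illustration. `np.load` with `mmap_mode="r"` returns a `numpy.memmap` whose data is read from disk only when it is accessed, while `np.load` without that argument reads the whole array into memory up front.

```python
import os
import tempfile

import numpy as np

# Hypothetical file location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "big.npy")
np.save(path, np.arange(1_000_000))

# Memory-mapped load: returns immediately, data stays on disk until touched.
lazy = np.load(path, mmap_mode="r")

# Regular load: the whole array is read into memory right away.
eager = np.load(path)

# Slicing the memmap reads only the requested part; np.array() copies it
# into an ordinary in-memory array.
chunk = np.array(lazy[:10])

is_lazy = isinstance(lazy, np.memmap)
is_eager = not isinstance(eager, np.memmap)
print(is_lazy, is_eager, chunk)
```

If the whole array is going to be processed anyway, the regular load is usually the simpler choice; the memory map pays off when only parts of a large file are needed.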

To solve this you can use joblib. With joblib.dump you can dump any object you want, even two or more numpy arrays. See the example:

import numpy as np
import joblib

firstArray = np.arange(100)
secondArray = np.arange(50)
# Put both arrays in one dictionary and save it to a single file
my_dict = {'first': firstArray, 'second': secondArray}
joblib.dump(my_dict, 'file_name.dat')
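Reading the arrays back is the other half of the round trip. The sketch below is an illustration under assumptions: the file name `arrays.joblib`, the temporary directory and the `compress=3` setting are choices made here, not part of the answer. It uses `joblib.load` to restore the dictionary written by `joblib.dump`.

```python
import os
import tempfile

import joblib
import numpy as np

first_array = np.arange(100)
second_array = np.arange(50)

# Hypothetical file location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "arrays.joblib")

# compress=3 is an optional trade-off: smaller file, somewhat slower I/O.
joblib.dump({"first": first_array, "second": second_array}, path, compress=3)

# joblib.load returns the same dictionary structure that was dumped.
restored = joblib.load(path)
print(sorted(restored))
```

Leaving out `compress` writes the data uncompressed, which is typically faster to save and load but takes more disk space.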

User · Answer

I'm a big fan of hdf5 for storing large numpy arrays. There are two options for dealing with hdf5 in python:

http://www.pytables.org/

http://www.h5py.org/

Both are designed to work with numpy arrays efficiently.
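As a rough illustration of the h5py option, here is a minimal sketch that stores three arrays in one HDF5 file and reads them back. The file name `arrays.h5`, the dataset names and the use of gzip compression on one dataset are assumptions made for this example, not something the answer prescribes.

```python
import os
import tempfile

import h5py
import numpy as np

a = np.arange(1000)
b = np.arange(1000) * 10
c = np.arange(1000) * -0.5

# Hypothetical file location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "arrays.h5")

# Each array becomes a named dataset inside the same file.
with h5py.File(path, "w") as h5f:
    h5f.create_dataset("a", data=a)
    h5f.create_dataset("b", data=b)
    h5f.create_dataset("c", data=c, compression="gzip")  # optional compression

with h5py.File(path, "r") as h5f:
    names = sorted(h5f.keys())
    c_head = h5f["c"][:5]   # reads only the first five values from disk
    a_back = h5f["a"][...]  # reads the whole dataset into memory

print(names, c_head)
```

Slicing a dataset reads just that part of the file, so a single file can hold several large arrays without all of them having to be loaded at once.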

User · Answer

There is now an HDF5-based clone of pickle called hickle:

https://github.com/telegraphic/hickle

    import hickle as hkl

    data = {'name': 'test', 'data_arr': [1, 2, 3, 4]}

    # Dump data to file
    hkl.dump(data, 'new_data_file.hkl')

    # Load data from file
    data2 = hkl.load('new_data_file.hkl')

    print(data == data2)

EDIT:

There is also the possibility to "pickle" directly into a compressed archive by doing:

    import pickle, gzip, lzma, bz2

    pickle.dump(data, gzip.open('data.pkl.gz', 'wb'))
    pickle.dump(data, lzma.open('data.pkl.lzma', 'wb'))
    pickle.dump(data, bz2.open('data.pkl.bz2', 'wb'))

Appendix

    import numpy as np
    import matplotlib.pyplot as plt
    import pickle, os, time
    import gzip, lzma, bz2, h5py

    compressions = ['pickle', 'h5py', 'gzip', 'lzma', 'bz2']
    labels = ['pickle', 'h5py', 'pickle+gzip', 'pickle+lzma', 'pickle+bz2']
    size = 1000

    data = {}

    # Random data
    data['random'] = np.random.random((size, size))

    # Not that random data
    data['semi-random'] = np.zeros((size, size))
    for i in range(size):
        for j in range(size):
            data['semi-random'][i, j] = np.sum(data['random'][i, :]) + np.sum(data['random'][:, j])

    # Not random data
    data['not-random'] = np.arange(size * size, dtype=np.float64).reshape((size, size))

    sizes = {}

    for key in data:

        sizes[key] = {}

        for compression in compressions:

            if compression == 'pickle':
                time_start = time.time()
                pickle.dump(data[key], open('data.pkl', 'wb'))
                time_tot = time.time() - time_start
                sizes[key]['pickle'] = (os.path.getsize('data.pkl') * 10**(-6), time_tot)
                os.remove('data.pkl')

            elif compression == 'h5py':
                time_start = time.time()
                with h5py.File('data.pkl.{}'.format(compression), 'w') as h5f:
                    h5f.create_dataset('data', data=data[key])
                time_tot = time.time() - time_start
                sizes[key][compression] = (os.path.getsize('data.pkl.{}'.format(compression)) * 10**(-6), time_tot)
                os.remove('data.pkl.{}'.format(compression))

            else:
                time_start = time.time()
                pickle.dump(data[key], eval(compression).open('data.pkl.{}'.format(compression), 'wb'))
                time_tot = time.time() - time_start
                sizes[key][labels[compressions.index(compression)]] = (os.path.getsize('data.pkl.{}'.format(compression)) * 10**(-6), time_tot)
                os.remove('data.pkl.{}'.format(compression))

    f, ax_size = plt.subplots()
    ax_time = ax_size.twinx()

    x_ticks = labels
    x = np.arange(len(x_ticks))

    y_size = {}
    y_time = {}
    for key in data:
        y_size[key] = [sizes[key][x_ticks[i]][0] for i in x]
        y_time[key] = [sizes[key][x_ticks[i]][1] for i in x]

    width = .2
    viridis = plt.cm.viridis

    p1 = ax_size.bar(x - width, y_size['random'], width, color=viridis(0))
    p2 = ax_size.bar(x, y_size['semi-random'], width, color=viridis(.45))
    p3 = ax_size.bar(x + width, y_size['not-random'], width, color=viridis(.9))

    p4 = ax_time.bar(x - width, y_time['random'], .02, color='red')
    ax_time.bar(x, y_time['semi-random'], .02, color='red')
    ax_time.bar(x + width, y_time['not-random'], .02, color='red')

    ax_size.legend((p1, p2, p3, p4), ('random', 'semi-random', 'not-random', 'saving time'), loc='upper center', bbox_to_anchor=(.5, -.1), ncol=4)
    ax_size.set_xticks(x)
    ax_size.set_xticklabels(x_ticks)

    f.suptitle('Pickle Compression Comparison')
    ax_size.set_ylabel('Size [MB]')
    ax_time.set_ylabel('Time [s]')

    f.savefig('sizes.pdf', bbox_inches='tight')

User · Answer

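savez() saves data in a zip file. It may take some time to zip and unzip the file. You can use the save() and load() functions instead:

    import numpy as np

    # Write the arrays one after another into the same open file
    f = open("tmp.bin", "wb")
    np.save(f, a)
    np.save(f, b)
    np.save(f, c)
    f.close()

    # Read them back in the same order they were written
    f = open("tmp.bin", "rb")
    aa = np.load(f)
    bb = np.load(f)
    cc = np.load(f)
    f.close()

To save multiple arrays in one file, you just need to open the file first, and then save or load the arrays in sequence.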

User · Answer

I've compared performance (space and time) for a number of ways to store numpy arrays. Few of them support multiple arrays per file, but perhaps it's useful anyway.

Npy and binary files are both really fast and small for dense data. If the data is sparse or very structured, you might want to use npz with compression, which'll save a lot of space but cost some load time.

If portability is an issue, binary is better than npy. If human readability is important, then you'll have to sacrifice a lot of performance, but it can be achieved fairly well using csv (which is also very portable of course).

More details and the code are available at the github repo.
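The space trade-off between these formats can be checked on a small scale. The sketch below is not the benchmark from the repo mentioned above; the all-zeros test array and the file names are assumptions chosen to show the "very structured data" case. It writes the same array as npy, as compressed npz, and as raw binary, then compares file sizes.

```python
import os
import tempfile

import numpy as np

# Highly structured data (all zeros), chosen because it compresses well.
data = np.zeros((1000, 1000))

# Hypothetical file locations, used only for this illustration.
tmp_dir = tempfile.mkdtemp()
npy_path = os.path.join(tmp_dir, "data.npy")
npz_path = os.path.join(tmp_dir, "data.npz")
bin_path = os.path.join(tmp_dir, "data.bin")

np.save(npy_path, data)                   # npy: small header plus raw data
np.savez_compressed(npz_path, data=data)  # npz with compression
data.tofile(bin_path)                     # raw binary: data only, no metadata

sizes = {
    "npy": os.path.getsize(npy_path),
    "npz": os.path.getsize(npz_path),
    "bin": os.path.getsize(bin_path),
}
print(sizes)

# Raw binary stores neither dtype nor shape, so both must be supplied
# by the reader.
from_bin = np.fromfile(bin_path, dtype=data.dtype).reshape(data.shape)

with np.load(npz_path) as npz:
    from_npz = npz["data"]
```

The raw binary file is portable in the sense that any language can read it, but the reader has to know the dtype and shape in advance; the npy header is what records them.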

User · Answer

Another possibility to store numpy arrays efficiently is Bloscpack:

    #!/usr/bin/python
    import numpy as np
    import bloscpack as bp
    import time

    n = 10000000

    a = np.arange(n)
    b = np.arange(n) * 10
    c = np.arange(n) * -0.5
    # Total size of the three arrays in MB
    tsizeMB = sum(i.size * i.itemsize for i in (a, b, c)) / 2**20

    blosc_args = bp.DEFAULT_BLOSC_ARGS
    blosc_args['clevel'] = 6
    t = time.time()
    bp.pack_ndarray_file(a, 'a.blp', blosc_args=blosc_args)
    bp.pack_ndarray_file(b, 'b.blp', blosc_args=blosc_args)
    bp.pack_ndarray_file(c, 'c.blp', blosc_args=blosc_args)
    t1 = time.time() - t
    print("store time = %.2f (%.2f MB/s)" % (t1, tsizeMB / t1))

    t = time.time()
    a1 = bp.unpack_ndarray_file('a.blp')
    b1 = bp.unpack_ndarray_file('b.blp')
    c1 = bp.unpack_ndarray_file('c.blp')
    t1 = time.time() - t
    print("loading time = %.2f (%.2f MB/s)" % (t1, tsizeMB / t1))

and the output for my laptop (a relatively old MacBook Air with a Core2 processor):

    $ python store-blpk.py
    store time = 0.19 (1216.45 MB/s)
    loading time = 0.25 (898.08 MB/s)

That means that it can store really fast, i.e. the bottleneck is typically the disk. However, as the compression ratios are pretty good here, the effective speed is multiplied by the compression ratios. Here are the sizes for these 76 MB arrays:

    $ ll -h *.blp
    -rw-r--r--  1 faltet  staff   921K Mar  6 13:50 a.blp
    -rw-r--r--  1 faltet  staff   2.2M Mar  6 13:50 b.blp
    -rw-r--r--  1 faltet  staff   1.4M Mar  6 13:50 c.blp

Please note that the use of the Blosc compressor is fundamental for achieving this. The same script but using 'clevel' = 0 (i.e. disabling compression):

    $ python bench/store-blpk.py
    store time = 3.36 (68.04 MB/s)
    loading time = 2.61 (87.80 MB/s)

is clearly bottlenecked by the disk performance.
