How do I release memory used by a pandas dataframe?

Question

I have a really large csv file that I opened in pandas as follows:

    import pandas
    df = pandas.read_csv('large_txt_file.txt')

Once I do this, my memory usage increases by 2GB, which is expected because this file contains millions of rows. My problem comes when I need to release this memory. I ran:

    del df

However, my memory usage did not drop. Is this the wrong approach to release memory used by a pandas data frame? If it is, what is the proper way?

User · Answer

As noted in the comments, there are some things to try: gc.collect (@EdChum) may clear stuff, for example. At least from my experience, these things sometimes work and often don't.

There is one thing that always works, however, because it is done at the OS, not language, level.

Suppose you have a function that creates an intermediate huge DataFrame, and returns a smaller result (which might also be a DataFrame):

    def huge_intermediate_calc(something):
        ...
        huge_df = pd.DataFrame(...)
        ...
        return some_aggregate

Then if you do something like

    import multiprocessing

    result = multiprocessing.Pool(1).map(huge_intermediate_calc, [something])[0]

then the function is executed in a different process. When that process completes, the OS retakes all the resources it used. There's really nothing Python, pandas, or the garbage collector could do to stop that.
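A minimal runnable sketch of the subprocess approach above, assuming pandas and NumPy are installed. The row count and the column name `x` are made up for illustration. The `if __name__ == "__main__":` guard is there because platforms that start workers by re-importing the main module (Windows, macOS) need it.

```python
import multiprocessing

import numpy as np
import pandas as pd


def huge_intermediate_calc(n_rows):
    """Build a large temporary DataFrame and return only a small aggregate."""
    # Illustrative stand-in for the real heavy computation.
    huge_df = pd.DataFrame({"x": np.arange(n_rows, dtype="float64")})
    return float(huge_df["x"].sum())


if __name__ == "__main__":
    # The DataFrame only ever exists inside the worker process. When the
    # pool is torn down, the OS reclaims all of that process's memory.
    with multiprocessing.Pool(1) as pool:
        result = pool.map(huge_intermediate_calc, [1_000_000])[0]
    print(result)
```

If the pool is kept alive for many calls, passing `maxtasksperchild=1` to `multiprocessing.Pool` should make each task run in a fresh worker, so memory is handed back after every task rather than only at shutdown.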

User · Answer

Reducing memory usage in Python is difficult, because Python does not actually release memory back to the operating system. If you delete objects, then the memory is available to new Python objects, but not free()'d back to the system (see this question).

If you stick to numeric numpy arrays, those are freed, but boxed objects are not.

    >>> import os, psutil, numpy as np
    >>> def usage():
    ...     process = psutil.Process(os.getpid())
    ...     # older psutil versions: process.get_memory_info()
    ...     return process.memory_info()[0] / float(2 ** 20)
    ...
    >>> usage() # initial memory usage
    27.5

    >>> arr = np.arange(10 ** 8) # create a large array without boxing
    >>> usage()
    790.46875
    >>> del arr
    >>> usage()
    27.52734375 # numpy just free()'d the array

    >>> arr = np.arange(10 ** 8, dtype='O') # create lots of objects
    >>> usage()
    3135.109375
    >>> del arr
    >>> usage()
    2372.16796875  # numpy frees the array, but python keeps the heap big

Reducing the Number of Dataframes

Python keeps our memory at the high watermark, but we can reduce the total number of dataframes we create. When modifying your dataframe, prefer inplace=True, so you don't create copies (note that some pandas operations still copy internally even with inplace=True, so measure the effect on your own workload).

Another common gotcha is holding on to copies of previously created dataframes in ipython:

    In [1]: import pandas as pd

    In [2]: df = pd.DataFrame({'foo': [1,2,3,4]})

    In [3]: df + 1
    Out[3]:
       foo
    0    2
    1    3
    2    4
    3    5

    In [4]: df + 2
    Out[4]:
       foo
    0    3
    1    4
    2    5
    3    6

    In [5]: Out # Still has all our temporary DataFrame objects!
    Out[5]:
    {3:    foo
     0    2
     1    3
     2    4
     3    5, 4:    foo
     0    3
     1    4
     2    5
     3    6}

You can fix this by typing %reset Out to clear your history. Alternatively, you can adjust how much history ipython keeps with ipython --cache-size=5 (default is 1000).

Reducing Dataframe Size

Wherever possible, avoid using object dtypes.

    >>> df.dtypes
    foo    float64 # 8 bytes per value
    bar      int64 # 8 bytes per value
    baz     object # at least 48 bytes per value, often more

Values with an object dtype are boxed, which means the numpy array just contains a pointer and you have a full Python object on the heap for every value in your dataframe. This includes strings.

Whilst numpy supports fixed-size strings in arrays, pandas does not (it's caused user confusion). This can make a significant difference:

    >>> import numpy as np
    >>> arr = np.array(['foo', 'bar', 'baz'])
    >>> arr.dtype
    dtype('S3')
    >>> arr.nbytes
    9

    >>> import sys; import pandas as pd
    >>> s = pd.Series(['foo', 'bar', 'baz'], dtype='O')
    >>> sum(sys.getsizeof(x) for x in s)
    120

You may want to avoid using string columns, or find a way of representing string data as numbers.

If you have a dataframe that contains many repeated values (NaN is very common), then you can use a sparse data structure to reduce memory usage (the to_sparse() call shown here was removed in pandas 1.0 in favour of sparse dtypes):

    >>> df1.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 39681584 entries, 0 to 39681583
    Data columns (total 1 columns):
    foo    float64
    dtypes: float64(1)
    memory usage: 605.5 MB

    >>> df1.shape
    (39681584, 1)

    >>> df1.foo.isnull().sum() * 100. / len(df1)
    20.628483479893344 # so 20% of values are NaN

    >>> df1.to_sparse().info()
    <class 'pandas.sparse.frame.SparseDataFrame'>
    Int64Index: 39681584 entries, 0 to 39681583
    Data columns (total 1 columns):
    foo    float64
    dtypes: float64(1)
    memory usage: 543.0 MB

Viewing Memory Usage

You can view the memory usage (docs):

    >>> df.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 39681584 entries, 0 to 39681583
    Data columns (total 14 columns):
    ...
    dtypes: datetime64[ns](1), float64(8), int64(1), object(4)
    memory usage: 4.4+ GB

As of pandas 0.17.1, you can also do df.info(memory_usage='deep') to see memory usage including objects.
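To make the dtype advice above concrete, here is a small sketch comparing memory footprints. It swaps in two techniques the answer does not name explicitly: the `category` dtype as one way of "representing string data as numbers", and `pd.SparseDtype`, which newer pandas versions offer in place of `to_sparse()`. The data is invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented data: a string column with only three distinct values...
cities = pd.Series(["London", "Paris", "Tokyo"] * 30_000, dtype=object)
# ...and a float column in which four out of five values are NaN.
readings = pd.Series(np.where(np.arange(90_000) % 5 == 0, 1.5, np.nan))

# deep=True also counts the Python string objects behind the pointers.
object_bytes = cities.memory_usage(index=False, deep=True)
category_bytes = cities.astype("category").memory_usage(index=False, deep=True)

# Dense float64 storage versus storing only the non-NaN values.
dense_bytes = readings.memory_usage(index=False)
sparse = readings.astype(pd.SparseDtype("float64", np.nan))
sparse_bytes = sparse.memory_usage(index=False)

print(f"object dtype:   {object_bytes:>10,} bytes")
print(f"category dtype: {category_bytes:>10,} bytes")
print(f"dense floats:   {dense_bytes:>10,} bytes")
print(f"sparse floats:  {sparse_bytes:>10,} bytes")
```

The exact numbers depend on the Python and pandas versions, so none are quoted here; the ordering (category below object, sparse below dense) is what the sketch is meant to show.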

User · Answer

del df will not free the DataFrame if there are any other references to it at the time of deletion. So you need to delete all the references to it with del to release the memory.

In other words, every name bound to the DataFrame must be deleted before garbage collection can reclaim it.

Use objgraph to check what is holding onto the objects.
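The point about lingering references can be checked with a weak reference, which observes an object without keeping it alive. This is a sketch on a toy DataFrame; the names `alias` and `probe` are illustrative only.

```python
import gc
import weakref

import pandas as pd

df = pd.DataFrame({"foo": [1, 2, 3, 4]})
alias = df                # a second name bound to the same object
probe = weakref.ref(df)   # a weak reference does not keep df alive

del df
# The DataFrame is still alive here because `alias` refers to it.
still_alive = probe() is not None

del alias
gc.collect()              # also sweeps up any reference cycles
now_freed = probe() is None

print(still_alive, now_freed)
```

If installing objgraph is not an option, the standard library's `gc.get_referrers(obj)` returns the objects that currently hold a reference to `obj`, which is often enough to find the stray name.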

User · Answer

It seems there is an issue with glibc that affects the memory allocation in Pandas: https://github.com/pandas-dev/pandas/issues/2659

The monkey patch detailed on this issue has resolved the problem for me:

    # monkeypatches.py

    # Solving memory leak problem in pandas
    # https://github.com/pandas-dev/pandas/issues/2659#issuecomment-12021083
    import sys
    import pandas as pd
    from ctypes import cdll, CDLL
    try:
        cdll.LoadLibrary("libc.so.6")
        libc = CDLL("libc.so.6")
        libc.malloc_trim(0)
    except (OSError, AttributeError):
        libc = None

    __old_del = getattr(pd.DataFrame, '__del__', None)

    def __new_del(self):
        if __old_del:
            __old_del(self)
        # hand freed heap pages back to the OS
        libc.malloc_trim(0)

    if libc:
        print('Applying monkeypatch for pd.DataFrame.__del__', file=sys.stderr)
        pd.DataFrame.__del__ = __new_del
    else:
        print('Skipping monkeypatch for pd.DataFrame.__del__: libc or malloc_trim() not found', file=sys.stderr)
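If patching `DataFrame.__del__` feels too invasive, the same glibc call can be made explicitly after dropping large objects. This is a sketch rather than the fix from the issue: `trim_memory` is a hypothetical helper name, and it quietly does nothing on platforms without glibc's `malloc_trim`.

```python
import ctypes
import gc


def trim_memory():
    """Ask glibc to return free heap pages to the OS.

    Returns malloc_trim's result (1 if memory was released, 0 if not),
    or None when libc.so.6 or malloc_trim is unavailable.
    """
    gc.collect()  # free unreachable Python objects first
    try:
        libc = ctypes.CDLL("libc.so.6")
        return libc.malloc_trim(0)
    except (OSError, AttributeError):
        return None


result = trim_memory()
print(result)
```

Calling this once after a `del` of a large DataFrame keeps the cost of `malloc_trim` out of every destructor, at the price of having to remember to call it.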

User · Answer

This solves the problem of releasing the memory for me:

    import gc
    import pandas as pd

    del [[df_1, df_2]]
    gc.collect()
    df_1 = pd.DataFrame()
    df_2 = pd.DataFrame()

The dataframes are explicitly reset by the statements above. First, the names referring to each dataframe are deleted, so the dataframes are no longer reachable from Python. Then the garbage collector (gc.collect()) reclaims anything left unreferenced, and finally the names are rebound to empty dataframes.

The workings of the garbage collector are explained well in https://stackify.com/python-garbage-collection
