[python] How do I release memory used by a pandas dataframe?

I have a really large csv file that I opened in pandas as follows....

import pandas
df = pandas.read_csv('large_txt_file.txt')

Once I do this my memory usage increases by 2GB, which is expected because this file contains millions of rows. My problem comes when I need to release this memory. I ran....

del df

However, my memory usage did not drop. Is this the wrong approach to release memory used by a pandas data frame? If it is, what is the proper way?

This question is related to python pandas memory

The answer is


This solves the problem of releasing the memory for me!!!

import gc
import pandas as pd

del [[df_1,df_2]]
gc.collect()
df_1=pd.DataFrame()
df_2=pd.DataFrame()

the data-frame will be explicitly set to null

in the above statements

Firstly, the self reference of the dataframe is deleted meaning the dataframe is no longer available to python there after all the references of the dataframe is collected by garbage collector (gc.collect()) and then explicitly set all the references to empty dataframe.

more on the working of garbage collector is well explained in https://stackify.com/python-garbage-collection/


del df will not be deleted if there are any reference to the df at the time of deletion. So you need to to delete all the references to it with del df to release the memory.

So all the instances bound to df should be deleted to trigger garbage collection.

Use objgragh to check which is holding onto the objects.


It seems there is an issue with glibc that affects the memory allocation in Pandas: https://github.com/pandas-dev/pandas/issues/2659

The monkey patch detailed on this issue has resolved the problem for me:

# monkeypatches.py

# Solving memory leak problem in pandas
# https://github.com/pandas-dev/pandas/issues/2659#issuecomment-12021083
import pandas as pd
from ctypes import cdll, CDLL
try:
    cdll.LoadLibrary("libc.so.6")
    libc = CDLL("libc.so.6")
    libc.malloc_trim(0)
except (OSError, AttributeError):
    libc = None

__old_del = getattr(pd.DataFrame, '__del__', None)

def __new_del(self):
    if __old_del:
        __old_del(self)
    libc.malloc_trim(0)

if libc:
    print('Applying monkeypatch for pd.DataFrame.__del__', file=sys.stderr)
    pd.DataFrame.__del__ = __new_del
else:
    print('Skipping monkeypatch for pd.DataFrame.__del__: libc or malloc_trim() not found', file=sys.stderr)

Reducing memory usage in Python is difficult, because Python does not actually release memory back to the operating system. If you delete objects, then the memory is available to new Python objects, but not free()'d back to the system (see this question).

If you stick to numeric numpy arrays, those are freed, but boxed objects are not.

>>> import os, psutil, numpy as np
>>> def usage():
...     process = psutil.Process(os.getpid())
...     return process.get_memory_info()[0] / float(2 ** 20)
... 
>>> usage() # initial memory usage
27.5 

>>> arr = np.arange(10 ** 8) # create a large array without boxing
>>> usage()
790.46875
>>> del arr
>>> usage()
27.52734375 # numpy just free()'d the array

>>> arr = np.arange(10 ** 8, dtype='O') # create lots of objects
>>> usage()
3135.109375
>>> del arr
>>> usage()
2372.16796875  # numpy frees the array, but python keeps the heap big

Reducing the Number of Dataframes

Python keep our memory at high watermark, but we can reduce the total number of dataframes we create. When modifying your dataframe, prefer inplace=True, so you don't create copies.

Another common gotcha is holding on to copies of previously created dataframes in ipython:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'foo': [1,2,3,4]})

In [3]: df + 1
Out[3]: 
   foo
0    2
1    3
2    4
3    5

In [4]: df + 2
Out[4]: 
   foo
0    3
1    4
2    5
3    6

In [5]: Out # Still has all our temporary DataFrame objects!
Out[5]: 
{3:    foo
 0    2
 1    3
 2    4
 3    5, 4:    foo
 0    3
 1    4
 2    5
 3    6}

You can fix this by typing %reset Out to clear your history. Alternatively, you can adjust how much history ipython keeps with ipython --cache-size=5 (default is 1000).

Reducing Dataframe Size

Wherever possible, avoid using object dtypes.

>>> df.dtypes
foo    float64 # 8 bytes per value
bar      int64 # 8 bytes per value
baz     object # at least 48 bytes per value, often more

Values with an object dtype are boxed, which means the numpy array just contains a pointer and you have a full Python object on the heap for every value in your dataframe. This includes strings.

Whilst numpy supports fixed-size strings in arrays, pandas does not (it's caused user confusion). This can make a significant difference:

>>> import numpy as np
>>> arr = np.array(['foo', 'bar', 'baz'])
>>> arr.dtype
dtype('S3')
>>> arr.nbytes
9

>>> import sys; import pandas as pd
>>> s = pd.Series(['foo', 'bar', 'baz'])
dtype('O')
>>> sum(sys.getsizeof(x) for x in s)
120

You may want to avoid using string columns, or find a way of representing string data as numbers.

If you have a dataframe that contains many repeated values (NaN is very common), then you can use a sparse data structure to reduce memory usage:

>>> df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 1 columns):
foo    float64
dtypes: float64(1)
memory usage: 605.5 MB

>>> df1.shape
(39681584, 1)

>>> df1.foo.isnull().sum() * 100. / len(df1)
20.628483479893344 # so 20% of values are NaN

>>> df1.to_sparse().info()
<class 'pandas.sparse.frame.SparseDataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 1 columns):
foo    float64
dtypes: float64(1)
memory usage: 543.0 MB

Viewing Memory Usage

You can view the memory usage (docs):

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 14 columns):
...
dtypes: datetime64[ns](1), float64(8), int64(1), object(4)
memory usage: 4.4+ GB

As of pandas 0.17.1, you can also do df.info(memory_usage='deep') to see memory usage including objects.


As noted in the comments, there are some things to try: gc.collect (@EdChum) may clear stuff, for example. At least from my experience, these things sometimes work and often don't.

There is one thing that always works, however, because it is done at the OS, not language, level.

Suppose you have a function that creates an intermediate huge DataFrame, and returns a smaller result (which might also be a DataFrame):

def huge_intermediate_calc(something):
    ...
    huge_df = pd.DataFrame(...)
    ...
    return some_aggregate

Then if you do something like

import multiprocessing

result = multiprocessing.Pool(1).map(huge_intermediate_calc, [something_])[0]

Then the function is executed at a different process. When that process completes, the OS retakes all the resources it used. There's really nothing Python, pandas, the garbage collector, could do to stop that.


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to pandas

xlrd.biffh.XLRDError: Excel xlsx file; not supported Pandas Merging 101 How to increase image size of pandas.DataFrame.plot in jupyter notebook? Trying to merge 2 dataframes but get ValueError Python Pandas User Warning: Sorting because non-concatenation axis is not aligned How to show all of columns name on pandas dataframe? Pandas/Python: Set value of one column based on value in another column Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Python convert object to float

Examples related to memory

How does the "view" method work in PyTorch? How do I release memory used by a pandas dataframe? How to solve the memory error in Python Docker error : no space left on device Default Xmxsize in Java 8 (max heap size) How to set Apache Spark Executor memory What is the best way to add a value to an array in state How do I read a large csv file with pandas? How to clear variables in ipython? Error occurred during initialization of VM Could not reserve enough space for object heap Could not create the Java virtual machine