[python] Python readlines() usage and efficient practice for reading

The short version is: The efficient way to use readlines() is to not use it. Ever.


I read some doc notes on readlines(), where people has claimed that this readlines() reads whole file content into memory and hence generally consumes more memory compared to readline() or read().

The documentation for readlines() explicitly guarantees that it reads the whole file into memory, and parses it into lines, and builds a list full of strings out of those lines.

But the documentation for read() likewise guarantees that it reads the whole file into memory, and builds a string, so that doesn't help.


On top of using more memory, this also means you can't do any work until the whole thing is read. If you alternate reading and processing in even the most naive way, you will benefit from at least some pipelining (thanks to the OS disk cache, DMA, CPU pipeline, etc.), so you will be working on one batch while the next batch is being read. But if you force the computer to read the whole file in, then parse the whole file, then run your code, you only get one region of overlapping work for the entire file, instead of one region of overlapping work per read.


You can work around this in three ways:

  1. Write a loop around readlines(sizehint), read(size), or readline().
  2. Just use the file as a lazy iterator without calling any of these.
  3. mmap the file, which allows you to treat it as a giant string without first reading it in.

For example, this has to read all of foo at once:

with open('foo') as f:
    lines = f.readlines()
    for line in lines:
        pass

But this only reads about 8K at a time:

with open('foo') as f:
    while True:
        lines = f.readlines(8192)
        if not lines:
            break
        for line in lines:
            pass

And this only reads one line at a time—although Python is allowed to (and will) pick a nice buffer size to make things faster.

with open('foo') as f:
    while True:
        line = f.readline()
        if not line:
            break
        pass

And this will do the exact same thing as the previous:

with open('foo') as f:
    for line in f:
        pass

Meanwhile:

but should the garbage collector automatically clear that loaded content from memory at the end of my loop, hence at any instant my memory should have only the contents of my currently processed file right ?

Python doesn't make any such guarantees about garbage collection.

The CPython implementation happens to use refcounting for GC, which means that in your code, as soon as file_content gets rebound or goes away, the giant list of strings, and all of the strings within it, will be freed to the freelist, meaning the same memory can be reused again for your next pass.

However, all those allocations, copies, and deallocations aren't free—it's much faster to not do them than to do them.

On top of that, having your strings scattered across a large swath of memory instead of reusing the same small chunk of memory over and over hurts your cache behavior.

Plus, while the memory usage may be constant (or, rather, linear in the size of your largest file, rather than in the sum of your file sizes), that rush of mallocs to expand it the first time will be one of the slowest things you do (which also makes it much harder to do performance comparisons).


Putting it all together, here's how I'd write your program:

for filename in os.listdir(input_dir):
    with open(filename, 'rb') as f:
        if filename.endswith(".gz"):
            f = gzip.open(fileobj=f)
        words = (line.split(delimiter) for line in f)
        ... my logic ...  

Or, maybe:

for filename in os.listdir(input_dir):
    if filename.endswith(".gz"):
        f = gzip.open(filename, 'rb')
    else:
        f = open(filename, 'rb')
    with contextlib.closing(f):
        words = (line.split(delimiter) for line in f)
        ... my logic ...

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to performance

Why is 2 * (i * i) faster than 2 * i * i in Java? What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism? How to check if a key exists in Json Object and get its value Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly? Most efficient way to map function over numpy array The most efficient way to remove first N elements in a list? Fastest way to get the first n elements of a List into an Array Why is "1000000000000000 in range(1000000000000001)" so fast in Python 3? pandas loc vs. iloc vs. at vs. iat? Android Recyclerview vs ListView with Viewholder

Examples related to memory

How does the "view" method work in PyTorch? How do I release memory used by a pandas dataframe? How to solve the memory error in Python Docker error : no space left on device Default Xmxsize in Java 8 (max heap size) How to set Apache Spark Executor memory What is the best way to add a value to an array in state How do I read a large csv file with pandas? How to clear variables in ipython? Error occurred during initialization of VM Could not reserve enough space for object heap Could not create the Java virtual machine

Examples related to python-2.6

Suppress InsecureRequestWarning: Unverified HTTPS request is being made in Python2.6 How to fix symbol lookup error: undefined symbol errors in a cluster environment Python readlines() usage and efficient practice for reading sort dict by value python Visibility of global variables in imported modules bash: pip: command not found How to make an unaware datetime timezone aware in python Get all object attributes in Python? How to convert a set to a list in python? How do you get the current text contents of a QComboBox?

Examples related to readlines

How to get length of a list of lists in python Python readlines() usage and efficient practice for reading How to read a file without newlines? Break string into list of characters in Python How to read a file line-by-line into a list?