[python] Python readlines() usage and efficient practice for reading

I need to parse thousands of text files in a folder (around 3000 lines per file, ~400KB each). I read them using readlines():

   import gzip
   import os

   for filename in os.listdir(input_dir):
       # os.listdir() returns bare names, so build the full path first
       path = os.path.join(input_dir, filename)
       if filename.endswith(".gz"):
           f = gzip.open(path, 'rb')
       else:
           f = open(path, 'rb')

       file_content = f.readlines()
       f.close()

       len_file = len(file_content)
       i = 0
       while i < len_file:
           line = file_content[i].split(delimiter)
           ... my logic ...
           i += 1

This works fine on samples of my input (50 to 100 files). When I ran it on the whole input, more than 5K files, the time taken was nowhere close to a linear increase. I decided to do a performance analysis and profiled the run with cProfile. The time taken grows far faster than the number of files (it looks exponential), and the rate gets worse as the input approaches 7K files.

Here is the cumulative time taken by readlines(): the first row is for 354 files (a sample of the input) and the second is for 7473 files (the whole input).

 ncalls    tottime  percall    cumtime  percall filename:lineno(function)
    354      0.192    0.001      0.192    0.001 {method 'readlines' of 'file' objects}
   7473   1329.380    0.178   1329.380    0.178 {method 'readlines' of 'file' objects}

Because of this, the time taken by my code does not scale linearly as the input grows. I read some documentation notes on readlines(), where people claim that readlines() reads the whole file content into memory and hence generally consumes more memory than readline() or read().

I agree with this point, but shouldn't the garbage collector automatically clear the loaded content from memory at the end of each loop iteration, so that at any instant memory holds only the contents of the file currently being processed? There seems to be some catch here. Can somebody give some insight into this issue?

Is this inherent behavior of readlines(), or am I misinterpreting how the Python garbage collector works? I would be glad to know.

Also, please suggest some alternative ways of doing the same thing in a memory- and time-efficient manner. TIA.

This question is related to python performance memory python-2.6 readlines

The answer is


Read line by line, not the whole file:

    for line in open(file_name, 'rb'):
        # process line here

Even better, use a with statement so the file is closed automatically:

    with open(file_name, 'rb') as f:
        for line in f:
            # process line here

The above will read the file object using an iterator, one line at a time.
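
Applied to the folder loop from the question, the same idea might look like the sketch below. It is only an illustration under stated assumptions: iter_lines and count_fields are hypothetical helper names, the tab delimiter is a guess, and counting fields stands in for the per-line logic that the question does not show. The file is closed with try/finally rather than with because gzip file objects only gained with-statement support in Python 2.7, and the question is tagged python-2.6.

```python
import gzip
import os


def iter_lines(path):
    """Yield lines from a plain or gzip-compressed file, one at a time."""
    # Hypothetical helper: picks the opener from the file extension.
    opener = gzip.open if path.endswith(".gz") else open
    f = opener(path, "rb")
    try:
        for line in f:
            yield line
    finally:
        # Runs when the file is exhausted or the generator is discarded.
        f.close()


def count_fields(input_dir, delimiter=b"\t"):
    """Stand-in for the real per-line logic: total field count over all files."""
    total = 0
    for filename in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, filename)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        for line in iter_lines(path):
            # Only the current line is held here, never a list of all lines.
            total += len(line.rstrip(b"\r\n").split(delimiter))
    return total
```

Calling count_fields(input_dir) walks every file once. Because no list of lines is built, memory use per file stays close to the size of a single line plus the read buffer.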

