Python readlines() usage and efficient practice for reading

Question

I need to parse thousands of text files (around 3000 lines in each file, about 400 KB in size) in a folder. I read them using readlines():

    for filename in os.listdir(input_dir):
        if filename.endswith(".gz"):
            f = gzip.open(file, 'rb')
        else:
            f = open(file, 'rb')

        file_content = f.readlines()
        f.close()
        len_file = len(file_content)
        while i < len_file:
            line = file_content[i].split(delimiter)
            ... my logic ...
            i += 1

This works completely fine for a sample of my inputs (50, 100 files). When I ran it on the whole input of more than 5K files, the time taken was nowhere close to a linear increase. I planned to do a performance analysis and ran cProfile. The time taken increases exponentially as more files are added, reaching worse rates when the input reaches 7K files.

Here is the cumulative time taken for readlines(): the first row is for 354 files (a sample of the input) and the second is for 7473 files (the whole input).

    ncalls   tottime  percall   cumtime  percall  filename:lineno(function)
       354     0.192    0.001     0.192    0.001  {method 'readlines' of 'file' objects}
      7473  1329.380    0.178  1329.380    0.178  {method 'readlines' of 'file' objects}

Because of this, the time taken by my code does not scale linearly as the input increases. I read some doc notes on readlines(), where people have claimed that this readlines() reads whole file content into memory and hence generally consumes more memory compared to readline() or read().

I agree with this point, but should the garbage collector automatically clear that loaded content from memory at the end of my loop, hence at any instant my memory should have only the contents of my currently processed file, right? But there is some catch here. Can somebody give some insight into this issue?

Is this an inherent behavior of readlines(), or my wrong interpretation of the Python garbage collector? Glad to know.

Also, please suggest some alternative ways of doing the same in a memory- and time-efficient manner. TIA.

User · Answer

Read line by line, not the whole file:

    for line in open(file_name, 'rb'):
        pass  # process line here

Even better, use a with statement so the file is closed automatically:

    with open(file_name, 'rb') as f:
        for line in f:
            pass  # process line here

The above will read the file object using an iterator, one line at a time.
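As a rough sketch of the same idea applied to a whole folder, a generator can hand out lines lazily so that only one line is held at a time. The names iter_lines and count_fields are hypothetical, and count_fields is only a stand-in for the asker's own logic; a comma delimiter is assumed purely for illustration.

```python
import gzip
import os


def iter_lines(path):
    """Yield lines from a plain or gzip-compressed file, one at a time."""
    # Both open() and gzip.open() return objects that iterate lazily by line.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        for line in f:
            yield line


def count_fields(input_dir, delimiter=b","):
    """Hypothetical stand-in for 'my logic': total number of fields seen."""
    total = 0
    for filename in sorted(os.listdir(input_dir)):
        # os.listdir returns bare names, so join them with the directory.
        path = os.path.join(input_dir, filename)
        if not os.path.isfile(path):
            continue
        for line in iter_lines(path):
            total += len(line.rstrip(b"\r\n").split(delimiter))
    return total
```

Because the files are opened in binary mode, each line is a bytes object on Python 3, which is why the delimiter is given as bytes here.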

User · Answer

The short version is: the efficient way to use readlines() is to not use it. Ever.

> I read some doc notes on readlines(), where people have claimed that this readlines() reads whole file content into memory and hence generally consumes more memory compared to readline() or read().

The documentation for readlines() explicitly guarantees that it reads the whole file into memory, parses it into lines, and builds a list full of strings out of those lines.

But the documentation for read() likewise guarantees that it reads the whole file into memory and builds a string, so that doesn't help.

On top of using more memory, this also means you can't do any work until the whole thing is read. If you alternate reading and processing in even the most naive way, you will benefit from at least some pipelining (thanks to the OS disk cache, DMA, CPU pipeline, etc.), so you will be working on one batch while the next batch is being read. But if you force the computer to read the whole file in, then parse the whole file, then run your code, you only get one region of overlapping work for the entire file, instead of one region of overlapping work per read.

You can work around this in three ways:

1. Write a loop around readlines(sizehint), read(size), or readline().
2. Just use the file as a lazy iterator without calling any of these.
3. mmap the file, which allows you to treat it as a giant string without first reading it in.

For example, this has to read all of foo at once:

    with open('foo') as f:
        lines = f.readlines()
        for line in lines:
            pass

But this only reads about 8K at a time:

    with open('foo') as f:
        while True:
            lines = f.readlines(8192)
            if not lines:
                break
            for line in lines:
                pass

And this only reads one line at a time, although Python is allowed to (and will) pick a nice buffer size to make things faster:

    with open('foo') as f:
        while True:
            line = f.readline()
            if not line:
                break
            pass

And this will do the exact same thing as the previous:

    with open('foo') as f:
        for line in f:
            pass

Meanwhile:

> but should the garbage collector automatically clear that loaded content from memory at the end of my loop, hence at any instant my memory should have only the contents of my currently processed file, right?

Python doesn't make any such guarantees about garbage collection.

The CPython implementation happens to use refcounting for GC, which means that in your code, as soon as file_content gets rebound or goes away, the giant list of strings, and all of the strings within it, will be freed to the freelist, meaning the same memory can be reused again for your next pass.

However, all those allocations, copies, and deallocations aren't free. It's much faster to not do them than to do them.

On top of that, having your strings scattered across a large swath of memory instead of reusing the same small chunk of memory over and over hurts your cache behavior.

Plus, while the memory usage may be constant (or, rather, linear in the size of your largest file, rather than in the sum of your file sizes), that rush of mallocs to expand it the first time will be one of the slowest things you do (which also makes it much harder to do performance comparisons).

Putting it all together, here's how I'd write your program:

    import gzip
    import os

    for filename in os.listdir(input_dir):
        # os.listdir returns bare names, so join them with the directory
        path = os.path.join(input_dir, filename)
        with open(path, 'rb') as f:
            if filename.endswith(".gz"):
                # GzipFile (not gzip.open) is what accepts fileobj=
                f = gzip.GzipFile(fileobj=f)
            words = (line.split(delimiter) for line in f)
            ... my logic ...

Or, maybe:

    import contextlib
    import gzip
    import os

    for filename in os.listdir(input_dir):
        path = os.path.join(input_dir, filename)
        if filename.endswith(".gz"):
            f = gzip.open(path, 'rb')
        else:
            f = open(path, 'rb')
        with contextlib.closing(f):
            words = (line.split(delimiter) for line in f)
            ... my logic ...
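The mmap workaround listed among the three options is the only one not illustrated by the examples above. A minimal sketch follows, assuming a non-empty, uncompressed file (an empty file cannot be mapped, and mapping a .gz file would expose the compressed bytes rather than the text). The helper name count_lines_mmap is hypothetical, and counting lines stands in for real processing.

```python
import mmap


def count_lines_mmap(path):
    """Count lines by scanning a memory-mapped file instead of reading it in."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # mmap objects provide readline(), which returns b"" at end of file,
            # so iter() with a sentinel walks the mapping one line at a time.
            return sum(1 for _ in iter(mm.readline, b""))
        finally:
            mm.close()
```

The operating system pages the file in on demand, so the whole content is never copied into a Python string or list up front. Whether this beats plain iteration over the file object depends on the workload and is worth measuring rather than assuming.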
