How can I read large text files in Python, line by line, without loading them into memory?

Question

I need to read a large file, line by line. Let's say the file is larger than 5 GB and I need to read each line, but obviously I do not want to use readlines() because it will create a very large list in memory.

How will the code below work for this case? Is xreadlines itself reading one line at a time into memory? Is the generator expression needed?

f = (line for line in open("log.txt").xreadlines())  # how much is loaded in memory?

f.next()

Plus, what can I do to read this in reverse order, just like the Linux tail command?

I found:

http://code.google.com/p/pytailer/

and

"python head, tail and backward read by lines of a text file"

Both worked very well!

User · Answer

The blaze project has come a long way over the last 6 years. It has a simple API covering a useful subset of pandas features.

dask.dataframe takes care of chunking internally, supports many parallelisable operations and allows you to export slices back to pandas easily for in-memory operations.

import dask.dataframe as dd

df = dd.read_csv('filename.csv')
df.head(10)  # return first 10 rows
df.tail(10)  # return last 10 rows

# iterate rows
for idx, row in df.iterrows():
    ...

# group by my_field and return mean
df.groupby(df.my_field).value.mean().compute()

# slice by column
df[df.my_field=='XYZ'].compute()

User · Answer

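The best solution I found for this, which I tried on a 330 MB file:

lineno = 500
line_length = 8
with open('catfour.txt', 'r') as file:
    file.seek(lineno * (line_length + 2))
    print(file.readline(), end='')

Here line_length is the number of characters in a single line. For example, "abcd" has a line length of 4. I added 2 to line_length to skip the line ending and land on the first character of the next line. That matches a two-byte line ending such as '\r\n'; with a bare '\n' the offset per line would be line_length + 1. Note that this only works when every line in the file has the same length.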
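The offset arithmetic above can be wrapped in a small helper. This is a minimal sketch, not the answerer's code: the function name `read_fixed_width_line` and its `newline_length` parameter are made up for illustration, and it assumes every line has exactly the same length in a single-byte encoding such as ASCII.

```python
def read_fixed_width_line(path, lineno, line_length, newline_length=1):
    """Return line `lineno` (0-based) from a file whose lines all have the same length.

    Hypothetical helper: `newline_length` is 1 for "\n" endings and 2 for "\r\n".
    """
    record_size = line_length + newline_length  # bytes per line, terminator included
    with open(path, "rb") as f:  # binary mode, so offsets are exact byte counts
        f.seek(lineno * record_size)  # jump straight to the start of the wanted line
        return f.read(line_length).decode("ascii")
```

Only the requested line is read, so the cost does not depend on the file size. If the lines vary in length, the offsets no longer line up and the file has to be scanned instead.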

User · Answer

How about this? Divide your file into chunks and then read it line by line, because when you read a file, your operating system caches the data that follows. If you are reading the file line by line, you are not making efficient use of the cached information.

Instead, divide the file into chunks, load a whole chunk into memory, and then do your processing.

import os

def chunks(fh, size=1024):
    while True:
        startat = fh.tell()  # current position, measured from the start of the file
        fh.seek(size, 1)  # jump ahead by size bytes relative to the current position
        data = fh.readline()  # then read on to the end of the current line
        yield startat, fh.tell() - startat  # (offset, length); no list is kept in memory
        if not data:
            break

fname = 'log.txt'  # path to the large file
if os.path.isfile(fname):
    try:
        with open(fname, 'rb') as fh:
            for ele in chunks(fh):
                fh.seek(ele[0])  # startat
                data = fh.read(ele[1])  # chunk length; every chunk ends on a line boundary
                print(data)
    except IOError as e:  # for example, permission denied
        print("I/O error({0}): {1}".format(e.errno, e.strerror))
    except Exception as e1:  # handle other exceptions such as attribute errors
        print("Unexpected error: {0}".format(e1))
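The same seek and tell calls can also walk a file from the end, which is one way to approach the reverse-order part of the question. The sketch below is not from any answer here: `reverse_lines` is a hypothetical name, and it assumes "\n" line endings (a "\r" from Windows files would stay attached to each line).

```python
import os

def reverse_lines(path, chunk_size=4096, encoding="utf-8"):
    """Yield the lines of a file from last to first, without their newline."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()  # start at the end of the file
        if position == 0:
            return  # empty file: nothing to yield
        carry = b""  # start of a line whose beginning lies in an earlier chunk
        at_end = True
        while position > 0:
            read_size = min(chunk_size, position)
            position -= read_size
            f.seek(position)
            chunk = f.read(read_size)
            if at_end:
                # Ignore the newline that terminates the last line of the file.
                if chunk.endswith(b"\n"):
                    chunk = chunk[:-1]
                at_end = False
            pieces = (chunk + carry).split(b"\n")
            carry = pieces.pop(0)  # may be incomplete; finish it with the next chunk
            for piece in reversed(pieces):
                yield piece.decode(encoding)
        yield carry.decode(encoding)  # the first line of the file
```

Only one chunk plus one partial line is held in memory at a time. Lines are decoded only once they are complete, so a multi-byte UTF-8 character split across two chunks is not a problem.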

User · Answer

I couldn't believe that it could be as easy as @john-la-rooy's answer made it seem. So, I recreated the cp command using line-by-line reading and writing. It's CRAZY FAST.

#!/usr/bin/env python3.6

import sys

with open(sys.argv[2], 'w') as outfile:
    with open(sys.argv[1]) as infile:
        for line in infile:
            outfile.write(line)

User · Answer

Please try this:

with open('filename', 'r', buffering=100000) as f:
    for line in f:
        print(line)

User · Answer

An old school approach:

fh = open(file_name, 'rt')
line = fh.readline()
while line:
    # do stuff with line
    line = fh.readline()
fh.close()

User · Answer

I demonstrated a parallel, byte-level random access approach in this other question: "Getting number of lines in a text file without readlines".

Some of the answers already provided are nice and concise, and I like some of them. But it really depends on what you want to do with the data in the file. In my case I just wanted to count lines as fast as possible on big text files. My code can of course be modified to do other things too, like any code.
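The linked answer is not reproduced on this page, so the following is only a single-threaded sketch of the byte-level idea, not that answer's parallel code. The name `count_lines` and the 1 MiB default chunk size are illustrative choices.

```python
def count_lines(path, chunk_size=1024 * 1024):
    """Count newline bytes in a file while holding at most one chunk in memory."""
    count = 0
    with open(path, "rb") as f:  # binary mode: no decoding, just raw bytes
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # an empty bytes object means end of file
                break
            count += chunk.count(b"\n")
    return count
```

A final line that lacks a trailing newline is not counted by this sketch.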

User · Answer

Thank you! I recently converted to Python 3 and have been frustrated by using readlines(0) to read large files. This solved the problem. But to get each line, I had to do a couple of extra steps. Each line was preceded by a b', which I guess means it was in binary format. Using decode('utf-8') changed it to ASCII.

Then I had to remove a "=\n" in the middle of each line.

Then I split the lines at the newline.

import binascii  # at the top of the script

b_data = fh.read(ele[1])  # endat: one chunk of ASCII data in binary format
a_data = binascii.b2a_qp(b_data).decode('utf-8')  # data chunk in 'split' ASCII format
data_chunk = a_data.replace('=\n', '').strip()  # splitting characters removed
data_list = data_chunk.split('\n')  # list containing the lines in the chunk
# print(data_list, '\n')
# time.sleep(1)
for j in range(len(data_list)):  # iterate through data_list to get each item
    i += 1  # running line counter; set i = 0 before the chunk loop
    line_of_data = data_list[j]
    print(line_of_data)

This is the code starting just above the line that prints data in Arohi's code.

User · Answer

All you need to do is use the file object as an iterator.

for line in open("log.txt"):
    do_something_with(line)

Even better is using a context manager in recent Python versions.

with open("log.txt") as fileobject:
    for line in fileobject:
        do_something_with(line)

This will automatically close the file as well.
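Because the file object is already a lazy iterator, no generator expression or xreadlines call is needed around it. As a small sketch of what that allows (the helper names `first_lines` and `longest_line_length` are made up for illustration, and UTF-8 is an assumed encoding):

```python
from itertools import islice

def first_lines(path, n):
    """Return the first n lines; reading stops once they have been consumed."""
    with open(path, encoding="utf-8") as f:
        return list(islice(f, n))

def longest_line_length(path):
    """Scan the whole file while keeping only one line in memory at a time."""
    longest = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            longest = max(longest, len(line.rstrip("\n")))
    return longest
```

In both functions memory use is governed by the longest single line and the read buffer rather than by the size of the file.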

User · Answer

Here's what you do if you don't have newlines in the file:

with open('large_text.txt') as f:
    while True:
        c = f.read(1024)
        if not c:
            break
        print(c)

User · Answer

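This might be useful when you want to work in parallel and read only chunks of data, but keep each chunk cleanly aligned on newlines.

def readInChunks(fileObj, chunkSize=1024):
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        # extend the chunk one character at a time until it ends on a newline
        while data[-1:] != '\n':
            extra = fileObj.read(1)
            if not extra:  # end of file reached without a trailing newline
                break
            data += extra
        yield data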

User · Answer

Here's the code for loading text files of any size without causing memory issues. It supports gigabyte-sized files.

https://gist.github.com/iyvinjose/e6c1cb2821abd5f01fd1b9065cbc759d

Download the file data_loading_utils.py and import it into your code.

Usage:

import data_loading_utils

file_name = 'file_name.ext'
CHUNK_SIZE = 1000000

def process_lines(data, eof, file_name):
    # check if end of file reached
    if not eof:
        # process data; data is one single line of the file
        pass
    else:
        # end of file reached
        pass

data_loading_utils.read_lines_from_file_as_data_chunks(file_name, chunk_size=CHUNK_SIZE, callback=process_lines)

process_lines is the callback function. It will be called for every line, with the parameter data holding one single line of the file at a time.

You can adjust the variable CHUNK_SIZE depending on your machine's hardware configuration.

User · Answer

I provided this answer because Keith's, while succinct, doesn't close the file explicitly.

with open("log.txt") as infile:
    for line in infile:
        do_something_with(line)
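The same with-block pattern also covers a simple tail-style read when only the last few lines matter. A minimal sketch, assuming UTF-8 text; the function name `tail` is illustrative and not taken from any answer here:

```python
from collections import deque

def tail(path, n=10):
    """Return the last n lines of a file, keeping at most n lines in memory."""
    with open(path, encoding="utf-8") as f:
        # A deque with maxlen discards the oldest line each time a new one arrives.
        return list(deque(f, maxlen=n))
```

This still reads the whole file once from the start, so memory stays bounded but the running time grows with the file size.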

User · Answer

f = open('filename', 'r').read()  # note: read() loads the whole file into memory
f1 = f.split('\n')
for i in range(len(f1)):
    do_something_with(f1[i])

Hope this helps.

User · Answer

You are better off using an iterator instead. Relevant: http://docs.python.org/library/fileinput.html

From the docs:

import fileinput
for line in fileinput.input("filename"):
    process(line)

This will avoid copying the whole file into memory at once.
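One thing fileinput adds over a plain open call is that it can chain several files into a single line-by-line stream and report where each line came from. The sketch below illustrates that; `grep_files` is a hypothetical helper name, and the files are read with the platform's default encoding.

```python
import fileinput

def grep_files(paths, needle):
    """Yield (filename, line number within that file, line) for each matching line."""
    with fileinput.input(files=paths) as stream:
        for line in stream:  # lines arrive one at a time across all files
            if needle in line:
                yield stream.filename(), stream.filelineno(), line.rstrip("\n")
```

Iterate the generator to the end (for example with a for loop) so the with block can close the stream; the module-level fileinput.input() allows only one active stream at a time.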
