Lazy Method for Reading Big File in Python

Question

I have a very big file (4 GB), and when I try to read it my computer hangs. So I want to read it piece by piece and, after processing each piece, store the processed piece in another file and read the next piece.

Is there any method to yield these pieces? I would love to have a lazy method.

User · Answer

If your computer, OS and Python are 64-bit, then you can use the mmap module to map the contents of the file into memory and access it with indices and slices. Here is an example from the documentation (shown in its Python 3 form, which works on bytes):

    import mmap

    # assumes hello.txt already contains b"Hello Python!\n"
    with open("hello.txt", "r+b") as f:
        # memory-map the file, size 0 means whole file
        mm = mmap.mmap(f.fileno(), 0)
        # read content via standard file methods
        print(mm.readline())  # prints b"Hello Python!\n"
        # read content via slice notation
        print(mm[:5])  # prints b"Hello"
        # update content using slice notation;
        # note that new content must have same size
        mm[6:] = b" world!\n"
        # ... and read again using standard file methods
        mm.seek(0)
        print(mm.readline())  # prints b"Hello  world!\n"
        # close the map
        mm.close()

If either your computer, OS or Python is 32-bit, then mmap-ing large files can reserve large parts of your address space and starve your program of memory.
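For the piece-by-piece processing the question asks about, a memory map can also be sliced lazily. The following is only a minimal sketch: `iter_mmap_chunks` and its default chunk size are hypothetical names and values chosen for illustration, and the file is mapped read-only.

```python
import mmap


def iter_mmap_chunks(path, chunk_size=64 * 1024):
    """Yield successive byte slices of a memory-mapped file.

    Hypothetical helper for illustration. Note that mmap raises
    ValueError for an empty file, so callers may want to check the size first.
    """
    with open(path, "rb") as f:
        # length 0 maps the whole file; ACCESS_READ keeps the mapping read-only
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for start in range(0, len(mm), chunk_size):
                # slicing copies only this window into a bytes object
                yield mm[start:start + chunk_size]
```

Each slice is an ordinary `bytes` object, so only one chunk is copied into Python-level memory at a time; the operating system decides which pages of the file are actually resident.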

User · Answer

I'm in a somewhat similar situation. It's not clear whether you know the chunk size in bytes; I usually don't, but the number of records (lines) that is required is known:

    def get_line():
        with open('4gb_file') as file:
            for i in file:
                yield i

    lines_required = 100
    gen = get_line()
    chunk = [i for i, j in zip(gen, range(lines_required))]

Update: Thanks nosklo. Here's what I meant. It almost works, except that it loses a line "between" chunks.

    chunk = [next(gen) for i in range(lines_required)]

Does the trick without losing any lines, but it doesn't look very nice.
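One way to take a fixed number of lines at a time without dropping a line between chunks is `itertools.islice`, which stops consuming as soon as it has enough items. This is a sketch of that alternative, not the answerer's code; `line_chunks` and `lines_per_chunk` are hypothetical names.

```python
from itertools import islice


def line_chunks(path, lines_per_chunk=100):
    """Yield lists of at most `lines_per_chunk` lines (hypothetical helper)."""
    with open(path) as f:
        while True:
            # islice pulls exactly lines_per_chunk items (or fewer at EOF),
            # so no line is consumed and then thrown away
            chunk = list(islice(f, lines_per_chunk))
            if not chunk:
                return
            yield chunk
```

The last chunk is simply shorter when the number of lines is not a multiple of `lines_per_chunk`, and the loop ends cleanly at end of file instead of raising `StopIteration` from a bare `next()` call.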

User · Answer

In Python 3.8+, you can use .read() in a while loop:

    with open("somefile.txt") as f:
        while chunk := f.read(8192):
            do_something(chunk)

Of course, you can use any chunk size you want; you don't have to use 8192 (2**13) bytes. Unless your file's size happens to be a multiple of your chunk size, the last chunk will be smaller than your chunk size.

User · Answer

    # f = file-like object, i.e. supporting read(size) function and
    #     returning empty string '' when there is nothing to read

    def chunked(file, chunk_size):
        return iter(lambda: file.read(chunk_size), '')

    for data in chunked(f, 65536):
        # process the data

UPDATE: The approach is best explained in https://stackoverflow.com/a/4566523/38592
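The sentinel passed to `iter` has to match what `read()` returns at end of file: `''` for text mode and `b''` for binary mode. With a mismatched sentinel the loop never ends. The sketch below makes the sentinel an explicit parameter (an assumption added for illustration, not part of the answer) and uses in-memory streams so it runs without a real file.

```python
import io


def chunked(file, chunk_size, sentinel):
    """Call file.read(chunk_size) until it returns `sentinel`."""
    return iter(lambda: file.read(chunk_size), sentinel)


# Text streams return '' at end of file ...
text_chunks = list(chunked(io.StringIO("abcdef"), 4, ""))

# ... while binary streams return b''.
byte_chunks = list(chunked(io.BytesIO(b"abcdef"), 4, b""))
```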

User · Answer

I am not allowed to comment due to my low reputation, but SilentGhost's solution should be much easier with file.readlines([sizehint]) (Python file methods).

Edit: SilentGhost is right, but this should be better than the following (Python 2 code):

    s = ""
    for i in xrange(100):
        s += file.next()

User · Answer

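You can use the following code:

    import os

    file_obj = open('big_file', 'rb')  # open() returns a file object
    # then use os.stat to get the size in bytes
    file_size = os.stat('big_file').st_size
    # round up so that the final, shorter block is read too
    for i in range((file_size + 1023) // 1024):
        print(file_obj.read(1024))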

User · Answer

To write a lazy function, just use yield:

    def read_in_chunks(file_object, chunk_size=1024):
        """Lazy function (generator) to read a file piece by piece.
        Default chunk size: 1k."""
        while True:
            data = file_object.read(chunk_size)
            if not data:
                break
            yield data

    with open('really_big_file.dat') as f:
        for piece in read_in_chunks(f):
            process_data(piece)

Another option would be to use iter and a helper function:

    f = open('really_big_file.dat')
    def read1k():
        return f.read(1024)

    for piece in iter(read1k, ''):
        process_data(piece)

If the file is line-based, the file object is already a lazy generator of lines:

    for line in open('really_big_file.dat'):
        process_data(line)
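The question also asks for each processed piece to be stored in another file. A minimal sketch of that full loop, built on the generator above, could look like this; `process_file`, `transform` and the 1 MiB default are hypothetical choices for illustration, and both files are opened in binary mode.

```python
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy generator that reads a file piece by piece."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


def process_file(src_path, dst_path, transform, chunk_size=1024 * 1024):
    """Read src_path chunk by chunk and write transform(chunk) to dst_path.

    Hypothetical helper: `transform` is any function taking and returning bytes.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for piece in read_in_chunks(src, chunk_size):
            # only one chunk is held in memory at a time
            dst.write(transform(piece))
```

For example, `process_file("really_big_file.dat", "out.dat", bytes.upper)` would write an upper-cased copy. Note that a transform which depends on context spanning two chunks (such as a multi-byte pattern) needs extra care at chunk boundaries.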

User · Answer

file.readlines() takes an optional size argument that approximates the total size of the lines returned (not the number of lines); reading stops once the lines read so far exceed that size.

    bigfile = open('bigfilename', 'r')
    tmp_lines = bigfile.readlines(BUF_SIZE)
    while tmp_lines:
        process([line for line in tmp_lines])
        tmp_lines = bigfile.readlines(BUF_SIZE)

User · Answer

To process line by line, this is an elegant solution:

    def stream_lines(file_name):
        file = open(file_name)
        while True:
            line = file.readline()
            if not line:
                file.close()
                break
            yield line

Blank lines do not stop it early: readline() returns '\n' for a blank line and an empty string only at end of file.

User · Answer

Refer to Python's official documentation: https://docs.python.org/3/library/functions.html#iter

Maybe this method is more Pythonic:

    from functools import partial

    # A file object returned by open() has a read method that accepts
    # the block size to read on each call.
    # Binary mode is required so that read() returns b'' at end of file,
    # matching the sentinel passed to iter().
    with open('mydata.db', 'rb') as f_in:
        part_read = partial(f_in.read, 1024*1024)
        iterator = iter(part_read, b'')

        for index, block in enumerate(iterator, start=1):
            block = process_block(block)  # process your block data

            with open(f'{index}.txt', 'wb') as f_out:
                f_out.write(block)

User · Answer

There are already many good answers, but if your entire file is on a single line and you still want to process "rows" (as opposed to fixed-size blocks), these answers will not help you.

99% of the time, it is possible to process files line by line. Then, as suggested in this answer, you can use the file object itself as a lazy generator:

    with open('big.csv') as f:
        for line in f:
            process(line)

However, I once ran into a very, very big (almost) single-line file, where the row separator was in fact not '\n' but '|'.

Reading line by line was not an option, but I still needed to process it row by row. Converting '|' to '\n' before processing was also out of the question, because some of the fields of this csv contained '\n' (free-text user input). Using the csv library was also ruled out because, at least in early versions of the lib, it is hardcoded to read the input line by line.

For these kinds of situations, I created the following snippet:

    def rows(f, chunksize=1024, sep='|'):
        """
        Read a file where the row separator is '|' lazily.

        Usage:

        >>> with open('big.csv') as f:
        >>>     for r in rows(f):
        >>>         process(r)
        """
        curr_row = ''
        while True:
            chunk = f.read(chunksize)
            if chunk == '':  # End of file
                yield curr_row
                break
            while True:
                i = chunk.find(sep)
                if i == -1:
                    break
                yield curr_row + chunk[:i]
                curr_row = ''
                chunk = chunk[i+1:]
            curr_row += chunk

I was able to use it successfully to solve my problem. It has been extensively tested, with various chunk sizes.

Test suite, for those who want to convince themselves:

    import os

    test_file = 'test_file'

    def cleanup(func):
        def wrapper(*args, **kwargs):
            func(*args, **kwargs)
            os.unlink(test_file)
        return wrapper

    @cleanup
    def test_empty(chunksize=1024):
        with open(test_file, 'w') as f:
            f.write('')
        with open(test_file) as f:
            assert len(list(rows(f, chunksize=chunksize))) == 1

    @cleanup
    def test_1_char_2_rows(chunksize=1024):
        with open(test_file, 'w') as f:
            f.write('|')
        with open(test_file) as f:
            assert len(list(rows(f, chunksize=chunksize))) == 2

    @cleanup
    def test_1_char(chunksize=1024):
        with open(test_file, 'w') as f:
            f.write('a')
        with open(test_file) as f:
            assert len(list(rows(f, chunksize=chunksize))) == 1

    @cleanup
    def test_1025_chars_1_row(chunksize=1024):
        with open(test_file, 'w') as f:
            for i in range(1025):
                f.write('a')
        with open(test_file) as f:
            assert len(list(rows(f, chunksize=chunksize))) == 1

    @cleanup
    def test_1024_chars_2_rows(chunksize=1024):
        with open(test_file, 'w') as f:
            for i in range(1023):
                f.write('a')
            f.write('|')
        with open(test_file) as f:
            assert len(list(rows(f, chunksize=chunksize))) == 2

    @cleanup
    def test_1025_chars_1026_rows(chunksize=1024):
        with open(test_file, 'w') as f:
            for i in range(1025):
                f.write('|')
        with open(test_file) as f:
            assert len(list(rows(f, chunksize=chunksize))) == 1026

    @cleanup
    def test_2048_chars_2_rows(chunksize=1024):
        with open(test_file, 'w') as f:
            for i in range(1022):
                f.write('a')
            f.write('|')
            f.write('a')
            # -- end of 1st chunk --
            for i in range(1024):
                f.write('a')
            # -- end of 2nd chunk
        with open(test_file) as f:
            assert len(list(rows(f, chunksize=chunksize))) == 2

    @cleanup
    def test_2049_chars_2_rows(chunksize=1024):
        with open(test_file, 'w') as f:
            for i in range(1022):
                f.write('a')
            f.write('|')
            f.write('a')
            # -- end of 1st chunk --
            for i in range(1024):
                f.write('a')
            # -- end of 2nd chunk
            f.write('a')
        with open(test_file) as f:
            assert len(list(rows(f, chunksize=chunksize))) == 2

    if __name__ == '__main__':
        for chunksize in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]:
            test_empty(chunksize)
            test_1_char_2_rows(chunksize)
            test_1_char(chunksize)
            test_1025_chars_1_row(chunksize)
            test_1024_chars_2_rows(chunksize)
            test_1025_chars_1026_rows(chunksize)
            test_2048_chars_2_rows(chunksize)
            test_2049_chars_2_rows(chunksize)

User · Answer

I think we can write it like this:

    def read_file(path, block_size=1024):
        with open(path, 'rb') as f:
            while True:
                piece = f.read(block_size)
                if piece:
                    yield piece
                else:
                    return

    for piece in read_file(path):
        process_piece(piece)
