How to read a large file line by line

Question

I want to read a file line by line, but without completely loading it into memory. My file is too large to open in memory, and if I try to do so I always get out-of-memory errors. The file size is 1 GB.

User · Answer

To strip newlines:

with open(file_path, 'rU') as f:
    for line_terminated in f:
        line = line_terminated.rstrip('\n')
        ...

With universal newline support all text file lines will seem to be terminated with '\n', whatever the terminators in the file: '\r', '\n', or '\r\n'.

EDIT - To specify universal newline support:

Python 2 on Unix - open(file_path, mode='rU') - required [thanks @Dave]
Python 2 on Windows - open(file_path, mode='rU') - optional
Python 3 - open(file_path, newline=None) - optional

The newline parameter is only supported in Python 3 and defaults to None. The mode parameter defaults to 'r' in all cases. The U is deprecated in Python 3 (and was removed in Python 3.11). In Python 2 on Windows some other mechanism appears to translate \r\n to \n.

Docs: open() for Python 2, open() for Python 3

To preserve native line terminators:

with open(file_path, 'rb') as f:
    for line_native_terminated in f:
        ...

Binary mode can still parse the file into lines with in. Each line will have whatever terminators it has in the file.

Thanks to @katrielalex's answer, Python's open() doc, and iPython experiments.
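For Python 3 specifically, the two cases above might be sketched as follows. This is only an illustration, not part of the original answer: the helper names `iter_stripped_lines` and `iter_raw_lines` and the UTF-8 default are assumptions.

```python
def iter_stripped_lines(file_path, encoding="utf-8"):
    """Yield each line without its terminator.

    newline=None is the Python 3 default and enables universal newlines,
    so '\r', '\n' and '\r\n' all arrive as '\n' before being stripped.
    """
    with open(file_path, "r", encoding=encoding, newline=None) as f:
        for line_terminated in f:
            yield line_terminated.rstrip("\n")


def iter_raw_lines(file_path):
    """Yield each line as bytes, keeping whatever terminator the file has.

    Binary mode splits on b'\n' only, so a lone b'\r' does not end a line.
    """
    with open(file_path, "rb") as f:
        for line_native_terminated in f:
            yield line_native_terminated
```

Both helpers are generators, so only one line is held in memory at a time; the file is closed once iteration finishes.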

User · Answer

The obvious answer wasn't there in all the responses. PHP has a neat streaming delimiter parser available, made for exactly that purpose.

$fp = fopen("/path/to/the/file", "r+");
while (($line = stream_get_line($fp, 1024 * 1024, "\n")) !== false) {
    echo $line;
}
fclose($fp);

User · Answer

if ($file = fopen("file.txt", "r")) {
    while (!feof($file)) {
        $line = fgets($file);
        // do same stuff with the $line
    }
    fclose($file);
}

User · Answer

Some context up front as to where I am coming from. Code snippets are at the end.

When I can, I prefer to use an open source tool like H2O to do super high performance parallel CSV file reads, but this tool is limited in feature set. I end up writing a lot of code to create data science pipelines before feeding to H2O cluster for the supervised learning proper.

I have been reading files like 8GB HIGGS dataset from UCI repo and even 40GB CSV files for data science purposes significantly faster by adding lots of parallelism with the multiprocessing library's pool object and map function. For example clustering with nearest neighbor searches and also DBSCAN and Markov clustering algorithms requires some parallel programming finesse to bypass some seriously challenging memory and wall clock time problems.

I usually like to break the file row-wise into parts using gnu tools first and then glob-filemask them all to find and read them in parallel in the python program. I use something like 1000+ partial files commonly. Doing these tricks helps immensely with processing speed and memory limits.

pandas.read_csv is single-threaded, so you can use these tricks to make pandas considerably faster by running a map() for parallel execution. You can use htop to see that with plain old sequential pd.read_csv, 100% CPU on just one core is the actual bottleneck, not the disk at all.

I should add I'm using an SSD on fast video card bus, not a spinning HD on SATA6 bus, plus 16 CPU cores.

Also, another technique that I discovered works great in some applications is parallel CSV file reads all within one giant file, starting each worker at different offset into the file, rather than pre-splitting one big file into many part files. Use python's file seek() and tell() in each parallel worker to read the big text file in strips, at different byte offset start-byte and end-byte locations in the big file, all at the same time concurrently. You can do a regex findall on the bytes, and return the count of linefeeds. This is a partial sum. Finally sum up the partial sums to get the global sum when the map function returns after the workers finished.

The following are some example benchmarks using the parallel byte offset trick:

I use 2 files: HIGGS.csv is 8 GB. It is from the UCI machine learning repository. all_bin.csv is 40.4 GB and is from my current project. I use 2 programs: the GNU wc program which comes with Linux, and the pure Python fastread.py program which I developed.

HP-Z820:/mnt/fastssd/fast_file_reader$ ls -l /mnt/fastssd/nzv/HIGGS.csv
-rw-rw-r-- 1 8035497980 Jan 24 16:00 /mnt/fastssd/nzv/HIGGS.csv

HP-Z820:/mnt/fastssd$ ls -l all_bin.csv
-rw-rw-r-- 1 40412077758 Feb  2 09:00 all_bin.csv

ga@ga-HP-Z820:/mnt/fastssd$ time python fastread.py --fileName="all_bin.csv" --numProcesses=32 --balanceFactor=2
2367496

real    0m8.920s
user    1m30.056s
sys 2m38.744s

In [1]: 40412077758. / 8.92
Out[1]: 4530501990.807175

That’s some 4.5 GB/s, or about 36 Gb/s, file slurping speed. That ain’t no spinning hard disk, my friend. That’s actually a Samsung Pro 950 SSD.

Below is the speed benchmark for the same file being line-counted by gnu wc, a pure C compiled program.

What is cool is that my pure Python program essentially matched the speed of the compiled GNU wc C program in this case. Python is interpreted but C is compiled, so this is a pretty interesting feat of speed, I think you would agree. Of course, wc really needs to be changed to a parallel program, and then it would really beat the socks off my Python program. But as it stands today, GNU wc is just a sequential program. You do what you can, and Python can do parallel today. Cython compiling might be able to help me (for some other time). Also, memory-mapped files have not been explored yet.

HP-Z820:/mnt/fastssd$ time wc -l all_bin.csv
2367496 all_bin.csv

real    0m8.807s
user    0m1.168s
sys 0m7.636s


HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=16 --balanceFactor=2
11000000

real    0m2.257s
user    0m12.088s
sys 0m20.512s

HP-Z820:/mnt/fastssd/fast_file_reader$ time wc -l HIGGS.csv
11000000 HIGGS.csv

real    0m1.820s
user    0m0.364s
sys 0m1.456s

Conclusion: The speed is good for a pure Python program compared to a C program. However, it’s not good enough to prefer the pure Python program over the C program, at least for line counting. The technique can generally be used for other file processing, so this Python code is still useful.

Question: Does compiling the regex just once and passing it to all workers improve speed? Answer: Regex pre-compiling does NOT help in this application. I suppose the reason is that the overhead of process serialization and creation for all the workers dominates.

One more thing. Does parallel CSV file reading even help? Is the disk the bottleneck, or is it the CPU? Many so-called top-rated answers on stackoverflow contain the common dev wisdom that you only need one thread to read a file, best you can do, they say. Are they sure, though?

Let’s find out:

HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=16 --balanceFactor=2
11000000

real    0m2.256s
user    0m10.696s
sys 0m19.952s

HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=1 --balanceFactor=1
11000000

real    0m17.380s
user    0m11.124s
sys 0m6.272s

Oh yes, yes it does. Parallel file reading works quite well. Well there you go!

Ps. In case some of you wanted to know, what if the balanceFactor was 2 when using a single worker process? Well, it’s horrible:

HP-Z820:/mnt/fastssd/fast_file_reader$ time python fastread.py --fileName="HIGGS.csv" --numProcesses=1 --balanceFactor=2
11000000

real    1m37.077s
user    0m12.432s
sys 1m24.700s

Key parts of the fastread.py python program:

import re
from itertools import repeat
from multiprocessing import Pool
from os import stat

fileBytes = stat(fileName).st_size  # Read quickly from OS how many bytes are in a text file
startByte, endByte = PartitionDataToWorkers(workers=numProcesses, items=fileBytes, balanceFactor=balanceFactor)
p = Pool(numProcesses)
partialSum = p.starmap(ReadFileSegment, zip(startByte, endByte, repeat(fileName))) # startByte is already a list. fileName is made into a same-length list of duplicates values.
globalSum = sum(partialSum)
print(globalSum)


def ReadFileSegment(startByte, endByte, fileName, searchChar=b'\n'):  # counts number of searchChar appearing in the byte range
    with open(fileName, 'rb') as f:  # binary mode, so offsets and read sizes are true byte counts
        f.seek(startByte-1)  # offsets are treated as 1-based and inclusive here: seek(5) points at the 6th byte.
        data = f.read(endByte - startByte + 1)
        cnt = len(re.findall(searchChar, data)) # findall with implicit compiling runs just as fast here as re.compile once + re.finditer many times.
    return cnt

The def for PartitionDataToWorkers is just ordinary sequential code. I left it out in case someone else wants to get some practice on what parallel programming is like. I gave away for free the harder parts: the tested and working parallel code, for your learning benefit.
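For readers who would rather start from something complete, here is one possible way to fill in the missing partitioning step. It is a sketch under my own assumptions, not the author's fastread.py: the names `partition_byte_ranges`, `count_newlines` and `parallel_line_count` are hypothetical, the ranges are 0-based and half-open (unlike the 1-based offsets above), and the `balance_factor` argument is only my guess at what balanceFactor means (number of ranges per worker).

```python
import os
from itertools import repeat
from multiprocessing import Pool


def partition_byte_ranges(file_bytes, workers, balance_factor=1):
    """Split [0, file_bytes) into about workers * balance_factor half-open ranges."""
    parts = max(1, workers * balance_factor)
    step = max(1, -(-file_bytes // parts))  # ceiling division
    starts = list(range(0, file_bytes, step))
    ends = [min(start + step, file_bytes) for start in starts]
    return starts, ends


def count_newlines(start, end, file_name, block_size=1 << 20):
    """Count b'\\n' bytes in [start, end), reading block_size bytes at a time."""
    count = 0
    with open(file_name, "rb") as f:
        f.seek(start)
        remaining = end - start
        while remaining > 0:
            block = f.read(min(block_size, remaining))
            if not block:
                break  # file was shorter than expected
            count += block.count(b"\n")
            remaining -= len(block)
    return count


def parallel_line_count(file_name, workers=4, balance_factor=2):
    """Sum the per-range newline counts computed by a pool of worker processes."""
    file_bytes = os.stat(file_name).st_size
    starts, ends = partition_byte_ranges(file_bytes, workers, balance_factor)
    with Pool(workers) as pool:
        partial_sums = pool.starmap(count_newlines, zip(starts, ends, repeat(file_name)))
    return sum(partial_sums)
```

Counting newline bytes needs no alignment to line boundaries, which is why the ranges can be cut at arbitrary offsets. Call `parallel_line_count(...)` from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Reading each range in 1 MB blocks, instead of in one `read()`, keeps per-worker memory bounded; that is a design choice of this sketch and may change the timings reported above.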

Thanks to: The open-source H2O project, by Arno and Cliff and the H2O staff for their great software and instructional videos, which have provided me the inspiration for this pure python high performance parallel byte offset reader as shown above. H2O does parallel file reading using java, is callable by python and R programs, and is crazy fast, faster than anything on the planet at reading big CSV files.

User · Answer

Two memory efficient ways in ranked order (first is best) -

use of with - supported from python 2.5 and above
use of yield if you really want to have control over how much to read

1. use of `with`

`with` is the nice and efficient Pythonic way to read large files. Advantages: 1) the file object is automatically closed after exiting the with block; 2) the file is closed even if an exception is raised inside the with block; 3) the for loop iterates through the file object f line by line, and internally it does buffered I/O (to optimize costly I/O operations) and memory management.

with open("x.txt") as f:
    for line in f:
        do_something(line)

2. use of `yield`

Sometimes one might want more fine-grained control over how much to read in each iteration. In that case use iter and yield. Note that with this method one needs to close the file explicitly at the end.

def readInChunks(fileObj, chunkSize=2048):
    """
    Lazy function to read a file piece by piece.
    Default chunk size: 2kB.

    """
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        yield data

f = open('bigFile')
for chunk in readInChunks(f):
    do_something(chunk)
f.close()
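A variant that combines the two ideas, so the file is closed automatically, could look like the sketch below. It is an illustration only; the name `read_in_chunks` and the choice of binary mode are my assumptions. The two-argument form `iter(callable, sentinel)` calls `f.read(chunk_size)` repeatedly until it returns the sentinel `b""`.

```python
from functools import partial


def read_in_chunks(path, chunk_size=2048):
    """Yield successive blocks of at most chunk_size bytes from the file at path."""
    with open(path, "rb") as f:
        # iter() stops as soon as f.read(chunk_size) returns b"" (end of file).
        for chunk in iter(partial(f.read, chunk_size), b""):
            yield chunk
```

Usage is the same as before: `for chunk in read_in_chunks('bigFile'): do_something(chunk)`. The file is closed when the generator is exhausted; if the loop is abandoned early, the close happens only when the generator is closed or garbage-collected.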

Pitfalls, for the sake of completeness: the methods below are not as good or as elegant for reading large files, but please read on to get a rounded understanding.

In Python, the most common way to read lines from a file is to do the following:

for line in open('myfile','r').readlines():
    do_something(line)

When this is done, however, the readlines() function (same applies for read() function) loads the entire file into memory, then iterates over it. A slightly better approach (the first mentioned two methods above are the best) for large files is to use the fileinput module, as follows:

import fileinput

for line in fileinput.input(['myfile']):
    do_something(line)

The fileinput.input() call reads lines sequentially, but doesn't keep them in memory after they've been read. Or, even more simply, iterate over the file object directly, since a file in Python is iterable.

User · Answer

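<?php
echo "<meta charset='utf-8'>";

$k = 1;
$f = 1;
$fp = fopen("texttranslate.txt", "r");
while (!feof($fp)) {
    $contents = '';
    for ($i = 1; $i <= 1500; $i++) {
        $line = fgets($fp);  // read each line once, then both echo and store it
        if ($line === false) {
            break;
        }
        echo $k . ' -- ' . $line . '<br>';
        $k++;
        $contents .= $line;
    }
    echo '<hr>';
    file_put_contents('Split/new_file_' . $f . '.txt', $contents);
    $f++;
}
?>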

User · Answer

From the python documentation for fileinput.input():

This iterates over the lines of all files listed in sys.argv[1:], defaulting to sys.stdin if the list is empty

further, the definition of the function is:

fileinput.FileInput([files[, inplace[, backup[, mode[, openhook]]]]])

reading between the lines, this tells me that files can be a list so you could have something like:

for each_line in fileinput.input([input_file, input_file]):
  do_something(each_line)

User · Answer

@Katrielalex provided the way to open & read one file.

However, the way your algorithm goes, it reads the whole file for each line of the file. That means the overall amount of reading a file - and computing the Levenshtein distance - will be done N*N times if N is the number of lines in the file. Since you're concerned about file size and don't want to keep it in memory, I am concerned about the resulting quadratic runtime. Your algorithm is in the O(n^2) class of algorithms, which often can be improved with specialization.

I suspect that you already know the tradeoff of memory versus runtime here, but maybe you would want to investigate if there's an efficient way to compute multiple Levenshtein distances in parallel. If so, it would be interesting to share your solution here.

How many lines do your files have, and on what kind of machine (mem & cpu power) does your algorithm have to run, and what's the tolerated runtime?

Code would look like:

with open(input_file, 'r') as f_outer:
    for line_outer in f_outer:
        with open(input_file, 'r') as f_inner:
            for line_inner in f_inner:
                compute_distance(line_outer, line_inner)

But the questions are how do you store the distances (matrix?), and can you gain an advantage by preparing e.g. the outer line for processing, or caching some intermediate results for reuse.

User · Answer

If you're opening a big file, you probably want to use generators alongside fgets() to avoid loading the whole file into memory:

/**
 * @return Generator
 */
$fileData = function() {
    $file = fopen(__DIR__ . '/file.txt', 'r');

    if (!$file) {
        die('file does not exist or cannot be opened');
    }

    while (($line = fgets($file)) !== false) {
        yield $line;
    }

    fclose($file);
};

Use it like this:

foreach ($fileData() as $line) {
    // $line contains current line
}

This way you can process individual file lines inside the foreach().

Note: Generators require >= PHP 5.5

User · Answer

Be careful with the 'while (!feof ... fgets()' stuff: fgets can get an error (returning false) and loop forever without reaching the end of file. codaddict was closest to being correct, but when your 'while fgets' loop ends, check feof; if it is not true, then you had an error.

User · Answer

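This is how I manage with very big files (tested with up to 100G). And it's faster than fgets().

$block = 1024 * 1024; // 1MB, or could be anything higher than HDD block_size*2
if ($fh = fopen("file.txt", "r")) {
    $left = '';
    while (!feof($fh)) { // read the file
        $temp = fread($fh, $block);
        $fgetslines = explode("\n", $temp);
        $fgetslines[0] = $left . $fgetslines[0];
        // the last piece may be an incomplete line: keep it for the next block
        if (!feof($fh)) $left = array_pop($fgetslines);
        foreach ($fgetslines as $k => $line) {
            // do smth with $line
        }
    }
    fclose($fh);
}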
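The same block-and-carry-over idea can be sketched in Python. This is an illustration of the technique rather than a port of the answer's code; the name `iter_lines_by_block` and the 1 MB default are assumptions.

```python
def iter_lines_by_block(path, block_size=1024 * 1024):
    """Yield lines as bytes (without the trailing b"\\n") by reading fixed-size blocks."""
    leftover = b""
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            pieces = (leftover + block).split(b"\n")
            # The last piece may be an incomplete line: carry it into the next block.
            leftover = pieces.pop()
            for line in pieces:
                yield line
    # Whatever is left after the final block is the last line (if the file
    # did not end with a newline).
    if leftover:
        yield leftover
```

Memory use is bounded by the block size plus the length of the longest line, since only one block and one partial line are held at a time.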

User · Answer

Need to frequently read a large file, resuming from the last position read?

I have created a script used to cut an Apache access.log file several times a day. So I needed to set a position cursor on the last line parsed during the last execution. To this end, I used the file.seek() and file.tell() methods, which allow the cursor position in the file to be stored and restored.

My code :

import os

ENCODING = "utf8"
CURRENT_FILE_DIR = os.path.dirname(os.path.abspath(__file__))

# This file is used to store the last cursor position
cursor_position = os.path.join(CURRENT_FILE_DIR, "access_cursor_position.log")

# Log file with new lines
log_file_to_cut = os.path.join(CURRENT_FILE_DIR, "access.log")
cut_file = os.path.join(CURRENT_FILE_DIR, "cut_access", "cut.log")

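# Load the cursor position saved by the previous run (start of file if none)
from_position = 0
try:
    with open(cursor_position, "r", encoding=ENCODING) as f:
        from_position = int(f.read())
except (OSError, ValueError):
    # No saved position yet, or the saved value is unreadable: start from 0
    pass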

# We read log_file_to_cut to put new lines in cut_file
with open(log_file_to_cut, "r", encoding=ENCODING) as f:
    with open(cut_file, "w", encoding=ENCODING) as fw:
        # We set cursor to the last position used (during last run of script)
        f.seek(from_position)
        for line in f:
            fw.write("%s" % (line))

    # We save the last position of cursor for next usage
    with open(cursor_position, "w", encoding=ENCODING) as fw:
        fw.write(str(f.tell()))

User · Answer

foreach (new SplFileObject(__FILE__) as $line) {
    echo $line;
}

User · Answer

The correct, fully Pythonic way to read a file is the following:

with open(...) as f:
    for line in f:
        # Do something with 'line'

The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered I/O and memory management so you don't have to worry about large files.

There should be one -- and preferably only one -- obvious way to do it.

User · Answer

One of the popular solutions to this question will have issues with the new line character. It can be fixed pretty easily with a simple str_replace.

$handle = fopen("some_file.txt", "r");
if ($handle) {
    while (($line = fgets($handle)) !== false) {
        $line = str_replace("\n", "", $line);
    }
    fclose($handle);
}

User · Answer

You can use the fgets() function to read the file line by line:

$handle = fopen("inputfile.txt", "r");
if ($handle) {
    while (($line = fgets($handle)) !== false) {
        // process the line read.
    }

    fclose($handle);
} else {
    // error opening the file.
}

User · Answer

You can use an object oriented interface class for a file - SplFileObject http://php.net/manual/en/splfileobject.fgets.php (PHP 5 >= 5.1.0)

<?php

$file = new SplFileObject("file.txt");

// Loop until we reach the end of the file.
while (!$file->eof()) {
    // Echo one line from the file.
    echo $file->fgets();
}

// Unset the file to call __destruct(), closing the file handle.
$file = null;

User · Answer

I would strongly recommend not using the default file loading as it is horrendously slow. You should look into the numpy functions and the IOpro functions (e.g. numpy.loadtxt()).

http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html

https://store.continuum.io/cshop/iopro/

Then you can break your pairwise operation into chunks:

import numpy as np
import math

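lines_total = n  # n: total number of lines in the file (placeholder)
similarity = np.zeros((n, n))  # the shape must be passed as a tuple
lines_per_chunk = m  # m: chunk size of your choice (placeholder)
n_chunks = int(math.ceil(float(n) / m))
for i in range(n_chunks):
    for j in range(n_chunks):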
        chunk_i = (function of your choice to read lines i*lines_per_chunk to (i+1)*lines_per_chunk)
        chunk_j = (function of your choice to read lines j*lines_per_chunk to (j+1)*lines_per_chunk)
        similarity[i*lines_per_chunk:(i+1)*lines_per_chunk,
                   j*lines_per_chunk:(j+1)*lines_per_chunk] = fast_operation(chunk_i, chunk_j)

It's almost always much faster to load data in chunks and then do matrix operations on it than to do it element by element!!
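One possible stand-in for the "function of your choice to read lines" placeholder is sketched below. It is an assumption on my part, not part of the answer: `read_line_chunk` is a hypothetical name, and it uses only the standard library (`itertools.islice`) rather than numpy or IOPro.

```python
from itertools import islice


def read_line_chunk(path, chunk_index, lines_per_chunk):
    """Return lines [chunk_index * lines_per_chunk, (chunk_index + 1) * lines_per_chunk)."""
    start = chunk_index * lines_per_chunk
    with open(path, "r") as f:
        # islice consumes and discards the first `start` lines lazily, so only
        # the requested chunk is ever held in memory.
        return [line.rstrip("\n") for line in islice(f, start, start + lines_per_chunk)]
```

Note that each call scans the file from the beginning up to the requested chunk, so later chunks cost more to fetch. If that matters, a byte-offset index of chunk starts (built once with `tell()`) would avoid the rescans.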

User · Answer

Using a text file for the example:

with open('yourFile.txt', 'r') as f:
    text = f.readlines()
for line in text:
    print(line)

Open your file for reading (r)
Read the whole file and save each line into a list (text)
Loop through the list printing each line.

If you want, for example, to check a specific line for a length greater than 10, work with what you already have available.

for line in text:
    if len(line) > 10:
        print(line)

User · Answer

SplFileObject is useful when it comes to dealing with large files.

function parse_file($filename)
{
    try {
        $file = new SplFileObject($filename);
    } catch (LogicException $exception) {
        die('SplFileObject : ' . $exception->getMessage());
    }
    while ($file->valid()) {
        $line = $file->fgets();
        // do something with $line
    }

    // don't forget to free the file handle.
    $file = null;
}

User · Answer

The best way to read a large file line by line is to use the Python enumerate function:

with open(file_name, "rU") as read_file:  # on Python 3, use "r" instead of "rU"
    for i, row in enumerate(read_file, 1):
        # do something
        # i is the line number of that line
        # row contains all the data of that line

User · Answer

Function to read with array return:

function read_file($filename = '') {
    $buffer = array();
    $source_file = fopen($filename, "r") or die("Couldn't open $filename");
    while (!feof($source_file)) {
        $buffer[] = fread($source_file, 4096);  // use a buffer of 4KB
    }
    fclose($source_file);
    return $buffer;
}

User · Answer

Use buffering techniques to read the file.

$filename = "test.txt";
$source_file = fopen($filename, "r") or die("Couldn't open $filename");
while (!feof($source_file)) {
    $buffer = fread($source_file, 4096);  // use a buffer of 4KB
    $buffer = str_replace($old, $new, $buffer);
}

User · Answer

There is a file() function that returns an array of the lines contained in the file.

foreach (file('myfile.txt') as $line) {
    echo $line . "\n";
}

User · Answer

This is a possible way of reading a file in Python:

f = open(input_file)
for line in f:
    do_stuff(line)
f.close()

It does not allocate a full list. It iterates over the lines.
