How to jump to a particular line in a huge text file

Question

Are there any alternatives to the code below?

    startFromLine = 141978  # or whatever line I need to jump to

    urlsfile = open(filename, "rb", 0)

    linesCounter = 1

    for line in urlsfile:
        if linesCounter > startFromLine:
            DoSomethingWithThisLine(line)

        linesCounter += 1

If I'm processing a huge text file (~15MB) with lines of unknown but different length, and need to jump to a particular line whose number I know in advance, I feel bad processing them one by one when I know I could ignore at least the first half of the file. I'm looking for a more elegant solution, if there is one.

User · Answer

You don't really have that many options if the lines are of different length: you sadly need to process the line ending characters to know when you've progressed to the next line.

You can, however, dramatically speed this up AND reduce memory usage by changing the last parameter to "open" to something other than 0.

0 means the file reading operation is unbuffered, which is very slow and disk intensive. 1 means the file is line buffered, which would be an improvement. Anything above 1 (say 8k, i.e. 8192, or higher) reads chunks of the file into memory. You still access it through for line in open(etc):, but Python only reads a bit at a time, discarding each buffered chunk after it's processed.
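As a rough illustration of the buffering advice above, the sketch below opens the file in binary mode with a 64 KiB buffer and skips ahead with a plain loop. The function name lines_from and the buffer size are illustrative choices, not part of the question.

```python
def lines_from(filename, start_line, buffer_size=64 * 1024):
    """Yield lines (as bytes) starting at 1-based line number start_line."""
    # A buffering value above 1 makes Python read the file in chunks of
    # roughly that many bytes instead of issuing small unbuffered reads.
    with open(filename, "rb", buffering=buffer_size) as f:
        for line_number, line in enumerate(f, start=1):
            if line_number >= start_line:
                yield line
```

Every line before start_line is still read and discarded; the larger buffer only makes that skipping cheaper.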

User · Answer

If you don't want to read the entire file into memory, you may need to come up with some format other than plain text.

Of course it all depends on what you're trying to do, and how often you will jump across the file.

For instance, if you're going to be jumping to lines many times in the same file, and you know that the file does not change while you are working with it, you can do this: first, pass through the whole file and record the "seek location" of some key line numbers (such as every 1000 lines). Then, if you want line 12005, jump to the position of line 12000 (which you've recorded), read 5 lines, and you'll know you're at line 12005, and so on.
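The checkpoint idea above could be sketched roughly as follows. The function names and the 0-based line numbering are assumptions made for the example, and the file is opened in binary mode so that the recorded offsets are plain byte positions.

```python
def build_sparse_index(path, step=1000):
    """Return the byte offsets of lines 0, step, 2*step, ... (0-based)."""
    checkpoints = []
    offset = 0
    with open(path, "rb") as f:
        for line_number, line in enumerate(f):
            if line_number % step == 0:
                checkpoints.append(offset)
            offset += len(line)
    return checkpoints


def get_line(path, checkpoints, line_number, step=1000):
    """Return 0-based line line_number as bytes, or None if it is past EOF."""
    block, remainder = divmod(line_number, step)
    if block >= len(checkpoints):
        return None
    with open(path, "rb") as f:
        f.seek(checkpoints[block])   # jump to the nearest earlier checkpoint
        for _ in range(remainder):   # then read at most step - 1 lines
            f.readline()
        line = f.readline()
    return line or None
```

The index costs one full pass over the file, after which each lookup reads at most step lines. It is only valid while the file stays unchanged.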

User · Answer

None of the answers are particularly satisfactory, so here's a small snippet to help.

    class LineSeekableFile:
        def __init__(self, seekable):
            self.fin = seekable
            self.line_map = list()  # Map from line index -> file position.
            self.line_map.append(0)
            while seekable.readline():
                self.line_map.append(seekable.tell())

        def __getitem__(self, index):
            # NOTE: This assumes that you're not reading the file sequentially.
            # For that, just use 'for line in file'.
            self.fin.seek(self.line_map[index])
            return self.fin.readline()

Example usage:

    In: !cat /tmp/test.txt

    Out:
    Line zero.
    Line one!

    Line three.
    End of file, line four.

    In:
    with open("/tmp/test.txt", 'rt') as fin:
        seeker = LineSeekableFile(fin)
        print(seeker[1])

    Out:
    Line one!

This involves doing a lot of file seeks, but is useful for the cases where you can't fit the whole file in memory. It does one initial read to get the line locations (so it does read the whole file, but doesn't keep it all in memory), and then each access does a file seek after the fact.

I offer the snippet above under the MIT or Apache license at the discretion of the user.

User · Answer

I am surprised no one mentioned islice:

    import itertools

    line = next(itertools.islice(Fhandle, index_of_interest, index_of_interest + 1), None)  # just the one line

or if you want the whole rest of the file:

    rest_of_file = itertools.islice(Fhandle, index_of_interest, None)
    for line in rest_of_file:
        print(line)

or if you want every other line from the file:

    rest_of_file = itertools.islice(Fhandle, index_of_interest, None, 2)
    for odd_line in rest_of_file:
        print(odd_line)

User · Answer

Since there is no way to determine the length of all lines without reading them, you have no choice but to iterate over all lines before your starting line. All you can do is make it look nice. If the file is really huge, then you might want to use a generator-based approach:

    from itertools import dropwhile

    def iterate_from_line(f, start_from_line):
        return (l for i, l in dropwhile(lambda x: x[0] < start_from_line, enumerate(f)))

    # Text mode cannot be unbuffered in Python 3, so no buffering argument here.
    for line in iterate_from_line(open(filename, "r"), 141978):
        DoSomethingWithThisLine(line)

Note: the index is zero based in this approach.

User · Answer

linecache:

The linecache module allows one to get any line from a Python source file, while attempting to optimize internally, using a cache, the common case where many lines are read from a single file. This is used by the traceback module to retrieve source lines for inclusion in the formatted traceback.
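A minimal sketch of how linecache might be used here. The wrapper name get_line_cached is made up for the example; linecache.getline itself is the standard library call. Line numbers are 1-based, and the module reads the whole file into memory on first access, decoding it the way Python source files are decoded (UTF-8 unless the file declares otherwise).

```python
import linecache


def get_line_cached(filename, lineno):
    """Return 1-based line lineno of filename, or '' if it does not exist."""
    # linecache.getline never raises for a missing file or line;
    # it returns an empty string instead.
    return linecache.getline(filename, lineno)
```

Call linecache.clearcache() when the cached lines are no longer needed, or linecache.checkcache(filename) if the file may have changed on disk.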

User · Answer

I'm probably spoiled by abundant RAM, but 15 MB is not huge. Reading into memory with readlines() is what I usually do with files of this size. Accessing a line after that is trivial.

User · Answer

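You may use mmap to find the offset of the lines. mmap seems to be the fastest way to process a file.

Example:

    import mmap

    with open("input_file", "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        # Read past the lines before the target; the map position is then
        # the byte offset at which the target line (1-based) starts.
        for _ in range(line_i_want_to_jump - 1):
            mapped.readline()
        offset = mapped.tell()

Then use f.seek(offset), while the file is still open, to move to the line you need.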

User · Answer

Do the lines themselves contain any index information? If the content of each line was something like "<line index>:Data", then the seek() approach could be used to do a binary search through the file, even if the amount of Data is variable. You'd seek to the midpoint of the file, read a line, check whether its index is higher or lower than the one you want, etc.

Otherwise, the best you can do is just readlines(). If you don't want to read all 15MB, you can use the sizehint argument to at least replace a lot of readline()s with a smaller number of calls to readlines().
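The binary search described above could look roughly like this. It is only a sketch under several assumptions that are not in the original post: the file is non-empty, every line has the form `<integer index>:<data>`, the indices are sorted in ascending order, and the function name find_indexed_line is hypothetical. It uses mmap rather than seek() so that the start of the line containing the midpoint can be found by searching backwards for a newline.

```python
import mmap


def find_indexed_line(path, target):
    """Binary-search a file of b'<index>:<data>' lines sorted by index.

    Returns the matching line as bytes (without its newline),
    or None if no line carries that index.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            lo, hi = 0, len(mm)  # the target line lies within bytes [lo, hi)
            while lo < hi:
                mid = (lo + hi) // 2
                # Find the boundaries of the line that contains byte mid.
                start = mm.rfind(b"\n", 0, mid) + 1
                end = mm.find(b"\n", start)
                if end == -1:
                    end = len(mm)
                line = mm[start:end]
                index = int(line.split(b":", 1)[0])
                if index == target:
                    return line
                if index < target:
                    lo = end + 1   # continue in the lines after this one
                else:
                    hi = start     # continue in the lines before this one
    return None
```

Each step halves the byte range, so only a logarithmic number of lines is inspected, regardless of how long the Data part of each line is.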

User · Answer

You can't jump ahead without reading in the file at least once, since you don't know where the line breaks are. You could do something like:

    # Read in the file once and build a list of line offsets.
    # Open the file in binary mode so that len(line) is a byte count.
    line_offset = []
    offset = 0
    for line in file:
        line_offset.append(offset)
        offset += len(line)
    file.seek(0)

    # Now, to skip to line n (with the first line being line 0), just do
    file.seek(line_offset[n])

User · Answer

What generates the file you want to process? If it is something under your control, you could generate an index (which line is at which position) at the time the file is appended to. The index file can be of fixed line size (space padded or 0 padded numbers) and will definitely be smaller, and thus can be read and processed quickly.

- Which line do you want?
- Calculate the byte offset of the corresponding line number in the index file (possible because the line size of the index file is constant).
- Use seek or whatever to jump directly to that line of the index file.
- Parse it to get the byte offset of the corresponding line of the actual file.
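The steps above can be sketched as follows. The record layout (12 zero-padded digits plus a newline) and the function names are assumptions made for the example. For simplicity the index is built in a separate pass here, whereas the answer suggests writing it while the data file is being appended to.

```python
RECORD_WIDTH = 12               # digits per offset, zero padded
RECORD_SIZE = RECORD_WIDTH + 1  # plus the trailing newline


def build_index_file(data_path, index_path):
    """Write one fixed-width record per data line: that line's byte offset."""
    offset = 0
    with open(data_path, "rb") as data, open(index_path, "wb") as index:
        for line in data:
            record = str(offset).zfill(RECORD_WIDTH).encode("ascii")
            index.write(record + b"\n")
            offset += len(line)


def read_line(data_path, index_path, line_number):
    """Return 0-based line line_number of the data file, or None."""
    with open(index_path, "rb") as index:
        # Records have constant size, so the position is simple arithmetic.
        index.seek(line_number * RECORD_SIZE)
        record = index.read(RECORD_WIDTH)
    if len(record) < RECORD_WIDTH:
        return None             # the index has no entry for that line
    with open(data_path, "rb") as data:
        data.seek(int(record))
        return data.readline()
```

A lookup then costs two seeks and two short reads, and nothing has to be kept in memory between lookups.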

User · Answer

You can use this function to return line n:

    def skipton(infile, n):
        with open(infile, 'r') as fi:
            # Skip the first n - 1 lines, then return line n (1-based).
            for i in range(n - 1):
                next(fi)
            return next(fi)

User · Answer

If you know in advance the position in the file (rather than the line number), you can use file.seek() to go to that position.

Edit: you can use the linecache.getline(filename, lineno) function, which will return the contents of the line lineno, but only after reading the entire file into memory. Good if you're randomly accessing lines from within the file (as Python itself might want to do to print a traceback), but not good for a 15MB file.

User · Answer

Here's an example using readlines(sizehint) to read a chunk of lines at a time. DNS pointed out that solution. I wrote this example because the other examples here are single-line oriented.

    def getlineno(filename, lineno):
        if lineno < 1:
            raise ValueError("First line is line 1")
        with open(filename) as f:
            lines_read = 0
            while True:
                # Read roughly 100000 bytes' worth of complete lines.
                lines = f.readlines(100000)
                if not lines:
                    return None
                if lines_read + len(lines) >= lineno:
                    return lines[lineno - lines_read - 1]
                lines_read += len(lines)

    print(getlineno("nci_09425001_09450000.smi", 12000))

User · Answer

I have had the same problem (needing to retrieve a specific line from a huge file).

Of course, I could run through all the records in the file every time and stop when the counter equals the target line, but that is inefficient when you want to fetch several specific lines. So the main issue to resolve was how to jump directly to the required place in the file.

I found the following solution. First, I filled a dictionary with the start position of each line (the key is the line number, and the value is the cumulative length of the previous lines).

    # Binary mode, so that len(each) is a byte count usable with seek().
    t = open(file, 'rb')
    dict_pos = {}

    kolvo = 0
    length = 0
    for each in t:
        dict_pos[kolvo] = length
        length = length + len(each)
        kolvo = kolvo + 1

Finally, the target function:

    def give_line(line_number):
        t.seek(dict_pos.get(line_number))
        line = t.readline()
        return line

The t.seek(...) call moves the file position to the start of the requested line, so the next readline() returns your target line.

Using this approach I saved a significant amount of time.

User · Answer

If you're dealing with a text file and are on a Linux-based system, you could use the Linux commands. For me, this worked well!

    # subprocess.getoutput replaces the Python 2-only commands.getoutput.
    import subprocess

    def read_line(path, line=1):
        return subprocess.getoutput('head -%s %s | tail -1' % (line, path))

    line_to_jump = 141978
    read_line("path_to_large_text_file", line_to_jump)
