What is the most efficient way to get first and last line of a text file?

Question

I have a text file which contains a time stamp on each line. My goal is to find the time range. All the times are in order, so the first line will be the earliest time and the last line will be the latest time. I only need the very first and very last line. What would be the most efficient way to get these lines in Python?

Note: These files are relatively large in length, about 1-2 million lines each, and I have to do this for several hundred files.

User · Accepted Answer

See the docs for the io module.

    with open(fname, 'rb') as fh:
        first = next(fh).decode()

        fh.seek(-1024, 2)
        last = fh.readlines()[-1].decode()

The variable value here is 1024: it represents the average string length. I chose 1024 only as an example. If you have an estimate of the average line length, you could just use that value times 2.

Since you have no idea whatsoever about the possible upper bound for the line length, the obvious solution would be to loop over the file:

    for line in fh:
        pass
    last = line

You don't need to bother with the binary flag here; you could just use open(fname).

ETA: Since you have many files to work on, you could create a sample of a couple of dozen files using random.sample and run this code on them to determine the length of the last line, with an a priori large value of the position shift (let's say 1 MB). This will help you to estimate the value for the full run.
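The sampling idea in the last paragraph could look roughly like the sketch below. The helper names `last_line_length` and `estimate_shift`, the sample size of 24, and the safety factor of 2 are assumptions made for illustration, not part of the answer. The shift is clamped to the file size, because seeking back further than the start of the file raises an error.

```python
import os
import random

def last_line_length(path, shift=1024 * 1024):
    """Length in bytes of the last line, read from a large window at the end."""
    with open(path, "rb") as fh:
        size = fh.seek(0, os.SEEK_END)
        fh.seek(max(0, size - shift))   # clamp so small files do not fail
        lines = fh.readlines()
    return len(lines[-1]) if lines else 0

def estimate_shift(paths, sample_size=24, safety_factor=2):
    """Estimate a seek offset from a random sample of files (hypothetical helper)."""
    sample = random.sample(list(paths), min(sample_size, len(paths)))
    longest = max((last_line_length(p) for p in sample), default=0)
    return longest * safety_factor
```

The returned value can then be used in place of 1024 for the full run, provided the sample is representative of the remaining files.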

User · Answer

    with open("myfile.txt") as f:
        lines = f.readlines()
        first_row = lines[0]
        print(first_row)
        last_row = lines[-1]
        print(last_row)

User · Answer

Here's a modified version of SilentGhost's answer that will do what you want.

    with open(fname, 'rb') as fh:
        first = next(fh)
        offs = -100
        while True:
            fh.seek(offs, 2)
            lines = fh.readlines()
            if len(lines) > 1:
                last = lines[-1]
                break
            offs *= 2
        print(first)
        print(last)

No need for an upper bound for line length here.
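One caveat with a growing window is that seeking back past the start of the file raises an error, which can happen with short or single-line files. A possible variant, sketched here under the assumption that clamping the window to the file size is acceptable, is shown below; the function name `first_and_last` and the `start` parameter are illustrative only.

```python
import os

def first_and_last(fname, start=100):
    """Return (first, last) lines as bytes; `start` is an assumed initial window."""
    with open(fname, "rb") as fh:
        first = fh.readline()
        size = fh.seek(0, os.SEEK_END)
        window = start
        while True:
            window = min(window, size)   # never seek before the start of the file
            fh.seek(size - window)
            lines = fh.readlines()
            # Two or more lines in the window mean the last one is complete;
            # so does a window that already covers the whole file.
            if len(lines) > 1 or window == size:
                last = lines[-1] if lines else first
                return first, last
            window *= 2
```

For an empty file this sketch returns a pair of empty byte strings, and for a single-line file both values are that line.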

User · Answer

Can you use unix commands? I think using head -1 and tail -n 1 are probably the most efficient methods. Alternatively, you could use a simple fid.readline() to get the first line and fid.readlines()[-1], but that may take too much memory.
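If shelling out is an option, the two commands can be called from Python. This is only a sketch: it assumes a Unix-like system with `head` and `tail` on the PATH, and the wrapper name `first_and_last_unix` is made up for this example.

```python
import subprocess

def first_and_last_unix(path):
    """Return (first, last) lines as bytes using the head and tail commands."""
    first = subprocess.run(
        ["head", "-n", "1", path], capture_output=True, check=True
    ).stdout
    last = subprocess.run(
        ["tail", "-n", "1", path], capture_output=True, check=True
    ).stdout
    return first, last
```

Starting two processes per file adds overhead, so for several hundred files it may be worth measuring this against a pure Python seek.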

User · Answer

    def last_line(filename):
        with open(filename, 'rb') as f:  # Needs to be in binary mode for the seek from the end to work
            first = f.readline()
            if f.read(1) == b'':         # Nothing after the first line: it is also the last
                return first
            f.seek(-2, 2)                # Jump to the second last byte.
            while f.read(1) != b"\n":    # Until EOL is found...
                f.seek(-2, 1)            # ...jump back the read byte plus one more.
            last = f.readline()          # Read last line.
            return last

The above answer is a modified version of the above answers which handles the case that there is only one line in the file.

User · Answer

    w = open("file.txt", "r")
    print("first line is:", w.readline())
    for line in w:
        x = line
    print("last line is:", x)
    w.close()

The for loop runs through the lines and x gets the last line on the final iteration.

User · Answer

Here is an extension of Trasp's answer that has additional logic for handling the corner case of a file that has only one line. It may be useful to handle this case if you repeatedly want to read the last line of a file that is continuously being updated. Without this, if you try to grab the last line of a file that has just been created and has only one line, IOError: [Errno 22] Invalid argument will be raised.

    def tail(filepath):
        with open(filepath, "rb") as f:
            first = f.readline()           # Read the first line.
            f.seek(-2, 2)                  # Jump to the second last byte.
            while f.read(1) != b"\n":      # Until EOL is found...
                try:
                    f.seek(-2, 1)          # ...jump back the read byte plus one more.
                except IOError:
                    f.seek(-1, 1)
                    if f.tell() == 0:
                        break
            last = f.readline()            # Read last line.
        return last

User · Answer

Nobody mentioned using reversed:

    f = open(file, "r")
    r = reversed(f.readlines())
    last_line_of_file = next(r)

User · Answer

First, open the file in read mode. Then use the readlines() method to read it line by line. All the lines are stored in a list. Now you can use list slices to get the first and last lines of the file.

    a = open('file.txt', 'rb')
    lines = a.readlines()
    if lines:
        first_line = lines[:1]   # a slice, so this is a one-element list
        last_line = lines[-1]

User · Answer

Getting the first line is trivially easy. For the last line, presuming you know an approximate upper bound on the line length, os.lseek some amount from SEEK_END, find the second-to-last line ending, and then readline() the last line.
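A rough sketch of that approach with the low-level `os` functions follows. The function name `last_line_lseek` and the default bound of 1024 bytes are assumptions made for illustration; the window is twice the bound and is clamped to the file size so the seek cannot go past the start of the file.

```python
import os

def last_line_lseek(path, max_line_len=1024):
    """Return the last line as bytes, assuming no line exceeds max_line_len."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)
        window = min(size, 2 * max_line_len)
        os.lseek(fd, -window, os.SEEK_END)   # seek "some amount" back from the end
        tail = os.read(fd, window)           # small read; assumed to return it all
    finally:
        os.close(fd)
    # Ignore one trailing newline, then find the line ending before the last line.
    body = tail[:-1] if tail.endswith(b"\n") else tail
    cut = body.rfind(b"\n")
    return tail[cut + 1:]
```

If the assumed bound is too small, the result is a truncated line rather than an error, so the bound should be chosen generously.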

User · Answer

This is my solution, compatible also with Python 3. It also manages border cases, but it lacks UTF-16 support.

    import os

    def tail(filepath):
        """
        @author Marco Sulla (marcosullaroma@gmail.com)
        @date May 31, 2016
        """

        try:
            filepath.is_file
            fp = str(filepath)
        except AttributeError:
            fp = filepath

        with open(fp, "rb") as f:
            size = os.stat(fp).st_size
            start_pos = 0 if size - 1 < 0 else size - 1

            if start_pos != 0:
                f.seek(start_pos)
                char = f.read(1)

                if char == b"\n":
                    start_pos -= 1
                    f.seek(start_pos)

                if start_pos == 0:
                    f.seek(start_pos)
                else:
                    char = ""

                    for pos in range(start_pos, -1, -1):
                        f.seek(pos)

                        char = f.read(1)

                        if char == b"\n":
                            break
                    else:
                        # No newline found: the file is a single line, so rewind.
                        f.seek(0)

            return f.readline()

It's inspired by Trasp's answer and AnotherParker's comment.

User · Answer

To read both the first and final line of a file you could open the file, read the first line using the built-in readline(), seek (move the cursor) to the end of the file, step backwards until you encounter EOL (line break), and read the last line from there.

    def readlastline(f):
        f.seek(-2, 2)              # Jump to the second last byte.
        while f.read(1) != b"\n":  # Until EOL is found ...
            f.seek(-2, 1)          # ... jump back, over the read byte plus one more.
        return f.read()            # Read all data from this point on.

    with open(file, "rb") as f:
        first = f.readline()
        last = readlastline(f)

Jump to the second last byte directly to prevent trailing newline characters from causing empty lines to be returned*.

The current offset is pushed ahead by one every time a byte is read, so the stepping backwards is done two bytes at a time: past the recently read byte and the byte to read next.

The whence parameter passed to f.seek(offset, whence=0) indicates that the seek should go to a position offset bytes relative to one of the following: 0 or os.SEEK_SET (the beginning of the file), 1 or os.SEEK_CUR (the current position), 2 or os.SEEK_END (the end of the file).

* As would be expected, since the default behavior of most applications, including print and echo, is to append one to every line written. It has no effect on lines missing a trailing newline character.

Efficiency

"1-2 million lines each and I have to do this for several hundred files."

I timed this method and compared it against the top answer.

10k iterations processing a file of 6k lines totalling 200kB: 1.62s vs 6.92s.
100 iterations processing a file of 6k lines totalling 1.3GB: 8.93s vs 86.95s.

Millions of lines would increase the difference a lot more.

Exact code used for timing:

    with open(file, "rb") as f:
        first = f.readline()     # Read and store the first line.
        for last in f: pass      # Read all lines, keep final value.
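A comparison like the one described could be set up with the standard `timeit` module. The sketch below is not the harness used for the figures above; the wrapper names `last_by_seek`, `last_by_loop` and `compare` are invented for this example, and it assumes a file with at least two lines.

```python
import timeit

def readlastline(f):
    f.seek(-2, 2)                  # Jump to the second last byte.
    while f.read(1) != b"\n":      # Until EOL is found ...
        f.seek(-2, 1)              # ... step back over the read byte plus one more.
    return f.read()

def last_by_seek(path):
    with open(path, "rb") as f:
        first = f.readline()
        return first, readlastline(f)

def last_by_loop(path):
    with open(path, "rb") as f:
        first = f.readline()
        last = first
        for last in f:             # Read all lines, keep the final value.
            pass
        return first, last

def compare(path, number=100):
    """Return (seek_seconds, loop_seconds) for `number` runs of each approach."""
    seek_time = timeit.timeit(lambda: last_by_seek(path), number=number)
    loop_time = timeit.timeit(lambda: last_by_loop(path), number=number)
    return seek_time, loop_time
```

Absolute numbers depend on the disk, the operating system cache and the file size, so the timings are best taken on files that resemble the real data.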
Amendment

A more complex, and harder to read, variation to address comments and issues raised since.

Return an empty string when parsing an empty file (raised by comment). Return all content when no delimiter is found (raised by comment). Avoid relative offsets to support text mode (raised by comment). UTF16/UTF32 hack (noted by comment).

Also adds support for multibyte delimiters: readlast(b'X<br>Y', b'<br>', fixed=False).

Please note that this variation is really slow for large files because of the non-relative offsets needed in text mode. Modify to your need, or do not use it at all, as you're probably better off using f.readlines()[-1] with files opened in text mode.

    #!/bin/python3

    from os import SEEK_END

    def readlast(f, sep, fixed=True):
        r"""Read the last segment from a file-like object.

        :param f: File to read last line from.
        :type  f: file-like object
        :param sep: Segment separator (delimiter).
        :type  sep: bytes, str
        :param fixed: Treat data in ``f`` as a chain of fixed size blocks.
        :type  fixed: bool
        :returns: Last line of file.
        :rtype: bytes, str
        """
        bs   = len(sep)
        step = bs if fixed else 1
        if not bs:
            raise ValueError("Zero-length separator.")
        try:
            o = f.seek(0, SEEK_END)
            o = f.seek(o-bs-step)     # - Ignore trailing delimiter 'sep'.
            while f.read(bs) != sep:  # - Until reaching 'sep': Read sep-sized block
                o = f.seek(o-step)    #  and then seek to the block to read next.
        except (OSError, ValueError): # - Beginning of file reached.
            f.seek(0)
        return f.read()

    def test_readlast():
        from io import BytesIO, StringIO

        # Text mode.
        f = StringIO("first\nlast\n")
        assert readlast(f, "\n") == "last\n"

        # Bytes.
        f = BytesIO(b'first|last')
        assert readlast(f, b'|') == b'last'

        # Bytes, UTF-8.
        f = BytesIO("X\nY\n".encode("utf-8"))
        assert readlast(f, b'\n').decode() == "Y\n"

        # Bytes, UTF-16.
        f = BytesIO("X\nY\n".encode("utf-16"))
        assert readlast(f, b'\n\x00').decode('utf-16') == "Y\n"

        # Bytes, UTF-32.
        f = BytesIO("X\nY\n".encode("utf-32"))
        assert readlast(f, b'\n\x00\x00\x00').decode('utf-32') == "Y\n"

        # Multichar delimiter.
        f = StringIO("X<br>Y")
        assert readlast(f, "<br>", fixed=False) == "Y"

        # Make sure you use the correct delimiters.
        seps = { 'utf8': b'\n', 'utf16': b'\n\x00', 'utf32': b'\n\x00\x00\x00' }
        assert "\n".encode('utf8' )     == seps['utf8']
        assert "\n".encode('utf16')[2:] == seps['utf16']
        assert "\n".encode('utf32')[4:] == seps['utf32']

        # Edge cases.
        edges = (
            # Text , Match
            (""    , ""  ), # Empty file, empty string.
            ("X"   , "X" ), # No delimiter, full content.
            ("\n"  , "\n"),
            ("\n\n", "\n"),
            # UTF16/32 encoded U+270A (b"\n\x00\n'\n\x00"/utf16)
            (b'\n\xe2\x9c\x8a\n'.decode(), b'\xe2\x9c\x8a\n'.decode()),
        )
        for txt, match in edges:
            for enc, sep in seps.items():
                assert readlast(BytesIO(txt.encode(enc)), sep).decode(enc) == match

    if __name__ == "__main__":
        import sys
        for path in sys.argv[1:]:
            with open(path) as f:
                print(f.readline()     , end="")
                print(readlast(f, "\n"), end="")
