I need to extract the last line from a number of very large (several hundred megabyte) text files to get certain data. Currently, I am using Python to cycle through all the lines until the file is exhausted and then I process the last line returned, but I am certain there is a more efficient way to do this.
What is the best way to retrieve just the last line of a text file using Python?
Use the file's seek method with a negative offset and whence=os.SEEK_END to read a block from the end of the file. Search that block for the last line end character(s) and grab all the characters after it. If there is no line end, back up farther and repeat the process.
import os

def last_line(in_file, block_size=1024, ignore_ending_newline=False):
    # in_file must be opened in binary mode ('rb'): Python 3 rejects
    # nonzero end-relative seeks on text-mode files. The result is bytes.
    suffix = b""
    in_file.seek(0, os.SEEK_END)
    in_file_length = in_file.tell()
    seek_offset = 0
    while -seek_offset < in_file_length:
        # Read from end.
        seek_offset -= block_size
        if -seek_offset > in_file_length:
            # Limit if we ran out of file (can't seek backward from start).
            block_size -= -seek_offset - in_file_length
            if block_size == 0:
                break
            seek_offset = -in_file_length
        in_file.seek(seek_offset, os.SEEK_END)
        buf = in_file.read(block_size)
        # Search for line end.
        if ignore_ending_newline and seek_offset == -block_size and buf.endswith(b'\n'):
            buf = buf[:-1]
        pos = buf.rfind(b'\n')
        if pos != -1:
            # Found line end.
            return buf[pos+1:] + suffix
        suffix = buf + suffix
    # One-line file.
    return suffix
Note that this will not work on things that don't support seek, like stdin or sockets. In those cases, you're stuck reading the whole thing (like the tail command does).
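For the non-seekable case, here is a minimal sketch of the read-everything fallback. It assumes the stream can be iterated line by line; the helper name last_line_of_stream is made up for illustration and is not part of the answer above.

```python
import io
from collections import deque

def last_line_of_stream(stream):
    """Return the last line of an iterable of lines, or None if it is empty."""
    # deque(maxlen=1) consumes the whole stream but retains only the most
    # recent item, so earlier lines are discarded as soon as they are read.
    tail = deque(stream, maxlen=1)
    return tail[0] if tail else None

# An in-memory stream stands in for stdin or a socket file here.
demo = io.StringIO("first\nsecond\nthird\n")
print(last_line_of_stream(demo))
```

This still reads every byte of the input, so it only helps with memory use, not with I/O time.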
The inefficiency here is not really due to Python, but to the nature of how files are read. The only way to find the last line is to read file data and look for the line endings. However, the seek operation may be used to skip to any byte offset in the file. You can therefore begin very close to the end of the file and read chunks progressively further back until the last line ending is found:
from os import SEEK_END

def get_last_line(file):
    # `file` must be opened in binary mode ('rb'); the result is bytes.
    CHUNK_SIZE = 1024 # Would be good to make this the chunk size of the filesystem
    file.seek(0, SEEK_END)
    remaining = file.tell() # Bytes not yet examined, counted from the start
    last_line = b""
    while remaining > 0:
        # We grab chunks from the end of the file towards the beginning until we
        # get a new line. The read is clamped so we never seek before the start.
        size = min(CHUNK_SIZE, remaining)
        remaining -= size
        file.seek(remaining)
        chunk = file.read(size)
        if not last_line and chunk.endswith(b'\n'):
            # Ignore the trailing newline at the end of the file (but include it
            # in the output).
            last_line = b'\n'
            chunk = chunk[:-1]
        nl_pos = chunk.rfind(b'\n')
        # What's being searched for will have to be modified if you are searching
        # files with non-unix line endings.
        last_line = chunk[nl_pos + 1:] + last_line
        if nl_pos != -1:
            return last_line
        # Otherwise the whole chunk is part of the last line; keep going.
    # The whole file is one big line
    return last_line
Not the most straightforward way, but probably much faster than a simple Python implementation (on systems that provide the tail command):
import subprocess
line = subprocess.check_output(['tail', '-n', '1', filename])  # returns bytes in Python 3
lines = fileHandle.readlines()  # note: reads the entire file into memory
fileHandle.close()
last_line = lines[-1]
Seek to the end of the file minus 100 bytes or so. Do a read and search for a newline. If there is no newline, seek back another 100 bytes or so. Lather, rinse, repeat. Eventually you'll find a newline. The last line begins immediately after that newline.
In the best case, you do only one read of 100 bytes.
If you can pick a reasonable maximum line length, you can seek to nearly the end of the file before you start reading.
myfile.seek(-max_line_length, os.SEEK_END)
line = myfile.readlines()[-1]
If you do know the maximal length of a line, you can do
import os

def getLastLine(fname, maxLineLength=80):
    with open(fname, "rb") as fp:
        # Start maxLineLength + 1 bytes before the end, clamped to the file start.
        fp.seek(0, os.SEEK_END)
        fp.seek(max(fp.tell() - maxLineLength - 1, 0))
        return fp.readlines()[-1]
This works on my Windows machine, but I do not know what happens on other platforms if you open a text file in binary mode. Binary mode is needed if you want to seek() relative to the end of the file.
with open('output.txt', 'r') as f:
    lines = f.read().splitlines()
    last_line = lines[-1]
    print(last_line)
Could you load the file into a mmap, then use mmap.rfind(string[, start[, end]]) to find the second-to-last EOL character in the file? A seek to just past that point should put you at the start of the last line, I would think.
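A minimal sketch of that mmap idea, assuming Unix-style "\n" line endings and a file small enough to map into the address space. The function name last_line_mmap is made up for illustration.

```python
import mmap
import os

def last_line_mmap(path):
    """Return the last line of the file at `path` as bytes, without its newline."""
    with open(path, "rb") as f:
        # An empty file cannot be memory-mapped, so handle it up front.
        if os.fstat(f.fileno()).st_size == 0:
            return b""
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            end = len(mm)
            # Skip one trailing newline so it is not mistaken for the separator.
            if mm[end - 1:end] == b"\n":
                end -= 1
            # rfind returns -1 when no newline is found, which makes start 0.
            start = mm.rfind(b"\n", 0, end) + 1
            return mm[start:end]
```

The operating system pages in only the parts of the file that rfind actually touches, so for a file with a short last line this should read little more than the final page.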
Here's a slightly different solution. Instead of multi-line, I focused on just the last line, and instead of a constant block size, I have a dynamic (doubling) block size. See comments for more info.
# Get last line of a text file using seek method. Works with non-constant block size.
# IDK if that speeds things up, but it's good enough for us,
# especially with constant line lengths in the file (provided by len_guess),
# in which case the block size doubling is not performed much if at all. Currently,
# we're using this on a textfile format with constant line lengths.
# Requires that the file is opened up in binary mode. No nonzero end-rel seeks in text mode.
REL_FILE_END = 2
def lastTextFileLine(file, len_guess=1):
    file.seek(0, REL_FILE_END)
    j = file.tell()               # j = number of bytes not yet searched (initially the file size)
    if j == 0:                    # empty file: there is no last line
        return ''
    file.seek(-1, REL_FILE_END)   # -1 => 1 byte back from end of file
    text = file.read(1)
    tot_sz = 0                    # store total size so we know where to seek to next rel file end
    if text == b'\n':             # if newline is the last character, we want the text right before it
        tot_sz = 1
        j -= 1
    blocks = []                   # For storing successive search blocks, so that we don't end up searching in the already searched
    not_done = True
    block_sz = max(len_guess, 1)  # a zero block size would never make progress
    while not_done:
        if j <= block_sz:         # in case our block doubling takes us past the start of the file (here j also = length of file remainder)
            block_sz = j
            not_done = False
        tot_sz += block_sz
        file.seek(-tot_sz, REL_FILE_END)  # Yes, seek() works with negative numbers for seeking backward from file end
        text = file.read(block_sz)
        i = text.rfind(b'\n')
        if i != -1:
            blocks.append(text[i+1:])     # keep only the part after the newline
            break
        blocks.append(text)
        j -= block_sz             # bytes still unsearched
        block_sz <<= 1            # double block size (converge with open ended binary search-like strategy)
    # Blocks were collected back to front; if a newline was never found, this is everything read.
    return b''.join(reversed(blocks)).decode()  # assumes the default UTF-8 encoding
Ideally, you'd wrap this in a class LastTextFileLine and keep track of a moving average of line lengths. This would give you a good len_guess maybe.
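A rough sketch of what such a wrapper might look like. The class name LastLineReader, the alpha parameter, and the use of an exponential moving average (rather than a windowed one) are all assumptions made for illustration; it carries its own small backward reader so that it stands alone.

```python
import io
import os

class LastLineReader:
    """Hypothetical helper that adapts its block-size guess between calls."""

    def __init__(self, len_guess=64, alpha=0.25):
        self.len_guess = float(len_guess)  # running estimate of line length
        self.alpha = alpha                 # weight given to each new observation

    def read(self, f):
        """Return the last line of binary file `f` as bytes, without its newline."""
        f.seek(0, os.SEEK_END)
        end = f.tell()
        if end == 0:
            return b""
        f.seek(end - 1)
        if f.read(1) == b"\n":
            end -= 1                       # ignore one trailing newline
        pos = end
        # One extra byte so a line of the guessed length also covers the
        # newline that precedes it.
        block = max(1, int(self.len_guess) + 1)
        parts = []
        while pos > 0:
            size = min(block, pos)         # never read before the file start
            pos -= size
            f.seek(pos)
            chunk = f.read(size)
            nl = chunk.rfind(b"\n")
            if nl != -1:
                parts.append(chunk[nl + 1:])
                break
            parts.append(chunk)
            block *= 2                     # double the block size on a miss
        line = b"".join(reversed(parts))
        # Exponential moving average of the line lengths seen so far.
        self.len_guess += self.alpha * (len(line) - self.len_guess)
        return line

reader = LastLineReader(len_guess=4, alpha=0.5)
print(reader.read(io.BytesIO(b"aa\nbbbb\ncccccc\n")))
```

Reusing one reader across many files with similar line lengths lets the first read usually cover the whole last line, which is the point of tracking the average.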
#!/usr/bin/python
# First pass: print every line and count them.
count = 0
f = open('last_line1','r')
for line in f.readlines():
    line = line.strip()
    count = count + 1
    print(line)
print(count)
f.close()
# Second pass: print the line whose number equals the total count.
count1 = 0
h = open('last_line1','r')
for line in h.readlines():
    line = line.strip()
    count1 = count1 + 1
    if count1 == count:
        print(line) #-------------------- this is the last line
h.close()
Source: Stackoverflow.com