[linux] Count lines in large files

I commonly work with text files of ~20 GB in size, and I find myself counting the number of lines in a given file very often.

The way I do it now is just cat fname | wc -l, and it takes very long. Is there any solution that would be much faster?

I work on a high-performance cluster with Hadoop installed. I was wondering if a map-reduce approach could help.

I'd like the solution to be as simple as a one-line run, like the wc -l solution, but I'm not sure how feasible it is.

Any ideas?

This question is related to linux mapreduce

The answer is


Hadoop essentially provides a mechanism to perform something similar to what @Ivella is suggesting.

Hadoop's HDFS (Hadoop Distributed File System) will take your 20 GB file and store it across the cluster in blocks of a fixed size. Let's say you configure the block size to be 128 MB; the file would then be split into 160 (20 x 8) blocks of 128 MB each.

You would then run a map reduce program over this data, essentially counting the lines for each block (in the map stage) and then reducing these block line counts into a final line count for the entire file.

As for performance: in general, the bigger your cluster, the better the performance (more wc's running in parallel over more independent disks), but there is some overhead in job orchestration, which means that running the job on smaller files will not actually yield quicker throughput than a local wc.
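As a single-machine sketch of the same map/reduce shape (purely illustrative, assuming GNU parallel is installed and bigfile.txt is a placeholder name), you can count each 128 MB block in parallel and then sum the per-block counts:

parallel --pipepart --block 128M -a bigfile.txt wc -l | paste -sd+ - | bc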


I know the question is a few years old now, but expanding on Ivella's last idea, this bash script estimates the line count of a big file within seconds or less by measuring the size of one line and extrapolating from it:

#!/bin/bash
# Sample the second line of the file (the first may be a header; see the note below).
head -2 "$1" | tail -1 > "${1}_oneline"
filesize=$(du -b "$1" | cut -f1)
linesize=$(du -b "${1}_oneline" | cut -f1)
rm "${1}_oneline"
# Estimate: total file size divided by the size of the sampled line.
echo $((filesize / linesize))

If you name this script lines.sh, you can call lines.sh bigfile.txt to get the estimated number of lines. In my case (about 6 GB, exported from a database), the deviation from the true line count was only 3%, but it ran about 1000 times faster. By the way, I used the second line, not the first, as the basis, because the first line contained column names and the actual data started on the second line.
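For example, with a placeholder file name:

chmod +x lines.sh
./lines.sh bigfile.txt   # prints the estimated line count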


If your data resides on HDFS, perhaps the fastest approach is to use Hadoop streaming. Apache Pig's COUNT UDF operates on a bag and therefore uses a single reducer to compute the number of rows. Instead, you can manually set the number of reducers in a simple Hadoop streaming script as follows:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -Dmapred.reduce.tasks=100 -input <input_path> -output <output_path> -mapper /bin/cat -reducer "wc -l"

Note that I manually set the number of reducers to 100, but you can tune this parameter. Once the map-reduce job is done, the result from each reducer is stored in a separate file. The final count of rows is the sum of the numbers returned by all the reducers; you can get it as follows:

$HADOOP_HOME/bin/hadoop fs -cat <output_path>/* | paste -sd+ | bc

I have a 645 GB text file, and none of the earlier exact solutions (e.g. wc -l) returned an answer within 5 minutes.

Instead, here is a Python script that computes the approximate number of lines in a huge file. (My text file apparently has about 5.5 billion lines.) The Python script does the following:

A. Counts the number of bytes in the file.

B. Reads the first N lines in the file (as a sample) and computes the average line length.

C. Computes A/B as the approximate number of lines.

It follows along the line of Nico's answer, but instead of taking the length of one line, it computes the average length of the first N lines.

Note: I'm assuming an ASCII text file, so the number of characters that Python's len() returns equals the number of bytes.

Put this code into a file line_length.py:

#!/usr/bin/env python

# Usage:
# python line_length.py <filename> <N> 

import os
import sys
import numpy as np

if __name__ == '__main__':

    file_name = sys.argv[1]
    N = int(sys.argv[2]) # Number of first lines to use as sample.
    file_length_in_bytes = os.path.getsize(file_name)
    lengths = [] # Accumulate line lengths.
    num_lines = 0

    with open(file_name) as f:
        for line in f:
            num_lines += 1
            if num_lines > N:
                break
            lengths.append(len(line))

    arr = np.array(lengths)
    lines_count = len(arr)
    line_length_mean = np.mean(arr)
    line_length_std = np.std(arr)

    line_count_mean = file_length_in_bytes / line_length_mean

    print('File has %d bytes.' % (file_length_in_bytes))
    print('%.2f mean bytes per line (%.2f std)' % (line_length_mean, line_length_std))
    print('Approximately %d lines' % (line_count_mean))

Invoke it like this, with N=5000:

% python line_length.py big_file.txt 5000

File has 645620992933 bytes.
116.34 mean bytes per line (42.11 std)
Approximately 5549547119 lines

So there are about 5.5 billion lines in the file.


To print the line count of each file matching a pattern:

find . -type f -name "filepattern_2015_07_*.txt" -exec wc -l {} \;

This prints each file's line count followed by its name.

Your limiting factor is the I/O speed of your storage device, so switching between simple newline/pattern-counting programs won't help: the difference in execution speed between those programs is likely to be dwarfed by however much slower your disk/storage is.

But if you have the same file copied across disks/devices, or the file is distributed among those disks, you can certainly perform the operation in parallel. I don't know much about Hadoop specifically, but assuming you can read a 10 GB file from 4 different locations, you can run 4 different line-counting processes, each on one part of the file, and sum their results up:

$ dd bs=4k count=655360 if=/path/to/copy/on/disk/1/file | wc -l &
$ dd bs=4k skip=655360 count=655360 if=/path/to/copy/on/disk/2/file | wc -l &
$ dd bs=4k skip=1310720 count=655360 if=/path/to/copy/on/disk/3/file | wc -l &
$ dd bs=4k skip=1966080 if=/path/to/copy/on/disk/4/file | wc -l &

Notice the & at the end of each command line, so all will run in parallel; dd works like cat here, but allows us to specify how many bytes to read (count * bs bytes) and how many to skip at the beginning of the input (skip * bs bytes). It works in blocks, hence the need to specify bs as the block size. In this example, I've partitioned the 10 GB file into 4 equal chunks of 4 KB * 655360 = 2684354560 bytes = 2.5 GB, one given to each job; you may want to set up a script that does this for you based on the size of the file and the number of parallel jobs you will run. You also need to sum the results of the executions, which I haven't done for lack of shell-scripting ability; a sketch of that step follows below.
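A minimal sketch of that missing summing step, assuming bash and the same hypothetical paths and chunk sizes as above:

#!/bin/bash
# Run the four dd | wc jobs in parallel, writing each partial count to a temp file.
tmp=$(mktemp -d)
dd bs=4k count=655360 if=/path/to/copy/on/disk/1/file 2>/dev/null | wc -l > "$tmp/1" &
dd bs=4k skip=655360 count=655360 if=/path/to/copy/on/disk/2/file 2>/dev/null | wc -l > "$tmp/2" &
dd bs=4k skip=1310720 count=655360 if=/path/to/copy/on/disk/3/file 2>/dev/null | wc -l > "$tmp/3" &
dd bs=4k skip=1966080 if=/path/to/copy/on/disk/4/file 2>/dev/null | wc -l > "$tmp/4" &
wait                                 # wait for all four background jobs
cat "$tmp"/* | paste -sd+ - | bc     # sum the four partial counts
rm -r "$tmp"

Since every newline falls into exactly one chunk, the summed count is exact.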

If your filesystem is smart enough to split a big file among many devices, like a RAID or a distributed filesystem, and to automatically parallelize I/O requests that can be parallelized, you can do such a split, running many parallel jobs but using the same file path, and you may still get some speed gain.

EDIT: Another idea that occurred to me: if all the lines inside the file have the same size, you can get the exact number of lines by dividing the size of the file by the size of one line, both in bytes. You can do it almost instantaneously in a single job. If you have the mean line size and don't need the exact line count, but want an estimate, you can do this same operation and get a satisfactory result much faster than the exact operation.
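For example, a sketch of that division, assuming GNU stat and a hypothetical fixed line length of 117 bytes (newline included):

echo $(( $(stat -c%s bigfile.txt) / 117 ))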


I'm not sure that python is quicker:

[root@myserver scripts]# time python -c "print len(open('mybigfile.txt').read().split('\n'))"

644306


real    0m0.310s
user    0m0.176s
sys     0m0.132s

[root@myserver scripts]# time  cat mybigfile.txt  | wc -l

644305


real    0m0.048s
user    0m0.017s
sys     0m0.074s

Let us assume:

  • Your file system is distributed
  • Your file system can easily fill the network connection to a single node
  • You access your files like normal files

then you really want to chop the file into parts, count the parts in parallel on multiple nodes, and sum up the results from there (this is basically @Chris White's idea).

Here is how you do that with GNU Parallel (version > 20161222). You need to list the nodes in ~/.parallel/my_cluster_hosts, and you must have ssh access to all of them:

parwc() {
    # Usage:
    #   parwc -l file

    # Give one chunk per host
    chunks=$(cat ~/.parallel/my_cluster_hosts | wc -l)
    # Build commands that take a chunk each and do 'wc' on that
    # ("map")
    parallel -j $chunks --block -1 --pipepart -a "$2" -vv --dryrun wc "$1" |
        # For each command:
        #   log into a cluster host
        #   cd to the current working dir
        #   execute the command
        parallel -j0 --slf my_cluster_hosts --wd . |
        # Sum up the number of lines
        # ("reduce")
        perl -ne '$sum += $_; END { print $sum,"\n" }'
}

Use as:

parwc -l myfile
parwc -w myfile
parwc -c myfile

On a multi-core server, use GNU parallel to count file lines in parallel. After each file's line count is printed, bc sums all the line counts.

find . -name '*.txt' | parallel 'wc -l {}' 2>/dev/null | paste -sd+ - | bc

To save space, you can even keep all the files compressed. The following line decompresses each file and counts its lines in parallel, then sums all the counts.

find . -name '*.xz' | parallel 'xzcat {} | wc -l' 2>/dev/null | paste -sd+ - | bc

If your computer has python, you can try this from the shell:

python -c "print len(open('test.txt').read().split('\n'))"

This uses python -c to pass in a command that basically reads the file and splits it on the newline character, so the length of the resulting list gives the line count. (Note that this is Python 2 print syntax.)

Checking against @BlueMoon's sed suggestion:

bash-3.2$ sed -n '$=' test.txt
519

Using the above:

bash-3.2$ python -c "print len(open('test.txt').read().split('\n'))"
519
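Note that the one-liner above slurps the entire file into memory, so it is not suited to files as large as the ones in the question. A line-by-line sketch that avoids this (Python 3 syntax; test.txt as above):

python3 -c "import sys; print(sum(1 for _ in open(sys.argv[1], 'rb')))" test.txt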

As per my test, I can verify that spark-shell (Scala-based) is way faster than the other tools (grep, sed, awk, perl, wc). Here are the results of a test I ran on a file with 23,782,409 lines:

time grep -c $ my_file.txt;

real 0m44.96s user 0m41.59s sys 0m3.09s

time wc -l my_file.txt;

real 0m37.57s user 0m33.48s sys 0m3.97s

time sed -n '$=' my_file.txt;

real 0m38.22s user 0m28.05s sys 0m10.14s

time perl -ne 'END { $_=$.;if(!/^[0-9]+$/){$_=0;};print "$_" }' my_file.txt;

real 0m23.38s user 0m20.19s sys 0m3.11s

time awk 'END { print NR }' my_file.txt;

real 0m19.90s user 0m16.76s sys 0m3.12s

spark-shell
import org.joda.time._
val t_start = DateTime.now()
sc.textFile("file://my_file.txt").count()
val t_end = DateTime.now()
new Period(t_start, t_end).toStandardSeconds()

res1: org.joda.time.Seconds = PT15S


If your bottleneck is the disk, it matters how you read from it. dd if=filename bs=128M | wc -l is a lot faster than wc -l filename or cat filename | wc -l on my machine, which has an HDD plus a fast CPU and RAM. You can play around with the block size and see what dd reports as the throughput; I cranked it up to 1 GiB.

Note: there is some debate about whether cat or dd is faster. All I claim is that dd can be faster, depending on the system, and that it is for me. Try it yourself.
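For instance, a quick comparison you could run yourself (status=none merely silences dd's transfer statistics; this assumes GNU dd):

time wc -l filename
time dd if=filename bs=128M status=none | wc -l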