[linux] Count lines in large files

I commonly work with text files of ~20 GB in size, and I find myself counting the number of lines in a given file very often.

The way I do it now is just cat fname | wc -l, and it takes very long. Is there any solution that would be much faster?

I work on a high-performance cluster with Hadoop installed. I was wondering if a map-reduce approach could help.

I'd like the solution to be as simple as a one-line run, like the wc -l solution, but I'm not sure how feasible it is.

Any ideas?

This question is related to linux mapreduce

The answer is


Hadoop essentially provides a mechanism to perform something similar to what @Ivella is suggesting.

Hadoop's HDFS (Hadoop Distributed File System) will take your 20 GB file and store it across the cluster in blocks of a fixed size. Let's say you configure the block size to be 128 MB; the file would then be split into 160 (20 x 8) blocks of 128 MB each.

You would then run a map reduce program over this data, essentially counting the lines for each block (in the map stage) and then reducing these block line counts into a final line count for the entire file.

As for performance: in general, the bigger your cluster, the better the performance (more wc's running in parallel over more independent disks), but there is some overhead in job orchestration, which means that running the job on smaller files will not actually yield quicker throughput than a local wc.
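As a single-machine sketch of the same map/reduce shape (purely illustrative, assuming GNU parallel is installed and bigfile.txt is a placeholder name), you can count each 128 MB block in parallel and then sum the per-block counts:

parallel --pipepart --block 128M -a bigfile.txt wc -l | paste -sd+ - | bc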


I know the question is a few years old now, but expanding on Ivella's last idea, this bash script estimates the line count of a big file within seconds or less by measuring the size of one line and extrapolating from it:

#!/bin/bash
# Sample the second line of the file (the first may be a header; see the note below).
head -2 "$1" | tail -1 > "${1}_oneline"
filesize=$(du -b "$1" | cut -f1)
linesize=$(du -b "${1}_oneline" | cut -f1)
rm "${1}_oneline"
# Estimate: total file size divided by the size of the sampled line.
echo $((filesize / linesize))

If you name this script lines.sh, you can call lines.sh bigfile.txt to get the estimated number of lines. In my case (about 6 GB, exported from a database), the deviation from the true line count was only 3%, but it ran about 1000 times faster. By the way, I used the second line, not the first, as the basis, because the first line contained column names and the actual data started on the second line.
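For example, with a placeholder file name:

chmod +x lines.sh
./lines.sh bigfile.txt   # prints the estimated line count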


If your data resides on HDFS, perhaps the fastest approach is to use Hadoop streaming. Apache Pig's COUNT UDF operates on a bag and therefore uses a single reducer to compute the number of rows. Instead, you can manually set the number of reducers in a simple Hadoop streaming script as follows:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -Dmapred.reduce.tasks=100 -input <input_path> -output <output_path> -mapper /bin/cat -reducer "wc -l"

Note that I manually set the number of reducers to 100, but you can tune this parameter. Once the map-reduce job is done, the result from each reducer is stored in a separate file. The final count of rows is the sum of the numbers returned by all the reducers; you can get it as follows:

$HADOOP_HOME/bin/hadoop fs -cat <output_path>/* | paste -sd+ | bc

I have a 645 GB text file, and none of the earlier exact solutions (e.g. wc -l) returned an answer within 5 minutes.

Instead, here is a Python script that computes the approximate number of lines in a huge file. (My text file apparently has about 5.5 billion lines.) The Python script does the following:

A. Counts the number of bytes in the file.

B. Reads the first N lines in the file (as a sample) and computes the average line length.

C. Computes A/B as the approximate number of lines.

It follows along the line of Nico's answer, but instead of taking the length of one line, it computes the average length of the first N lines.

Note: I'm assuming an ASCII text file, so the number of characters that Python's len() returns equals the number of bytes.

Put this code into a file line_length.py:

#!/usr/bin/env python

# Usage:
# python line_length.py <filename> <N> 

import os
import sys
import numpy as np

if __name__ == '__main__':

    file_name = sys.argv[1]
    N = int(sys.argv[2]) # Number of first lines to use as sample.
    file_length_in_bytes = os.path.getsize(file_name)
    lengths = [] # Accumulate line lengths.
    num_lines = 0

    with open(file_name) as f:
        for line in f:
            num_lines += 1
            if num_lines > N:
                break
            lengths.append(len(line))

    arr = np.array(lengths)
    lines_count = len(arr)
    line_length_mean = np.mean(arr)
    line_length_std = np.std(arr)

    line_count_mean = file_length_in_bytes / line_length_mean

    print('File has %d bytes.' % (file_length_in_bytes))
    print('%.2f mean bytes per line (%.2f std)' % (line_length_mean, line_length_std))
    print('Approximately %d lines' % (line_count_mean))

Invoke it like this, with N=5000:

% python line_length.py big_file.txt 5000

File has 645620992933 bytes.
116.34 mean bytes per line (42.11 std)
Approximately 5549547119 lines

So there are about 5.5 billion lines in the file.


To print the line count of each file matching a pattern:

find . -type f -name "filepattern_2015_07_*.txt" -exec wc -l {} \;

This prints each file's line count followed by its name.

Your limiting factor is the I/O speed of your storage device, so switching between simple newline/pattern-counting programs won't help: the difference in execution speed between those programs is likely to be dwarfed by however much slower your disk/storage is.

But if you have the same file copied across disks/devices, or the file is distributed among those disks, you can certainly perform the operation in parallel. I don't know much about Hadoop specifically, but assuming you can read a 10 GB file from 4 different locations, you can run 4 different line-counting processes, each on one part of the file, and sum their results up:

$ dd bs=4k count=655360 if=/path/to/copy/on/disk/1/file | wc -l &
$ dd bs=4k skip=655360 count=655360 if=/path/to/copy/on/disk/2/file | wc -l &
$ dd bs=4k skip=1310720 count=655360 if=/path/to/copy/on/disk/3/file | wc -l &
$ dd bs=4k skip=1966080 if=/path/to/copy/on/disk/4/file | wc -l &

Notice the & at the end of each command line, so all will run in parallel; dd works like cat here, but allows us to specify how many bytes to read (count * bs bytes) and how many to skip at the beginning of the input (skip * bs bytes). It works in blocks, hence the need to specify bs as the block size. In this example, I've partitioned the 10 GB file into 4 equal chunks of 4 KB * 655360 = 2684354560 bytes = 2.5 GB, one given to each job; you may want to set up a script that does this for you based on the size of the file and the number of parallel jobs you will run. You also need to sum the results of the executions, which I haven't done for lack of shell-scripting ability; a sketch of that step follows below.
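A minimal sketch of that missing summing step, assuming bash and the same hypothetical paths and chunk sizes as above:

#!/bin/bash
# Run the four dd | wc jobs in parallel, writing each partial count to a temp file.
tmp=$(mktemp -d)
dd bs=4k count=655360 if=/path/to/copy/on/disk/1/file 2>/dev/null | wc -l > "$tmp/1" &
dd bs=4k skip=655360 count=655360 if=/path/to/copy/on/disk/2/file 2>/dev/null | wc -l > "$tmp/2" &
dd bs=4k skip=1310720 count=655360 if=/path/to/copy/on/disk/3/file 2>/dev/null | wc -l > "$tmp/3" &
dd bs=4k skip=1966080 if=/path/to/copy/on/disk/4/file 2>/dev/null | wc -l > "$tmp/4" &
wait                                 # wait for all four background jobs
cat "$tmp"/* | paste -sd+ - | bc     # sum the four partial counts
rm -r "$tmp"

Since every newline falls into exactly one chunk, the summed count is exact.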

If your filesystem is smart enough to split a big file among many devices, like a RAID or a distributed filesystem, and to automatically parallelize I/O requests that can be parallelized, you can do such a split, running many parallel jobs but using the same file path, and you may still get some speed gain.

EDIT: Another idea that occurred to me: if all the lines inside the file have the same size, you can get the exact number of lines by dividing the size of the file by the size of one line, both in bytes. You can do it almost instantaneously in a single job. If you have the mean line size and don't need the exact line count, but want an estimate, you can do this same operation and get a satisfactory result much faster than the exact operation.
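For example, a sketch of that division, assuming GNU stat and a hypothetical fixed line length of 117 bytes (newline included):

echo $(( $(stat -c%s bigfile.txt) / 117 ))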


I'm not sure that python is quicker:

[root@myserver scripts]# time python -c "print len(open('mybigfile.txt').read().split('\n'))"

644306


real    0m0.310s
user    0m0.176s
sys     0m0.132s

[root@myserver scripts]# time  cat mybigfile.txt  | wc -l

644305


real    0m0.048s
user    0m0.017s
sys     0m0.074s

Let us assume:

  • Your file system is distributed
  • Your file system can easily fill the network connection to a single node
  • You access your files like normal files

then you really want to chop the file into parts, count the parts in parallel on multiple nodes, and sum up the results from there (this is basically @Chris White's idea).

Here is how you do that with GNU Parallel (version > 20161222). You need to list the nodes in ~/.parallel/my_cluster_hosts, and you must have ssh access to all of them:

parwc() {
    # Usage:
    #   parwc -l file

    # Give one chunk per host
    chunks=$(cat ~/.parallel/my_cluster_hosts | wc -l)
    # Build commands that take a chunk each and do 'wc' on that
    # ("map")
    parallel -j $chunks --block -1 --pipepart -a "$2" -vv --dryrun wc "$1" |
        # For each command:
        #   log into a cluster host
        #   cd to the current working dir
        #   execute the command
        parallel -j0 --slf my_cluster_hosts --wd . |
        # Sum up the number of lines
        # ("reduce")
        perl -ne '$sum += $_; END { print $sum,"\n" }'
}

Use as:

parwc -l myfile
parwc -w myfile
parwc -c myfile

On a multi-core server, use GNU parallel to count file lines in parallel. After each file's line count is printed, bc sums all the line counts.

find . -name '*.txt' | parallel 'wc -l {}' 2>/dev/null | paste -sd+ - | bc

To save space, you can even keep all the files compressed. The following line decompresses each file and counts its lines in parallel, then sums all the counts.

find . -name '*.xz' | parallel 'xzcat {} | wc -l' 2>/dev/null | paste -sd+ - | bc

If your computer has python, you can try this from the shell:

python -c "print len(open('test.txt').read().split('\n'))"

This uses python -c to pass in a command that basically reads the file and splits it on the newline character, so the length of the resulting list gives the line count. (Note that this is Python 2 print syntax.)

Checking against @BlueMoon's sed suggestion:

bash-3.2$ sed -n '$=' test.txt
519

Using the above:

bash-3.2$ python -c "print len(open('test.txt').read().split('\n'))"
519
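Note that the one-liner above slurps the entire file into memory, so it is not suited to files as large as the ones in the question. A line-by-line sketch that avoids this (Python 3 syntax; test.txt as above):

python3 -c "import sys; print(sum(1 for _ in open(sys.argv[1], 'rb')))" test.txt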

As per my test, I can verify that spark-shell (Scala-based) is way faster than the other tools (grep, sed, awk, perl, wc). Here are the results of a test I ran on a file with 23,782,409 lines:

time grep -c $ my_file.txt;

real 0m44.96s user 0m41.59s sys 0m3.09s

time wc -l my_file.txt;

real 0m37.57s user 0m33.48s sys 0m3.97s

time sed -n '$=' my_file.txt;

real 0m38.22s user 0m28.05s sys 0m10.14s

time perl -ne 'END { $_=$.;if(!/^[0-9]+$/){$_=0;};print "$_" }' my_file.txt;

real 0m23.38s user 0m20.19s sys 0m3.11s

time awk 'END { print NR }' my_file.txt;

real 0m19.90s user 0m16.76s sys 0m3.12s

spark-shell
import org.joda.time._
val t_start = DateTime.now()
sc.textFile("file://my_file.txt").count()
val t_end = DateTime.now()
new Period(t_start, t_end).toStandardSeconds()

res1: org.joda.time.Seconds = PT15S


If your bottleneck is the disk, it matters how you read from it. dd if=filename bs=128M | wc -l is a lot faster than wc -l filename or cat filename | wc -l on my machine, which has an HDD plus a fast CPU and RAM. You can play around with the block size and see what dd reports as the throughput; I cranked it up to 1 GiB.

Note: there is some debate about whether cat or dd is faster. All I claim is that dd can be faster, depending on the system, and that it is for me. Try it yourself.
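For instance, a quick comparison you could run yourself (status=none merely silences dd's transfer statistics; this assumes GNU dd):

time wc -l filename
time dd if=filename bs=128M status=none | wc -l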