Fast Linux file count for a large number of files

Question

I'm trying to figure out the best way to find the number of files in a particular directory when there are a very large number of files (more than 100,000).

When there are that many files, performing ls | wc -l takes quite a long time to execute. I believe this is because it's returning the names of all the files. I'm trying to take up as little of the disk I/O as possible.

I have experimented with some shell and Perl scripts to no avail. How can I do it?

User · Answer

The fastest way on Linux (the question is tagged as Linux) is to use a direct system call. Here's a little program that counts files (only files, no directories) in a directory. You can count millions of files, and it is around 2.5 times faster than "ls -f" and around 1.3-1.5 times faster than Christopher Schultz's answer.

    #define _GNU_SOURCE
    #include <dirent.h>
    #include <stdio.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>      /* close(), syscall() */
    #include <sys/syscall.h>

    #define BUF_SIZE 4096

    struct linux_dirent {
        long d_ino;
        off_t d_off;
        unsigned short d_reclen;
        char d_name[];
    };

    int countDir(char *dir) {
        int fd, nread, bpos, numFiles = 0;
        char d_type, buf[BUF_SIZE];
        struct linux_dirent *dirEntry;

        fd = open(dir, O_RDONLY | O_DIRECTORY);
        if (fd == -1) {
            puts("open directory error");
            exit(3);
        }
        while (1) {
            nread = syscall(SYS_getdents, fd, buf, BUF_SIZE);
            if (nread == -1) {
                puts("getdents error");
                exit(1);
            }
            if (nread == 0) {
                break;
            }

            for (bpos = 0; bpos < nread;) {
                dirEntry = (struct linux_dirent *) (buf + bpos);
                /* The file type is stored in the last byte of each record */
                d_type = *(buf + bpos + dirEntry->d_reclen - 1);
                if (d_type == DT_REG) {
                    // Increase counter
                    numFiles++;
                }
                bpos += dirEntry->d_reclen;
            }
        }
        close(fd);

        return numFiles;
    }

    int main(int argc, char **argv) {
        if (argc != 2) {
            puts("Pass directory as parameter");
            return 2;
        }
        printf("Number of files in %s: %d\n", argv[1], countDir(argv[1]));
        return 0;
    }

PS: It is not recursive, but you could modify it to achieve that.

User · Answer

By default ls sorts the names, which can take a while if there are a lot of them. Also, there will be no output until all of the names are read and sorted. Use the ls -f option to turn off sorting:

    ls -f | wc -l

Note that this will also enable -a, so ., .., and other files starting with . will be counted.
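Because ls -f also lists . and .., the raw line count is two higher than the number of real entries. A minimal sketch of a wrapper that corrects for this is shown here; the function name count_entries_ls is made up for illustration, and it assumes no file name contains a newline (each embedded newline would add one to the count).

```shell
#!/usr/bin/env bash
# count_entries_ls is a hypothetical helper name, not part of any tool.
# Counts all entries (including hidden ones) in a directory using unsorted ls,
# then subtracts the "." and ".." entries that -f always lists.
count_entries_ls() {
    local dir="${1:-.}"
    local n
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
    # "command" bypasses any shell alias or function named ls
    n=$(command ls -f -- "$dir" | wc -l)
    echo $(( n - 2 ))
}
```

It could be called as count_entries_ls /path/to/dir; with no argument it counts the current directory.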

User · Answer

Use find. For example:

    find . -name "*.ext" | wc -l

User · Answer

I realized that not using in-memory processing, when you have a huge amount of data, is faster than "piping" the commands. So I saved the result to a file and analyzed it afterwards:

    ls -1 /path/to/dir > count.txt && cat count.txt | wc -l

User · Answer

ls spends more time sorting the file names. Use -f to disable the sorting, which will save some time:

    ls -f | wc -l

Or you can use find:

    find . -type f | wc -l
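The find variant counts output lines, so a file name containing a newline is counted twice, and without a depth limit it also descends into subdirectories. One possible sketch that avoids both, assuming GNU find (-printf and -maxdepth are GNU extensions), prints a single character per file and counts characters instead. The function name count_regular_files is hypothetical.

```shell
#!/usr/bin/env bash
# count_regular_files is a hypothetical helper name.
# Counts regular files directly inside a directory (no recursion).
# Printing one "." per file and counting bytes makes the result
# independent of what characters the file names contain.
count_regular_files() {
    local dir="${1:-.}"
    find "$dir" -maxdepth 1 -type f -printf '.' | wc -c
}
```

Dropping -maxdepth 1 would make the same function count recursively.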

User · Answer

You can change the output based on your requirements, but here is a Bash one-liner I wrote to recursively count and report the number of files in a series of numerically named directories.

    dir=/tmp/count_these/ ; for i in $(ls -1 ${dir} | sort -n) ; do echo "$i => $(find ${dir}${i} -type f | wc -l),"; done

This looks recursively for all files (not directories) in the given directory and returns the results in a hash-like format. Simple tweaks to the find command could make the kind of files you're looking to count more specific, etc.

It results in something like this:

    1 => 38,
    65 => 95052,
    66 => 12823,
    67 => 10572,
    69 => 67275,
    70 => 8105,
    71 => 42052,
    72 => 1184,
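The one-liner parses the output of ls, which breaks on directory names containing whitespace. A hedged sketch of the same per-directory report using a shell glob instead is given here; count_per_subdir is an invented name, the trailing comma of the original output is dropped, and hidden subdirectories are skipped because the glob does not match them.

```shell
#!/usr/bin/env bash
# count_per_subdir is a hypothetical helper name.
# For each immediate subdirectory, prints "name => number of files below it",
# sorted numerically by directory name (as in the original one-liner).
count_per_subdir() {
    local dir="${1:-.}"
    local sub
    for sub in "$dir"/*/; do
        [ -d "$sub" ] || continue          # the glob matched nothing
        sub="${sub%/}"                     # strip the trailing slash
        printf '%s => %s\n' "${sub##*/}" "$(find "$sub" -type f | wc -l)"
    done | sort -n
}
```

For example, count_per_subdir /tmp/count_these would report one line per subdirectory of /tmp/count_these.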

User · Answer

You should use "getdents" in place of ls/find.

Here is one very good article which describes the getdents approach:

http://be-n.com/spw/you-can-list-a-million-files-in-a-directory-but-not-with-ls.html

Here is the extract:

ls and practically every other method of listing a directory (including Python's os.listdir and find .) rely on libc readdir(). However, readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory (e.g., 500 million directory entries), it is going to take an insanely long time to read all the directory entries, especially on a slow disk. For directories containing a large number of files, you'll need to dig deeper than tools that rely on readdir(). You will need to use the getdents() system call directly, rather than helper methods from the C standard library.

We can find the C code to list the files using getdents() from here.

There are two modifications you will need to do in order to quickly list all the files in a directory.

First, increase the buffer size from X to something like 5 megabytes:

    #define BUF_SIZE 1024*1024*5

Then modify the main loop where it prints out the information about each file in the directory to skip entries with inode == 0. I did this by adding:

    if (dp->d_ino != 0) printf(...);

In my case I also really only cared about the file names in the directory, so I also rewrote the printf() statement to only print the filename:

    if(d->d_ino) printf("%s\n", (char *) d->d_name);

Compile it (it doesn't need any external libraries, so it's super simple to do):

    gcc listdir.c -o listdir

Now just run:

    ./listdir [directory with an insane number of files]

User · Answer

This answer here is faster than almost everything else on this page for very large, very nested directories:

https://serverfault.com/a/691372/84703

    locate -r '.' | grep -c "^$PWD"

User · Answer

You could check whether using opendir() and readdir() in Perl is faster. For an example of those functions, look here.

User · Answer

Surprisingly for me, a bare-bones find is very much comparable to ls -f:

    > time ls -f my_dir | wc -l
    17626

    real    0m0.015s
    user    0m0.011s
    sys     0m0.009s

versus

    > time find my_dir -maxdepth 1 | wc -l
    17625

    real    0m0.014s
    user    0m0.008s
    sys     0m0.010s

Of course, the values in the third decimal place shift around a bit every time you execute any of these, so they're basically identical. Notice, however, that find returns one extra unit, because it counts the actual directory itself (and, as mentioned before, ls -f returns two extra units, since it also counts . and ..).
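The off-by-one from find counting the directory itself can be avoided rather than subtracted. A small sketch, assuming a find that supports -mindepth and -maxdepth (GNU and BSD find both do) and file names without newlines; count_entries_find is a name invented here.

```shell
#!/usr/bin/env bash
# count_entries_find is a hypothetical helper name.
# -mindepth 1 excludes the starting directory itself, so the result is the
# number of entries (files, directories, hidden entries) directly inside it.
count_entries_find() {
    local dir="${1:-.}"
    find "$dir" -mindepth 1 -maxdepth 1 | wc -l
}
```

For the directory timed above, this would be invoked as count_entries_find my_dir.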

User · Answer

The first 10 directories with the highest number of files:

    dir=/ ; for i in $(ls -1 ${dir} | sort -n) ; do echo "$(find ${dir}${i} -type f | wc -l) => $i,"; done | sort -nr | head -10
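As a rough sketch, the same ranking can be written as a function that takes the directory and the number of results as parameters and uses a glob instead of parsing ls. The name top_dirs_by_file_count and both parameter defaults are assumptions made for this example; hidden subdirectories are not matched by the glob.

```shell
#!/usr/bin/env bash
# top_dirs_by_file_count is a hypothetical helper name.
# Usage: top_dirs_by_file_count [directory] [how_many]
# Prints "count => name" for the subdirectories containing the most files,
# largest count first.
top_dirs_by_file_count() {
    local dir="${1:-.}"
    local limit="${2:-10}"
    local sub
    for sub in "$dir"/*/; do
        [ -d "$sub" ] || continue          # the glob matched nothing
        sub="${sub%/}"                     # strip the trailing slash
        printf '%s => %s\n' "$(find "$sub" -type f | wc -l)" "${sub##*/}"
    done | sort -nr | head -n "$limit"
}
```

Each find still walks a whole subtree, so on very large trees the run time is dominated by the traversal, not by the sorting.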

User · Answer

find, ls, and perl tested against 40,000 files have the same speed (though I didn't try to clear the cache):

    [user@server logs]$ time find . | wc -l
    42917

    real    0m0.054s
    user    0m0.018s
    sys     0m0.040s

    [user@server logs]$ time /bin/ls -f | wc -l
    42918

    real    0m0.059s
    user    0m0.027s
    sys     0m0.037s

And with Perl's opendir and readdir, the same time:

    [user@server logs]$ time perl -e 'opendir D, "."; @files = readdir D; closedir D; print scalar(@files)."\n"'
    42918

    real    0m0.057s
    user    0m0.024s
    sys     0m0.033s

Note: I used /bin/ls -f to make sure to bypass the alias option, which might slow things down a little bit, and -f to avoid file ordering. ls without -f is twice as slow as find/perl, whereas with -f it seems to take the same time:

    [user@server logs]$ time /bin/ls . | wc -l
    42916

    real    0m0.109s
    user    0m0.070s
    sys     0m0.044s

I also would like to have some script to ask the file system directly without all the unnecessary information.

The tests were based on the answers of Peter van der Heijden, glenn jackman, and mark4o.

User · Answer

I prefer the following command to keep track of the changes in the number of files in a directory:

    watch -d -n 0.01 'ls | wc -l'

The command keeps a window open that tracks the number of files in the directory. watch does not allow intervals shorter than 0.1 seconds, so the requested 0.01 is raised to that minimum and the display refreshes every 0.1 seconds.

User · Answer

I came here when trying to count the files in a data set of approximately 10,000 folders with approximately 10,000 files each. The problem with many of the approaches is that they implicitly stat 100 million files, which takes ages.

I took the liberty to extend the approach by Christopher Schultz so it supports passing directories via arguments (his recursive approach uses stat as well).

Put the following into file dircnt_args.c:

    #include <stdio.h>
    #include <dirent.h>

    int main(int argc, char *argv[]) {
        DIR *dir;
        struct dirent *ent;
        long count;
        long countsum = 0;
        int i;

        for (i = 1; i < argc; i++) {
            dir = opendir(argv[i]);
            /* Skip arguments that cannot be opened as a directory */
            if (dir == NULL) {
                perror(argv[i]);
                continue;
            }
            count = 0;
            /* Counts every entry, including "." and ".." */
            while ((ent = readdir(dir)))
                ++count;

            closedir(dir);

            printf("%s contains %ld files\n", argv[i], count);
            countsum += count;
        }
        printf("sum: %ld\n", countsum);

        return 0;
    }

After a gcc -o dircnt_args dircnt_args.c you can invoke it like this:

    dircnt_args /your/directory/*

On 100 million files in 10,000 folders, the above completes quite quickly (approximately 5 minutes for the first run, and approximately 23 seconds for a follow-up run on a warm cache).

The only other approach that finished in less than an hour was ls, with about 1 minute on a warm cache:

    ls -f /your/directory/* | wc -l

The count is off by a couple of newlines per directory, though.

Contrary to what I expected, none of my attempts with find returned within an hour :-/

User · Answer

Fast Linux file count

The fastest Linux file count I know is:

    locate -c -r '/home'

There is no need to invoke grep! But as mentioned, you should have a fresh database (updated daily by a cron job, or manually with sudo updatedb).

From man locate:

    -c, --count
        Instead of writing file names on standard output, write the number of matching entries only.

Additionally, you should know that it also counts the directories as files!

BTW: If you want an overview of your files and directories on your system, type:

    locate -S

It outputs the number of directories, files, etc.

User · Answer

You can get a count of files and directories with the tree program.

Run the command tree | tail -n 1 to get the last line, which will say something like "763 directories, 9290 files". This counts files and folders recursively, excluding hidden files, which can be added with the flag -a. For reference, it took 4.8 seconds on my computer for tree to count my whole home directory, which was 24,777 directories, 238,680 files. find -type f | wc -l took 5.3 seconds, half a second longer, so I think tree is pretty competitive speed-wise.

As long as you don't have any subfolders, tree is a quick and easy way to count the files.

Also, and purely for the fun of it, you can use tree | grep '^├' to only show the files/folders in the current directory - this is basically a much slower version of ls.

User · Answer

The fastest way is a purpose-built program, like this:

    #include <stdio.h>
    #include <dirent.h>

    int main(int argc, char *argv[]) {
        DIR *dir;
        struct dirent *ent;
        long count = 0;

        dir = opendir(argv[1]);
        if (dir == NULL) {
            perror(argv[1]);
            return 1;
        }

        /* Counts every entry, including "." and ".." */
        while ((ent = readdir(dir)))
            ++count;

        closedir(dir);

        printf("%s contains %ld files\n", argv[1], count);

        return 0;
    }

From my testing without regard to cache, I ran each of these about 50 times each against the same directory, over and over, to avoid cache-based data skew, and I got roughly the following performance numbers (in real clock time):

    ls -1  | wc - 0:01.67
    ls -f1 | wc - 0:00.14
    find   | wc - 0:00.22
    dircnt | wc - 0:00.04

That last one, dircnt, is the program compiled from the above source.

EDIT 2016-09-26

Due to popular demand, I've re-written this program to be recursive, so it will drop into subdirectories and continue to count files and directories separately.

Since it's clear some folks want to know how to do all this, I have a lot of comments in the code to try to make it obvious what's going on. I wrote this and tested it on 64-bit Linux, but it should work on any POSIX-compliant system, including Microsoft Windows. Bug reports are welcome; I'm happy to update this if you can't get it working on your AIX or OS/400 or whatever.

As you can see, it's much more complicated than the original, and necessarily so: at least one function must exist to be called recursively unless you want the code to become very complex (e.g. managing a subdirectory stack and processing that in a single loop). Since we have to check file types, differences between different OSs, standard libraries, etc. come into play, so I have written a program that tries to be usable on any system where it will compile.

There is very little error checking, and the count function itself doesn't really report errors. The only calls that can really fail are opendir and stat (if you aren't lucky and have a system where dirent contains the file type already). I'm not paranoid about checking the total length of the subdir pathnames, but theoretically, the system shouldn't allow any path name that is longer than PATH_MAX. If there are concerns, I can fix that, but it's just more code that needs to be explained to someone learning to write C. This program is intended to be an example of how to dive into subdirectories recursively.

    #include <stdio.h>
    #include <dirent.h>
    #include <string.h>
    #include <stdlib.h>
    #include <limits.h>
    #include <sys/stat.h>

    #if defined(WIN32) || defined(_WIN32)
    #define PATH_SEPARATOR '\\'
    #else
    #define PATH_SEPARATOR '/'
    #endif

    /* A custom structure to hold separate file and directory counts */
    struct filecount {
      long dirs;
      long files;
    };

    /*
     * counts the number of files and directories in the specified directory.
     *
     * path - relative pathname of a directory whose files should be counted
     * counts - pointer to struct containing file/dir counts
     */
    void count(char *path, struct filecount *counts) {
        DIR *dir;                /* dir structure we are reading */
        struct dirent *ent;      /* directory entry currently being processed */
        char subpath[PATH_MAX];  /* buffer for building complete subdir and file names */
        /* Some systems don't have dirent.d_type field; we'll have to use stat() instead */
    #if !defined ( _DIRENT_HAVE_D_TYPE )
        struct stat statbuf;     /* buffer for stat() info */
    #endif

    /* fprintf(stderr, "Opening dir %s\n", path); */
        dir = opendir(path);

        /* opendir failed... file likely doesn't exist or isn't a directory */
        if(NULL == dir) {
            perror(path);
            return;
        }

        while((ent = readdir(dir))) {
          /* ">=" leaves room for the terminating NUL byte in subpath */
          if (strlen(path) + 1 + strlen(ent->d_name) >= PATH_MAX) {
              fprintf(stdout, "path too long (%zu) %s%c%s\n", (strlen(path) + 1 + strlen(ent->d_name)), path, PATH_SEPARATOR, ent->d_name);
              closedir(dir);
              return;
          }

    /* Use dirent.d_type if present, otherwise use stat() */
    #if defined ( _DIRENT_HAVE_D_TYPE )
    /* fprintf(stderr, "Using dirent.d_type\n"); */
          if(DT_DIR == ent->d_type) {
    #else
    /* fprintf(stderr, "Don't have dirent.d_type, falling back to using stat()\n"); */
          sprintf(subpath, "%s%c%s", path, PATH_SEPARATOR, ent->d_name);
          if(lstat(subpath, &statbuf)) {
              perror(subpath);
              closedir(dir);
              return;
          }

          if(S_ISDIR(statbuf.st_mode)) {
    #endif
              /* Skip "." and ".." directory entries... they are not "real" directories */
              if(0 == strcmp("..", ent->d_name) || 0 == strcmp(".", ent->d_name)) {
    /*              fprintf(stderr, "This is %s, skipping\n", ent->d_name); */
              } else {
                  sprintf(subpath, "%s%c%s", path, PATH_SEPARATOR, ent->d_name);
                  counts->dirs++;
                  count(subpath, counts);
              }
          } else {
              counts->files++;
          }
        }

    /* fprintf(stderr, "Closing dir %s\n", path); */
        closedir(dir);
    }

    int main(int argc, char *argv[]) {
        struct filecount counts;
        counts.files = 0;
        counts.dirs = 0;
        count(argv[1], &counts);

        /* If we found nothing, this is probably an error which has already been printed */
        if(0 < counts.files || 0 < counts.dirs) {
            printf("%s contains %ld files and %ld directories\n", argv[1], counts.files, counts.dirs);
        }

        return 0;
    }

EDIT 2017-01-17

I've incorporated two changes suggested by @FlyingCodeMonkey:

(1) Use lstat instead of stat. This will change the behavior of the program if you have symlinked directories in the directory you are scanning. The previous behavior was that the (linked) subdirectory would have its file count added to the overall count; the new behavior is that the linked directory will count as a single file, and its contents will not be counted.

(2) If the path of a file is too long, an error message will be emitted and the scan of that directory will stop.

EDIT 2017-06-29

With any luck, this will be the last edit of this answer :)

I've copied this code into a GitHub repository to make it a bit easier to get the code (instead of copy/paste, you can just download the source), plus it makes it easier for anyone to suggest a modification by submitting a pull-request from GitHub.

The source is available under Apache License 2.0. Patches* welcome!

* "patch" is what old people like me call a "pull request".
