[java] java get file size efficiently

While googling, I see that using java.io.File#length() can be slow. FileChannel has a size() method that is available as well.

Is there an efficient way in java to get the file size?

This question is related to java filesize

The answer is


Well, I tried to measure it up with the code below:

For runs = 1 and iterations = 1 the URL method is fastest most times followed by channel. I run this with some pause fresh about 10 times. So for one time access, using the URL is the fastest way I can think of:

LENGTH sum: 10626, per Iteration: 10626.0

CHANNEL sum: 5535, per Iteration: 5535.0

URL sum: 660, per Iteration: 660.0

For runs = 5 and iterations = 50 the picture draws different.

LENGTH sum: 39496, per Iteration: 157.984

CHANNEL sum: 74261, per Iteration: 297.044

URL sum: 95534, per Iteration: 382.136

File must be caching the calls to the filesystem, while channels and URL have some overhead.

Code:

import java.io.*;
import java.net.*;
import java.util.*;

public enum FileSizeBench {

    LENGTH {
        @Override
        public long getResult() throws Exception {
            File me = new File(FileSizeBench.class.getResource(
                    "FileSizeBench.class").getFile());
            return me.length();
        }
    },
    CHANNEL {
        @Override
        public long getResult() throws Exception {
            FileInputStream fis = null;
            try {
                File me = new File(FileSizeBench.class.getResource(
                        "FileSizeBench.class").getFile());
                fis = new FileInputStream(me);
                return fis.getChannel().size();
            } finally {
                fis.close();
            }
        }
    },
    URL {
        @Override
        public long getResult() throws Exception {
            InputStream stream = null;
            try {
                URL url = FileSizeBench.class
                        .getResource("FileSizeBench.class");
                stream = url.openStream();
                return stream.available();
            } finally {
                stream.close();
            }
        }
    };

    public abstract long getResult() throws Exception;

    public static void main(String[] args) throws Exception {
        int runs = 5;
        int iterations = 50;

        EnumMap<FileSizeBench, Long> durations = new EnumMap<FileSizeBench, Long>(FileSizeBench.class);

        for (int i = 0; i < runs; i++) {
            for (FileSizeBench test : values()) {
                if (!durations.containsKey(test)) {
                    durations.put(test, 0l);
                }
                long duration = testNow(test, iterations);
                durations.put(test, durations.get(test) + duration);
                // System.out.println(test + " took: " + duration + ", per iteration: " + ((double)duration / (double)iterations));
            }
        }

        for (Map.Entry<FileSizeBench, Long> entry : durations.entrySet()) {
            System.out.println();
            System.out.println(entry.getKey() + " sum: " + entry.getValue() + ", per Iteration: " + ((double)entry.getValue() / (double)(runs * iterations)));
        }

    }

    private static long testNow(FileSizeBench test, int iterations)
            throws Exception {
        long result = -1;
        long before = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            if (result == -1) {
                result = test.getResult();
                //System.out.println(result);
            } else if ((result = test.getResult()) != result) {
                 throw new Exception("variance detected!");
             }
        }
        return (System.nanoTime() - before) / 1000;
    }

}

When I modify your code to use a file accessed by an absolute path instead of a resource, I get a different result (for 1 run, 1 iteration, and a 100,000 byte file -- times for a 10 byte file are identical to 100,000 bytes)

LENGTH sum: 33, per Iteration: 33.0

CHANNEL sum: 3626, per Iteration: 3626.0

URL sum: 294, per Iteration: 294.0


I ran into this same issue. I needed to get the file size and modified date of 90,000 files on a network share. Using Java, and being as minimalistic as possible, it would take a very long time. (I needed to get the URL from the file, and the path of the object as well. So its varied somewhat, but more than an hour.) I then used a native Win32 executable, and did the same task, just dumping the file path, modified, and size to the console, and executed that from Java. The speed was amazing. The native process, and my string handling to read the data could process over 1000 items a second.

So even though people down ranked the above comment, this is a valid solution, and did solve my issue. In my case I knew the folders I needed the sizes of ahead of time, and I could pass that in the command line to my win32 app. I went from hours to process a directory to minutes.

The issue did also seem to be Windows specific. OS X did not have the same issue and could access network file info as fast as the OS could do so.

Java File handling on Windows is terrible. Local disk access for files is fine though. It was just network shares that caused the terrible performance. Windows could get info on the network share and calculate the total size in under a minute too.

--Ben


All the test cases in this post are flawed as they access the same file for each method tested. So disk caching kicks in which tests 2 and 3 benefit from. To prove my point I took test case provided by GHAD and changed the order of enumeration and below are the results.

Looking at result I think File.length() is the winner really.

Order of test is the order of output. You can even see the time taken on my machine varied between executions but File.Length() when not first, and incurring first disk access won.

---
LENGTH sum: 1163351, per Iteration: 4653.404
CHANNEL sum: 1094598, per Iteration: 4378.392
URL sum: 739691, per Iteration: 2958.764

---
CHANNEL sum: 845804, per Iteration: 3383.216
URL sum: 531334, per Iteration: 2125.336
LENGTH sum: 318413, per Iteration: 1273.652

--- 
URL sum: 137368, per Iteration: 549.472
LENGTH sum: 18677, per Iteration: 74.708
CHANNEL sum: 142125, per Iteration: 568.5

If you want the file size of multiple files in a directory, use Files.walkFileTree. You can obtain the size from the BasicFileAttributes that you'll receive.

This is much faster then calling .length() on the result of File.listFiles() or using Files.size() on the result of Files.newDirectoryStream(). In my test cases it was about 100 times faster.


In response to rgrig's benchmark, the time taken to open/close the FileChannel & RandomAccessFile instances also needs to be taken into account, as these classes will open a stream for reading the file.

After modifying the benchmark, I got these results for 1 iterations on a 85MB file:

file totalTime: 48000 (48 us)
raf totalTime: 261000 (261 us)
channel totalTime: 7020000 (7 ms)

For 10000 iterations on same file:

file totalTime: 80074000 (80 ms)
raf totalTime: 295417000 (295 ms)
channel totalTime: 368239000 (368 ms)

If all you need is the file size, file.length() is the fastest way to do it. If you plan to use the file for other purposes like reading/writing, then RAF seems to be a better bet. Just don't forget to close the file connection :-)

import java.io.File;
import java.io.FileInputStream;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.util.HashMap;
import java.util.Map;

public class FileSizeBench
{    
    public static void main(String[] args) throws Exception
    {
        int iterations = 1;
        String fileEntry = args[0];

        Map<String, Long> times = new HashMap<String, Long>();
        times.put("file", 0L);
        times.put("channel", 0L);
        times.put("raf", 0L);

        long fileSize;
        long start;
        long end;
        File f1;
        FileChannel channel;
        RandomAccessFile raf;

        for (int i = 0; i < iterations; i++)
        {
            // file.length()
            start = System.nanoTime();
            f1 = new File(fileEntry);
            fileSize = f1.length();
            end = System.nanoTime();
            times.put("file", times.get("file") + end - start);

            // channel.size()
            start = System.nanoTime();
            channel = new FileInputStream(fileEntry).getChannel();
            fileSize = channel.size();
            channel.close();
            end = System.nanoTime();
            times.put("channel", times.get("channel") + end - start);

            // raf.length()
            start = System.nanoTime();
            raf = new RandomAccessFile(fileEntry, "r");
            fileSize = raf.length();
            raf.close();
            end = System.nanoTime();
            times.put("raf", times.get("raf") + end - start);
        }

        for (Map.Entry<String, Long> entry : times.entrySet()) {
            System.out.println(entry.getKey() + " totalTime: " + entry.getValue() + " (" + getTime(entry.getValue()) + ")");
        }
    }

    public static String getTime(Long timeTaken)
    {
        if (timeTaken < 1000) {
            return timeTaken + " ns";
        } else if (timeTaken < (1000*1000)) {
            return timeTaken/1000 + " us"; 
        } else {
            return timeTaken/(1000*1000) + " ms";
        } 
    }
}

From GHad's benchmark, there are a few issue people have mentioned:

1>Like BalusC mentioned: stream.available() is flowed in this case.

Because available() returns an estimate of the number of bytes that can be read (or skipped over) from this input stream without blocking by the next invocation of a method for this input stream.

So 1st to remove the URL this approach.

2>As StuartH mentioned - the order the test run also make the cache difference, so take that out by run the test separately.


Now start test:

When CHANNEL one run alone:

CHANNEL sum: 59691, per Iteration: 238.764

When LENGTH one run alone:

LENGTH sum: 48268, per Iteration: 193.072

So looks like the LENGTH one is the winner here:

@Override
public long getResult() throws Exception {
    File me = new File(FileSizeBench.class.getResource(
            "FileSizeBench.class").getFile());
    return me.length();
}

The benchmark given by GHad measures lots of other stuff (such as reflection, instantiating objects, etc.) besides getting the length. If we try to get rid of these things then for one call I get the following times in microseconds:

   file sum___19.0, per Iteration___19.0
    raf sum___16.0, per Iteration___16.0
channel sum__273.0, per Iteration__273.0

For 100 runs and 10000 iterations I get:

   file sum__1767629.0, per Iteration__1.7676290000000001
    raf sum___881284.0, per Iteration__0.8812840000000001
channel sum___414286.0, per Iteration__0.414286

I did run the following modified code giving as an argument the name of a 100MB file.

import java.io.*;
import java.nio.channels.*;
import java.net.*;
import java.util.*;

public class FileSizeBench {

  private static File file;
  private static FileChannel channel;
  private static RandomAccessFile raf;

  public static void main(String[] args) throws Exception {
    int runs = 1;
    int iterations = 1;

    file = new File(args[0]);
    channel = new FileInputStream(args[0]).getChannel();
    raf = new RandomAccessFile(args[0], "r");

    HashMap<String, Double> times = new HashMap<String, Double>();
    times.put("file", 0.0);
    times.put("channel", 0.0);
    times.put("raf", 0.0);

    long start;
    for (int i = 0; i < runs; ++i) {
      long l = file.length();

      start = System.nanoTime();
      for (int j = 0; j < iterations; ++j)
        if (l != file.length()) throw new Exception();
      times.put("file", times.get("file") + System.nanoTime() - start);

      start = System.nanoTime();
      for (int j = 0; j < iterations; ++j)
        if (l != channel.size()) throw new Exception();
      times.put("channel", times.get("channel") + System.nanoTime() - start);

      start = System.nanoTime();
      for (int j = 0; j < iterations; ++j)
        if (l != raf.length()) throw new Exception();
      times.put("raf", times.get("raf") + System.nanoTime() - start);
    }
    for (Map.Entry<String, Double> entry : times.entrySet()) {
        System.out.println(
            entry.getKey() + " sum: " + 1e-3 * entry.getValue() +
            ", per Iteration: " + (1e-3 * entry.getValue() / runs / iterations));
    }
  }
}

Actually, I think the "ls" may be faster. There are definitely some issues in Java dealing with getting File info. Unfortunately there is no equivalent safe method of recursive ls for Windows. (cmd.exe's DIR /S can get confused and generate errors in infinite loops)

On XP, accessing a server on the LAN, it takes me 5 seconds in Windows to get the count of the files in a folder (33,000), and the total size.

When I iterate recursively through this in Java, it takes me over 5 minutes. I started measuring the time it takes to do file.length(), file.lastModified(), and file.toURI() and what I found is that 99% of my time is taken by those 3 calls. The 3 calls I actually need to do...

The difference for 1000 files is 15ms local versus 1800ms on server. The server path scanning in Java is ridiculously slow. If the native OS can be fast at scanning that same folder, why can't Java?

As a more complete test, I used WineMerge on XP to compare the modified date, and size of the files on the server versus the files locally. This was iterating over the entire directory tree of 33,000 files in each folder. Total time, 7 seconds. java: over 5 minutes.

So the original statement and question from the OP is true, and valid. Its less noticeable when dealing with a local file system. Doing a local compare of the folder with 33,000 items takes 3 seconds in WinMerge, and takes 32 seconds locally in Java. So again, java versus native is a 10x slowdown in these rudimentary tests.

Java 1.6.0_22 (latest), Gigabit LAN, and network connections, ping is less than 1ms (both in the same switch)

Java is slow.