[java] How do you determine the ideal buffer size when using FileInputStream?

I have a method that creates a MessageDigest (a hash) from a file, and I need to do this to a lot of files (>= 100,000). How big should I make the buffer used to read from the files to maximize performance?

Most everyone is familiar with the basic code (which I'll repeat here just in case):

MessageDigest md = MessageDigest.getInstance( "SHA" );
FileInputStream ios = new FileInputStream( "myfile.bmp" );
byte[] buffer = new byte[4 * 1024]; // what should this value be?
int read = 0;
while( ( read = ios.read( buffer ) ) > 0 )
    md.update( buffer, 0, read );
ios.close();
md.digest();

What is the ideal size of the buffer to maximize throughput? I know this is system dependent, and I'm pretty sure its OS, FileSystem, and HDD dependent, and there maybe other hardware/software in the mix.

(I should point out that I'm somewhat new to Java, so this may just be some Java API call I don't know about.)

Edit: I do not know ahead of time the kinds of systems this will be used on, so I can't assume a whole lot. (I'm using Java for that reason.)

Edit: The code above is missing things like try..catch to make the post smaller

This question is related to java performance file-io filesystems buffer

The answer is


You could use the BufferedStreams/readers and then use their buffer sizes.

I believe the BufferedXStreams are using 8192 as the buffer size, but like Ovidiu said, you should probably run a test on a whole bunch of options. Its really going to depend on the filesystem and disk configurations as to what the best sizes are.


1024 is appropriate for a wide variety of circumstances, although in practice you may see better performance with a larger or smaller buffer size.

This would depend on a number of factors including file system block size and CPU hardware.

It is also common to choose a power of 2 for the buffer size, since most underlying hardware is structured with fle block and cache sizes that are a power of 2. The Buffered classes allow you to specify the buffer size in the constructor. If none is provided, they use a default value, which is a power of 2 in most JVMs.

Regardless of which buffer size you choose, the biggest performance increase you will see is moving from nonbuffered to buffered file access. Adjusting the buffer size may improve performance slightly, but unless you are using an extremely small or extremely large buffer size, it is unlikely to have a signifcant impact.


As already mentioned in other answers, use BufferedInputStreams.

After that, I guess the buffer size does not really matter. Either the program is I/O bound, and growing buffer size over BIS default, will not make any big impact on performance.

Or the program is CPU bound inside the MessageDigest.update(), and majority of the time is not spent in the application code, so tweaking it will not help.

(Hmm... with multiple cores, threads might help.)


You could use the BufferedStreams/readers and then use their buffer sizes.

I believe the BufferedXStreams are using 8192 as the buffer size, but like Ovidiu said, you should probably run a test on a whole bunch of options. Its really going to depend on the filesystem and disk configurations as to what the best sizes are.


Reading files using Java NIO's FileChannel and MappedByteBuffer will most likely result in a solution that will be much faster than any solution involving FileInputStream. Basically, memory-map large files, and use direct buffers for small ones.


Yes, it's probably dependent on various things - but I doubt it will make very much difference. I tend to opt for 16K or 32K as a good balance between memory usage and performance.

Note that you should have a try/finally block in the code to make sure the stream is closed even if an exception is thrown.


Reading files using Java NIO's FileChannel and MappedByteBuffer will most likely result in a solution that will be much faster than any solution involving FileInputStream. Basically, memory-map large files, and use direct buffers for small ones.


In the ideal case we should have enough memory to read the file in one read operation. That would be the best performer because we let the system manage File System , allocation units and HDD at will. In practice you are fortunate to know the file sizes in advance, just use the average file size rounded up to 4K (default allocation unit on NTFS). And best of all : create a benchmark to test multiple options.


In most cases, it really doesn't matter that much. Just pick a good size such as 4K or 16K and stick with it. If you're positive that this is the bottleneck in your application, then you should start profiling to find the optimal buffer size. If you pick a size that's too small, you'll waste time doing extra I/O operations and extra function calls. If you pick a size that's too big, you'll start seeing a lot of cache misses which will really slow you down. Don't use a buffer bigger than your L2 cache size.


As already mentioned in other answers, use BufferedInputStreams.

After that, I guess the buffer size does not really matter. Either the program is I/O bound, and growing buffer size over BIS default, will not make any big impact on performance.

Or the program is CPU bound inside the MessageDigest.update(), and majority of the time is not spent in the application code, so tweaking it will not help.

(Hmm... with multiple cores, threads might help.)


In most cases, it really doesn't matter that much. Just pick a good size such as 4K or 16K and stick with it. If you're positive that this is the bottleneck in your application, then you should start profiling to find the optimal buffer size. If you pick a size that's too small, you'll waste time doing extra I/O operations and extra function calls. If you pick a size that's too big, you'll start seeing a lot of cache misses which will really slow you down. Don't use a buffer bigger than your L2 cache size.


Reading files using Java NIO's FileChannel and MappedByteBuffer will most likely result in a solution that will be much faster than any solution involving FileInputStream. Basically, memory-map large files, and use direct buffers for small ones.


1024 is appropriate for a wide variety of circumstances, although in practice you may see better performance with a larger or smaller buffer size.

This would depend on a number of factors including file system block size and CPU hardware.

It is also common to choose a power of 2 for the buffer size, since most underlying hardware is structured with fle block and cache sizes that are a power of 2. The Buffered classes allow you to specify the buffer size in the constructor. If none is provided, they use a default value, which is a power of 2 in most JVMs.

Regardless of which buffer size you choose, the biggest performance increase you will see is moving from nonbuffered to buffered file access. Adjusting the buffer size may improve performance slightly, but unless you are using an extremely small or extremely large buffer size, it is unlikely to have a signifcant impact.


In most cases, it really doesn't matter that much. Just pick a good size such as 4K or 16K and stick with it. If you're positive that this is the bottleneck in your application, then you should start profiling to find the optimal buffer size. If you pick a size that's too small, you'll waste time doing extra I/O operations and extra function calls. If you pick a size that's too big, you'll start seeing a lot of cache misses which will really slow you down. Don't use a buffer bigger than your L2 cache size.


Yes, it's probably dependent on various things - but I doubt it will make very much difference. I tend to opt for 16K or 32K as a good balance between memory usage and performance.

Note that you should have a try/finally block in the code to make sure the stream is closed even if an exception is thrown.


In the ideal case we should have enough memory to read the file in one read operation. That would be the best performer because we let the system manage File System , allocation units and HDD at will. In practice you are fortunate to know the file sizes in advance, just use the average file size rounded up to 4K (default allocation unit on NTFS). And best of all : create a benchmark to test multiple options.


Yes, it's probably dependent on various things - but I doubt it will make very much difference. I tend to opt for 16K or 32K as a good balance between memory usage and performance.

Note that you should have a try/finally block in the code to make sure the stream is closed even if an exception is thrown.


In BufferedInputStream‘s source you will find: private static int DEFAULT_BUFFER_SIZE = 8192;
So it's okey for you to use that default value.
But if you can figure out some more information you will get more valueable answers.
For example, your adsl maybe preffer a buffer of 1454 bytes, thats because TCP/IP's payload. For disks, you may use a value that match your disk's block size.


In most cases, it really doesn't matter that much. Just pick a good size such as 4K or 16K and stick with it. If you're positive that this is the bottleneck in your application, then you should start profiling to find the optimal buffer size. If you pick a size that's too small, you'll waste time doing extra I/O operations and extra function calls. If you pick a size that's too big, you'll start seeing a lot of cache misses which will really slow you down. Don't use a buffer bigger than your L2 cache size.


As already mentioned in other answers, use BufferedInputStreams.

After that, I guess the buffer size does not really matter. Either the program is I/O bound, and growing buffer size over BIS default, will not make any big impact on performance.

Or the program is CPU bound inside the MessageDigest.update(), and majority of the time is not spent in the application code, so tweaking it will not help.

(Hmm... with multiple cores, threads might help.)


In the ideal case we should have enough memory to read the file in one read operation. That would be the best performer because we let the system manage File System , allocation units and HDD at will. In practice you are fortunate to know the file sizes in advance, just use the average file size rounded up to 4K (default allocation unit on NTFS). And best of all : create a benchmark to test multiple options.


Yes, it's probably dependent on various things - but I doubt it will make very much difference. I tend to opt for 16K or 32K as a good balance between memory usage and performance.

Note that you should have a try/finally block in the code to make sure the stream is closed even if an exception is thrown.


As already mentioned in other answers, use BufferedInputStreams.

After that, I guess the buffer size does not really matter. Either the program is I/O bound, and growing buffer size over BIS default, will not make any big impact on performance.

Or the program is CPU bound inside the MessageDigest.update(), and majority of the time is not spent in the application code, so tweaking it will not help.

(Hmm... with multiple cores, threads might help.)


Reading files using Java NIO's FileChannel and MappedByteBuffer will most likely result in a solution that will be much faster than any solution involving FileInputStream. Basically, memory-map large files, and use direct buffers for small ones.


In the ideal case we should have enough memory to read the file in one read operation. That would be the best performer because we let the system manage File System , allocation units and HDD at will. In practice you are fortunate to know the file sizes in advance, just use the average file size rounded up to 4K (default allocation unit on NTFS). And best of all : create a benchmark to test multiple options.


You could use the BufferedStreams/readers and then use their buffer sizes.

I believe the BufferedXStreams are using 8192 as the buffer size, but like Ovidiu said, you should probably run a test on a whole bunch of options. Its really going to depend on the filesystem and disk configurations as to what the best sizes are.


In BufferedInputStream‘s source you will find: private static int DEFAULT_BUFFER_SIZE = 8192;
So it's okey for you to use that default value.
But if you can figure out some more information you will get more valueable answers.
For example, your adsl maybe preffer a buffer of 1454 bytes, thats because TCP/IP's payload. For disks, you may use a value that match your disk's block size.


Examples related to java

Under what circumstances can I call findViewById with an Options Menu / Action Bar item? How much should a function trust another function How to implement a simple scenario the OO way Two constructors How do I get some variable from another class in Java? this in equals method How to split a string in two and store it in a field How to do perspective fixing? String index out of range: 4 My eclipse won't open, i download the bundle pack it keeps saying error log

Examples related to performance

Why is 2 * (i * i) faster than 2 * i * i in Java? What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism? How to check if a key exists in Json Object and get its value Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly? Most efficient way to map function over numpy array The most efficient way to remove first N elements in a list? Fastest way to get the first n elements of a List into an Array Why is "1000000000000000 in range(1000000000000001)" so fast in Python 3? pandas loc vs. iloc vs. at vs. iat? Android Recyclerview vs ListView with Viewholder

Examples related to file-io

Python, Pandas : write content of DataFrame into text File Saving response from Requests to file How to while loop until the end of a file in Python without checking for empty line? Getting "java.nio.file.AccessDeniedException" when trying to write to a folder How do I add a resources folder to my Java project in Eclipse Read and write a String from text file Python Pandas: How to read only first n rows of CSV files in? Open files in 'rt' and 'wt' modes How to write to a file without overwriting current contents? Write objects into file with Node.js

Examples related to filesystems

Get an image extension from an uploaded file in Laravel Notepad++ cached files location No space left on device How to create a directory using Ansible best way to get folder and file list in Javascript Exploring Docker container's file system Remove directory which is not empty GIT_DISCOVERY_ACROSS_FILESYSTEM not set Trying to create a file in Android: open failed: EROFS (Read-only file system) Node.js check if path is file or directory

Examples related to buffer

Convert a JSON Object to Buffer and Buffer to JSON Object back C char array initialization Save Screen (program) output to a file Flushing buffers in C Char array to hex string C++ How to append binary data to a buffer in node.js Convert a binary NodeJS Buffer to JavaScript ArrayBuffer How to clear input buffer in C? How to find the socket buffer size of linux Ring Buffer in Java