How do you determine the ideal buffer size when using FileInputStream

Question

I have a method that creates a MessageDigest  a hash  from a file  and I need to do this to a lot of files     100 000   How big should I make the buffer used to read from the files to maximize performance   Most everyone is familiar with the basic code  which I ll repeat here just in case    MessageDigest md   MessageDigest getInstance   SHA     FileInputStream ios   new FileInputStream   myfile bmp     byte   buffer   new byte 4   1024      what should this value be  int read   0  while    read   ios read  buffer      gt  0       md update  buffer  0  read    ios close    md digest      What is the ideal size of the buffer to maximize throughput  I know this is system dependent  and I m pretty sure its OS  FileSystem  and HDD dependent  and there maybe other hardware software in the mix     I should point out that I m somewhat new to Java  so this may just be some Java API call I don t know about    Edit  I do not know ahead of time the kinds of systems this will be used on  so I can t assume a whole lot   I m using Java for that reason    Edit  The code above is missing things like try  catch to make the post smaller

User · Answer

In BufferedInputStream   s source you will find  private static int DEFAULT BUFFER SIZE   8192  So it s okey for you to use that default value  But if you can figure out some more information you will get more valueable answers  For example  your adsl maybe preffer a buffer of 1454 bytes  thats because TCP IP s payload  For disks  you may use a value that match your disk s block size

User · Answer

You could use the BufferedStreams readers and then use their buffer sizes   I believe the BufferedXStreams are using 8192 as the buffer size  but like Ovidiu said  you should probably run a test on a whole bunch of options   Its really going to depend on the filesystem and disk configurations as to what the best sizes are

User · Answer

As already mentioned in other answers  use BufferedInputStreams   After that  I guess the buffer size does not really matter  Either the program is I O bound  and growing buffer size over BIS default  will not make any big impact on performance   Or the program is CPU bound inside the MessageDigest update    and majority of the time is not spent in the application code  so tweaking it will not help    Hmm    with multiple cores  threads might help

User · Answer

You could use the BufferedStreams readers and then use their buffer sizes   I believe the BufferedXStreams are using 8192 as the buffer size  but like Ovidiu said  you should probably run a test on a whole bunch of options   Its really going to depend on the filesystem and disk configurations as to what the best sizes are

User · Answer

In most cases  it really doesn t matter that much   Just pick a good size such as 4K or 16K and stick with it   If you re positive that this is the bottleneck in your application  then you should start profiling to find the optimal buffer size   If you pick a size that s too small  you ll waste time doing extra I O operations and extra function calls   If you pick a size that s too big  you ll start seeing a lot of cache misses which will really slow you down   Don t use a buffer bigger than your L2 cache size

User · Answer

In the ideal case we should have enough memory to read the file in one read operation  That would be the best performer because we let the system manage File System   allocation units and HDD at will  In practice you are fortunate to know the file sizes in advance  just use the average file size rounded up to 4K  default allocation unit on NTFS   And best of all   create a benchmark to test multiple options

User · Answer

1024 is appropriate for a wide variety of circumstances  although in practice you may see better performance with a larger or smaller buffer size    This would depend on a number of factors including file system block size and CPU hardware   It is also common to choose a power of 2 for the buffer size  since most underlying hardware is structured with fle block and cache sizes that are a power of 2  The Buffered classes allow you to specify the buffer size in the constructor  If none is provided  they use a default value  which is a power of 2 in most JVMs   Regardless of which buffer size you choose  the biggest performance increase you will see is moving from nonbuffered to buffered file access  Adjusting the buffer size may improve performance slightly  but unless you are using an extremely small or extremely large buffer size  it is unlikely to have a signifcant impact

User · Answer

As already mentioned in other answers  use BufferedInputStreams   After that  I guess the buffer size does not really matter  Either the program is I O bound  and growing buffer size over BIS default  will not make any big impact on performance   Or the program is CPU bound inside the MessageDigest update    and majority of the time is not spent in the application code  so tweaking it will not help    Hmm    with multiple cores  threads might help

User · Answer

Reading files using Java NIO s FileChannel and MappedByteBuffer will most likely result in a solution that will be much faster than any solution involving FileInputStream  Basically  memory-map large files  and use direct buffers for small ones

User · Answer

In the ideal case we should have enough memory to read the file in one read operation  That would be the best performer because we let the system manage File System   allocation units and HDD at will  In practice you are fortunate to know the file sizes in advance  just use the average file size rounded up to 4K  default allocation unit on NTFS   And best of all   create a benchmark to test multiple options

User · Answer

1024 is appropriate for a wide variety of circumstances  although in practice you may see better performance with a larger or smaller buffer size    This would depend on a number of factors including file system block size and CPU hardware   It is also common to choose a power of 2 for the buffer size  since most underlying hardware is structured with fle block and cache sizes that are a power of 2  The Buffered classes allow you to specify the buffer size in the constructor  If none is provided  they use a default value  which is a power of 2 in most JVMs   Regardless of which buffer size you choose  the biggest performance increase you will see is moving from nonbuffered to buffered file access  Adjusting the buffer size may improve performance slightly  but unless you are using an extremely small or extremely large buffer size  it is unlikely to have a signifcant impact

User · Answer

In the ideal case we should have enough memory to read the file in one read operation  That would be the best performer because we let the system manage File System   allocation units and HDD at will  In practice you are fortunate to know the file sizes in advance  just use the average file size rounded up to 4K  default allocation unit on NTFS   And best of all   create a benchmark to test multiple options

User · Answer

You could use the BufferedStreams readers and then use their buffer sizes   I believe the BufferedXStreams are using 8192 as the buffer size  but like Ovidiu said  you should probably run a test on a whole bunch of options   Its really going to depend on the filesystem and disk configurations as to what the best sizes are

User · Answer

In most cases  it really doesn t matter that much   Just pick a good size such as 4K or 16K and stick with it   If you re positive that this is the bottleneck in your application  then you should start profiling to find the optimal buffer size   If you pick a size that s too small  you ll waste time doing extra I O operations and extra function calls   If you pick a size that s too big  you ll start seeing a lot of cache misses which will really slow you down   Don t use a buffer bigger than your L2 cache size

User · Answer

Yes  it s probably dependent on various things - but I doubt it will make very much difference  I tend to opt for 16K or 32K as a good balance between memory usage and performance   Note that you should have a try finally block in the code to make sure the stream is closed even if an exception is thrown

User · Answer

Optimum buffer size is related to a number of things   file system block size  CPU cache size and cache latency   Most file systems are configured to use block sizes of 4096 or 8192   In theory  if you configure your buffer size so you are reading a few bytes more than the disk block  the operations with the file system can be extremely inefficient  i e  if you configured your buffer to read 4100 bytes at a time  each read would require 2 block reads by the file system    If the blocks are already in cache  then you wind up paying the price of RAM -  L3 L2 cache latency   If you are unlucky and the blocks are not in cache yet  the you pay the price of the disk- RAM latency as well   This is why you see most buffers sized as a power of 2  and generally larger than  or equal to  the disk block size   This means that one of your stream reads could result in multiple disk block reads - but those reads will always use a full block - no wasted reads   Now  this is offset quite a bit in a typical streaming scenario because the block that is read from disk is going to still be in memory when you hit the next read  we are doing sequential reads here  after all  - so you wind up paying the RAM -  L3 L2 cache latency price on the next read  but not the disk- RAM latency   In terms of order of magnitude  disk- RAM latency is so slow that it pretty much swamps any other latency you might be dealing with   So  I suspect that if you ran a test with different cache sizes  haven t done this myself   you will probably find a big impact of cache size up to the size of the file system block   Above that  I suspect that things would level out pretty quickly   There are a ton of conditions and exceptions here - the complexities of the system are actually quite staggering  just getting a handle on L3 -  L2 cache transfers is mind bogglingly complex  and it changes with every CPU type    This leads to the  real world  answer   If your app is like 99  out there  set the cache size to 8192 and move on  even better  choose encapsulation over performance and use BufferedInputStream to hide the details    If you are in the 1  of apps that are highly dependent on disk throughput  craft your implementation so you can swap out different disk interaction strategies  and provide the knobs and dials to allow your users to test and optimize  or come up with some self optimizing system

User · Answer

You could use the BufferedStreams readers and then use their buffer sizes   I believe the BufferedXStreams are using 8192 as the buffer size  but like Ovidiu said  you should probably run a test on a whole bunch of options   Its really going to depend on the filesystem and disk configurations as to what the best sizes are

User · Answer

Reading files using Java NIO s FileChannel and MappedByteBuffer will most likely result in a solution that will be much faster than any solution involving FileInputStream  Basically  memory-map large files  and use direct buffers for small ones

User · Answer

Reading files using Java NIO s FileChannel and MappedByteBuffer will most likely result in a solution that will be much faster than any solution involving FileInputStream  Basically  memory-map large files  and use direct buffers for small ones

User · Answer

In most cases  it really doesn t matter that much   Just pick a good size such as 4K or 16K and stick with it   If you re positive that this is the bottleneck in your application  then you should start profiling to find the optimal buffer size   If you pick a size that s too small  you ll waste time doing extra I O operations and extra function calls   If you pick a size that s too big  you ll start seeing a lot of cache misses which will really slow you down   Don t use a buffer bigger than your L2 cache size

User · Answer

Optimum buffer size is related to a number of things   file system block size  CPU cache size and cache latency   Most file systems are configured to use block sizes of 4096 or 8192   In theory  if you configure your buffer size so you are reading a few bytes more than the disk block  the operations with the file system can be extremely inefficient  i e  if you configured your buffer to read 4100 bytes at a time  each read would require 2 block reads by the file system    If the blocks are already in cache  then you wind up paying the price of RAM -  L3 L2 cache latency   If you are unlucky and the blocks are not in cache yet  the you pay the price of the disk- RAM latency as well   This is why you see most buffers sized as a power of 2  and generally larger than  or equal to  the disk block size   This means that one of your stream reads could result in multiple disk block reads - but those reads will always use a full block - no wasted reads   Now  this is offset quite a bit in a typical streaming scenario because the block that is read from disk is going to still be in memory when you hit the next read  we are doing sequential reads here  after all  - so you wind up paying the RAM -  L3 L2 cache latency price on the next read  but not the disk- RAM latency   In terms of order of magnitude  disk- RAM latency is so slow that it pretty much swamps any other latency you might be dealing with   So  I suspect that if you ran a test with different cache sizes  haven t done this myself   you will probably find a big impact of cache size up to the size of the file system block   Above that  I suspect that things would level out pretty quickly   There are a ton of conditions and exceptions here - the complexities of the system are actually quite staggering  just getting a handle on L3 -  L2 cache transfers is mind bogglingly complex  and it changes with every CPU type    This leads to the  real world  answer   If your app is like 99  out there  set the cache size to 8192 and move on  even better  choose encapsulation over performance and use BufferedInputStream to hide the details    If you are in the 1  of apps that are highly dependent on disk throughput  craft your implementation so you can swap out different disk interaction strategies  and provide the knobs and dials to allow your users to test and optimize  or come up with some self optimizing system

User · Answer

Yes  it s probably dependent on various things - but I doubt it will make very much difference  I tend to opt for 16K or 32K as a good balance between memory usage and performance   Note that you should have a try finally block in the code to make sure the stream is closed even if an exception is thrown

User · Answer

Yes  it s probably dependent on various things - but I doubt it will make very much difference  I tend to opt for 16K or 32K as a good balance between memory usage and performance   Note that you should have a try finally block in the code to make sure the stream is closed even if an exception is thrown

User · Answer

Reading files using Java NIO s FileChannel and MappedByteBuffer will most likely result in a solution that will be much faster than any solution involving FileInputStream  Basically  memory-map large files  and use direct buffers for small ones

User · Answer

Optimum buffer size is related to a number of things   file system block size  CPU cache size and cache latency   Most file systems are configured to use block sizes of 4096 or 8192   In theory  if you configure your buffer size so you are reading a few bytes more than the disk block  the operations with the file system can be extremely inefficient  i e  if you configured your buffer to read 4100 bytes at a time  each read would require 2 block reads by the file system    If the blocks are already in cache  then you wind up paying the price of RAM -  L3 L2 cache latency   If you are unlucky and the blocks are not in cache yet  the you pay the price of the disk- RAM latency as well   This is why you see most buffers sized as a power of 2  and generally larger than  or equal to  the disk block size   This means that one of your stream reads could result in multiple disk block reads - but those reads will always use a full block - no wasted reads   Now  this is offset quite a bit in a typical streaming scenario because the block that is read from disk is going to still be in memory when you hit the next read  we are doing sequential reads here  after all  - so you wind up paying the RAM -  L3 L2 cache latency price on the next read  but not the disk- RAM latency   In terms of order of magnitude  disk- RAM latency is so slow that it pretty much swamps any other latency you might be dealing with   So  I suspect that if you ran a test with different cache sizes  haven t done this myself   you will probably find a big impact of cache size up to the size of the file system block   Above that  I suspect that things would level out pretty quickly   There are a ton of conditions and exceptions here - the complexities of the system are actually quite staggering  just getting a handle on L3 -  L2 cache transfers is mind bogglingly complex  and it changes with every CPU type    This leads to the  real world  answer   If your app is like 99  out there  set the cache size to 8192 and move on  even better  choose encapsulation over performance and use BufferedInputStream to hide the details    If you are in the 1  of apps that are highly dependent on disk throughput  craft your implementation so you can swap out different disk interaction strategies  and provide the knobs and dials to allow your users to test and optimize  or come up with some self optimizing system

User · Answer

As already mentioned in other answers  use BufferedInputStreams   After that  I guess the buffer size does not really matter  Either the program is I O bound  and growing buffer size over BIS default  will not make any big impact on performance   Or the program is CPU bound inside the MessageDigest update    and majority of the time is not spent in the application code  so tweaking it will not help    Hmm    with multiple cores  threads might help

User · Answer

As already mentioned in other answers  use BufferedInputStreams   After that  I guess the buffer size does not really matter  Either the program is I O bound  and growing buffer size over BIS default  will not make any big impact on performance   Or the program is CPU bound inside the MessageDigest update    and majority of the time is not spent in the application code  so tweaking it will not help    Hmm    with multiple cores  threads might help

User · Answer

Optimum buffer size is related to a number of things   file system block size  CPU cache size and cache latency   Most file systems are configured to use block sizes of 4096 or 8192   In theory  if you configure your buffer size so you are reading a few bytes more than the disk block  the operations with the file system can be extremely inefficient  i e  if you configured your buffer to read 4100 bytes at a time  each read would require 2 block reads by the file system    If the blocks are already in cache  then you wind up paying the price of RAM -  L3 L2 cache latency   If you are unlucky and the blocks are not in cache yet  the you pay the price of the disk- RAM latency as well   This is why you see most buffers sized as a power of 2  and generally larger than  or equal to  the disk block size   This means that one of your stream reads could result in multiple disk block reads - but those reads will always use a full block - no wasted reads   Now  this is offset quite a bit in a typical streaming scenario because the block that is read from disk is going to still be in memory when you hit the next read  we are doing sequential reads here  after all  - so you wind up paying the RAM -  L3 L2 cache latency price on the next read  but not the disk- RAM latency   In terms of order of magnitude  disk- RAM latency is so slow that it pretty much swamps any other latency you might be dealing with   So  I suspect that if you ran a test with different cache sizes  haven t done this myself   you will probably find a big impact of cache size up to the size of the file system block   Above that  I suspect that things would level out pretty quickly   There are a ton of conditions and exceptions here - the complexities of the system are actually quite staggering  just getting a handle on L3 -  L2 cache transfers is mind bogglingly complex  and it changes with every CPU type    This leads to the  real world  answer   If your app is like 99  out there  set the cache size to 8192 and move on  even better  choose encapsulation over performance and use BufferedInputStream to hide the details    If you are in the 1  of apps that are highly dependent on disk throughput  craft your implementation so you can swap out different disk interaction strategies  and provide the knobs and dials to allow your users to test and optimize  or come up with some self optimizing system

User · Answer

In most cases  it really doesn t matter that much   Just pick a good size such as 4K or 16K and stick with it   If you re positive that this is the bottleneck in your application  then you should start profiling to find the optimal buffer size   If you pick a size that s too small  you ll waste time doing extra I O operations and extra function calls   If you pick a size that s too big  you ll start seeing a lot of cache misses which will really slow you down   Don t use a buffer bigger than your L2 cache size

User · Answer

In the ideal case we should have enough memory to read the file in one read operation  That would be the best performer because we let the system manage File System   allocation units and HDD at will  In practice you are fortunate to know the file sizes in advance  just use the average file size rounded up to 4K  default allocation unit on NTFS   And best of all   create a benchmark to test multiple options

User · Answer

Yes  it s probably dependent on various things - but I doubt it will make very much difference  I tend to opt for 16K or 32K as a good balance between memory usage and performance   Note that you should have a try finally block in the code to make sure the stream is closed even if an exception is thrown

User · Answer

In BufferedInputStream   s source you will find  private static int DEFAULT BUFFER SIZE   8192  So it s okey for you to use that default value  But if you can figure out some more information you will get more valueable answers  For example  your adsl maybe preffer a buffer of 1454 bytes  thats because TCP IP s payload  For disks  you may use a value that match your disk s block size

[java] How do you determine the ideal buffer size when using FileInputStream?

Examples related to java

Examples related to performance

Examples related to file-io

Examples related to filesystems

Examples related to buffer