[c#] What is the fastest way to create a checksum for large files in C#

I have to sync large files across several machines. The files can be up to 6 GB in size. The sync will be done manually every few weeks. I can't take the filenames into consideration because they can change at any time.

My plan is to create checksums on the destination PC and on the source PC, and then copy all files whose checksums are not already present on the destination. My first attempt was something like this:

using System;
using System.IO;
using System.Security.Cryptography;

private static string GetChecksum(string file)
{
    using (FileStream stream = File.OpenRead(file))
    using (SHA256Managed sha = new SHA256Managed())
    {
        byte[] checksum = sha.ComputeHash(stream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}

The problem was the runtime:
- SHA256 on a 1.6 GB file -> 20 minutes
- MD5 on a 1.6 GB file -> 6.15 minutes

Is there a better - faster - way to get the checksum (maybe with a better hash function)?

Tags: c#, .net, large-files, checksum



Ok - thanks to all of you - let me wrap this up:

  1. Using a "native" exe to do the hashing took the time from 6 minutes down to 10 seconds, which is huge.
  2. Increasing the buffer was even faster - a 1.6 GB file took 5.2 seconds using MD5 in .NET, so I will go with this solution (see the sketch below) - thanks again.
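
For reference, a minimal sketch of that buffered-MD5 approach; the 1 MB buffer size and the method name are illustrative, not taken from the answers below:

// Sketch of the "MD5 with a bigger read buffer" approach; the 1 MB buffer size is an assumption.
using System;
using System.IO;
using System.Security.Cryptography;

private static string GetMd5Checksum(string file)
{
    using (var md5 = MD5.Create())
    using (var stream = new BufferedStream(File.OpenRead(file), 1024 * 1024))
    {
        byte[] checksum = md5.ComputeHash(stream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}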

I know that I am late to the party, but I ran a test before actually implementing the solution.

I tested the built-in MD5 class against md5sum.exe. In my case the built-in class took 13 seconds, while md5sum.exe took around 16-18 seconds in every run.

    DateTime current = DateTime.Now;
    string file = @"C:\text.iso";//It's 2.5 Gb file
    string output;
    using (var md5 = MD5.Create())
    {
        using (var stream = File.OpenRead(file))
        {
            byte[] checksum = md5.ComputeHash(stream);
            output = BitConverter.ToString(checksum).Replace("-", String.Empty).ToLower();
            Console.WriteLine("Total seconds : " + (DateTime.Now - current).TotalSeconds.ToString() + " " + output);
        }
    }

You can have a look at xxHash.NET (https://github.com/wilhelmliao/xxHash.NET).
The xxHash algorithm seems to be faster than all the others.
Some benchmarks are on the xxHash site: https://github.com/Cyan4973/xxHash

PS: I've not used it yet.
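
For illustration, here is a minimal sketch of streaming a file through xxHash. It assumes the System.IO.Hashing NuGet package (XxHash64), not the xxHash.NET library linked above, so the package and names are my assumption, not from the answer:

// Sketch only: assumes the System.IO.Hashing NuGet package, not the xxHash.NET library linked above.
using System;
using System.IO;
using System.IO.Hashing;

static string GetXxHash64Checksum(string file)
{
    var hasher = new XxHash64();
    using (var stream = File.OpenRead(file))
    {
        hasher.Append(stream); // streams the whole file through the hash
    }
    return BitConverter.ToString(hasher.GetCurrentHash()).Replace("-", String.Empty);
}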


You're doing something wrong (probably a read buffer that is too small). On a machine of indecent age (Athlon 2x1800MP from 2002) whose disk DMA is probably out of whack (6.6 MB/s is awfully slow for sequential reads):

Create a 1 GB file with "random" data:

# dd if=/dev/sdb of=temp.dat bs=1M count=1024    
1073741824 bytes (1.1 GB) copied, 161.698 s, 6.6 MB/s

# time sha1sum -b temp.dat
abb88a0081f5db999d0701de2117d2cb21d192a2 *temp.dat

1m5.299s

# time md5sum -b temp.dat
9995e1c1a704f9c1eb6ca11e7ecb7276 *temp.dat

1m58.832s

This is also weird: md5 is consistently slower than sha1 for me (I reran it several times).


Invoke the Windows port of md5sum.exe. It's about twice as fast as the .NET implementation (at least on my machine, using a 1.2 GB file):

public static string Md5SumByProcess(string file) {
    var p = new Process();
    p.StartInfo.FileName = "md5sum.exe";
    p.StartInfo.Arguments = file;
    p.StartInfo.UseShellExecute = false;
    p.StartInfo.RedirectStandardOutput = true;
    p.Start();
    // read the output before waiting for exit, so the redirected output pipe can't fill up and block
    string output = p.StandardOutput.ReadToEnd();
    p.WaitForExit();
    return output.Split(' ')[0].Substring(1).ToUpper();
}

As Anton Gogolev noted, FileStream reads 4096 bytes at a time by default, but you can specify any other value using the FileStream constructor:

new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 16 * 1024 * 1024)

Note that Brad Abrams from Microsoft wrote in 2004:

there is zero benefit from wrapping a BufferedStream around a FileStream. We copied BufferedStream’s buffering logic into FileStream about 4 years ago to encourage better default performance

source


I did tests with different buffer sizes, running this code:

using (var stream = new BufferedStream(File.OpenRead(file), bufferSize))
using (var sha = new SHA256Managed())
{
    byte[] checksum = sha.ComputeHash(stream);
    return BitConverter.ToString(checksum).Replace("-", String.Empty).ToLower();
}

I tested with a 29.5 GB file; the results (buffer size in bytes: elapsed time) were:

  • 10,000: 369.24 s
  • 100,000: 362.55 s
  • 1,000,000: 361.53 s
  • 10,000,000: 434.15 s
  • 100,000,000: 435.15 s
  • 1,000,000,000: 434.31 s
  • And 376.22 s when using the original, unbuffered code.

I am running an i5 2500K CPU, 12 GB of RAM and an OCZ Vertex 4 256 GB SSD.

So I wondered how a standard 2 TB hard drive would do. The results were:

  • 10,000: 368.52 s
  • 100,000: 364.15 s
  • 1,000,000: 363.06 s
  • 10,000,000: 678.96 s
  • 100,000,000: 617.89 s
  • 1,000,000,000: 626.86 s
  • And 368.24 s unbuffered.

So I would recommend either no buffer, or a buffer of at most 1 million bytes.


Don't checksum the entire file; create checksums every 100 MB or so, so that each file has a collection of checksums.

Then, when comparing checksums, you can stop after the first differing checksum, getting out early and saving you from processing the entire file.

It'll still take the full time for identical files.
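
A rough sketch of that block-wise idea; the 100 MB block size, the names, and the use of MD5 here are illustrative assumptions:

// Illustrative sketch: one MD5 per 100 MB block, so two files can be compared
// block by block and the comparison can stop at the first mismatch.
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

static List<string> GetBlockChecksums(string file, int blockSize = 100 * 1024 * 1024)
{
    var checksums = new List<string>();
    var buffer = new byte[blockSize];
    using (var md5 = MD5.Create())
    using (var stream = File.OpenRead(file))
    {
        int filled;
        do
        {
            // fill the buffer completely (Read may return fewer bytes than requested)
            filled = 0;
            int read;
            while (filled < blockSize &&
                   (read = stream.Read(buffer, filled, blockSize - filled)) > 0)
            {
                filled += read;
            }
            if (filled > 0)
            {
                byte[] hash = md5.ComputeHash(buffer, 0, filled);
                checksums.Add(BitConverter.ToString(hash).Replace("-", String.Empty));
            }
        } while (filled == blockSize); // a short (or empty) block means end of file
    }
    return checksums;
}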

