[c#] How to write super-fast file-streaming code in C#?

I have to split a huge file into many smaller files. Each of the destination files is defined by an offset and length as the number of bytes. I'm using the following code:

private void copy(string srcFile, string dstFile, int offset, int length)
{
    BinaryReader reader = new BinaryReader(File.OpenRead(srcFile));
    reader.BaseStream.Seek(offset, SeekOrigin.Begin);
    byte[] buffer = reader.ReadBytes(length);

    BinaryWriter writer = new BinaryWriter(File.OpenWrite(dstFile));
    writer.Write(buffer);
}

Considering that I have to call this function about 100,000 times, it is remarkably slow.

  1. Is there a way to make the Writer connected directly to the Reader? (That is, without actually loading the contents into the Buffer in memory.)

This question is related to c# performance streaming cpu utilization

The answer is


The first thing I would recommend is to take measurements. Where are you losing your time? Is it in the read, or the write?

Over 100,000 accesses (sum the times): How much time is spent allocating the buffer array? How much time is spent opening the file for read (is it the same file every time?) How much time is spent in read and write operations?

If you aren't doing any type of transformation on the file, do you need a BinaryWriter, or can you use a filestream for writes? (try it, do you get identical output? does it save time?)


Using FileStream + StreamWriter I know it's possible to create massive files in little time (less than 1 min 30 seconds). I generate three files totaling 700+ megabytes from one file using that technique.

Your primary problem with the code you're using is that you are opening a file every time. That is creating file I/O overhead.

If you knew the names of the files you would be generating ahead of time, you could extract the File.OpenWrite into a separate method; it will increase the speed. Without seeing the code that determines how you are splitting the files, I don't think you can get much faster.


Have you considered using the CCR since you are writing to separate files you can do everything in parallel (read and write) and the CCR makes it very easy to do this.

static void Main(string[] args)
    {
        Dispatcher dp = new Dispatcher();
        DispatcherQueue dq = new DispatcherQueue("DQ", dp);

        Port<long> offsetPort = new Port<long>();

        Arbiter.Activate(dq, Arbiter.Receive<long>(true, offsetPort,
            new Handler<long>(Split)));

        FileStream fs = File.Open(file_path, FileMode.Open);
        long size = fs.Length;
        fs.Dispose();

        for (long i = 0; i < size; i += split_size)
        {
            offsetPort.Post(i);
        }
    }

    private static void Split(long offset)
    {
        FileStream reader = new FileStream(file_path, FileMode.Open, 
            FileAccess.Read);
        reader.Seek(offset, SeekOrigin.Begin);
        long toRead = 0;
        if (offset + split_size <= reader.Length)
            toRead = split_size;
        else
            toRead = reader.Length - offset;

        byte[] buff = new byte[toRead];
        reader.Read(buff, 0, (int)toRead);
        reader.Dispose();
        File.WriteAllBytes("c:\\out" + offset + ".txt", buff);
    }

This code posts offsets to a CCR port which causes a Thread to be created to execute the code in the Split method. This causes you to open the file multiple times but gets rid of the need for synchronization. You can make it more memory efficient but you'll have to sacrifice speed.


No one suggests threading? Writing the smaller files looks like text book example of where threads are useful. Set up a bunch of threads to create the smaller files. this way, you can create them all in parallel and you don't need to wait for each one to finish. My assumption is that creating the files(disk operation) will take WAY longer than splitting up the data. and of course you should verify first that a sequential approach is not adequate.


The fastest way to do file I/O from C# is to use the Windows ReadFile and WriteFile functions. I have written a C# class that encapsulates this capability as well as a benchmarking program that looks at differnet I/O methods, including BinaryReader and BinaryWriter. See my blog post at:

http://designingefficientsoftware.wordpress.com/2011/03/03/efficient-file-io-from-csharp/


You shouldn't re-open the source file each time you do a copy, better open it once and pass the resulting BinaryReader to the copy function. Also, it might help if you order your seeks, so you don't make big jumps inside the file.

If the lengths aren't too big, you can also try to group several copy calls by grouping offsets that are near to each other and reading the whole block you need for them, for example:

offset = 1234, length = 34
offset = 1300, length = 40
offset = 1350, length = 1000

can be grouped to one read:

offset = 1234, length = 1074

Then you only have to "seek" in your buffer and can write the three new files from there without having to read again.


How large is length? You may do better to re-use a fixed sized (moderately large, but not obscene) buffer, and forget BinaryReader... just use Stream.Read and Stream.Write.

(edit) something like:

private static void copy(string srcFile, string dstFile, int offset,
     int length, byte[] buffer)
{
    using(Stream inStream = File.OpenRead(srcFile))
    using (Stream outStream = File.OpenWrite(dstFile))
    {
        inStream.Seek(offset, SeekOrigin.Begin);
        int bufferLength = buffer.Length, bytesRead;
        while (length > bufferLength &&
            (bytesRead = inStream.Read(buffer, 0, bufferLength)) > 0)
        {
            outStream.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
        while (length > 0 &&
            (bytesRead = inStream.Read(buffer, 0, length)) > 0)
        {
            outStream.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
    }        
}

(For future reference.)

Quite possibly the fastest way to do this would be to use memory mapped files (so primarily copying memory, and the OS handling the file reads/writes via its paging/memory management).

Memory Mapped files are supported in managed code in .NET 4.0.

But as noted, you need to profile, and expect to switch to native code for maximum performance.


Examples related to c#

How can I convert this one line of ActionScript to C#? Microsoft Advertising SDK doesn't deliverer ads How to use a global array in C#? How to correctly write async method? C# - insert values from file into two arrays Uploading into folder in FTP? Are these methods thread safe? dotnet ef not found in .NET Core 3 HTTP Error 500.30 - ANCM In-Process Start Failure Best way to "push" into C# array

Examples related to performance

Why is 2 * (i * i) faster than 2 * i * i in Java? What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism? How to check if a key exists in Json Object and get its value Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly? Most efficient way to map function over numpy array The most efficient way to remove first N elements in a list? Fastest way to get the first n elements of a List into an Array Why is "1000000000000000 in range(1000000000000001)" so fast in Python 3? pandas loc vs. iloc vs. at vs. iat? Android Recyclerview vs ListView with Viewholder

Examples related to streaming

How to set bot's status Chrome hangs after certain amount of data transfered - waiting for available socket Best approach to real time http streaming to HTML5 video client Playing m3u8 Files with HTML Video Tag Live-stream video from one android phone to another over WiFi Install apk without downloading an attempt was made to access a socket in a way forbbiden by its access permissions. why? Download/Stream file from URL - asp.net Android video streaming example What is the difference between RTP or RTSP in a streaming server?

Examples related to cpu

Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2 Difference between core and processor How to enable support of CPU virtualization on Macbook Pro? How to get overall CPU usage (e.g. 57%) on Linux How to obtain the number of CPUs/cores in Linux from the command line? How to create a CPU spike with a bash command How to fast get Hardware-ID in C#? Optimal number of threads per core Linux Process States How to write super-fast file-streaming code in C#?

Examples related to utilization

How to write super-fast file-streaming code in C#?