Reading large text files with streams in C#

Question

I've got the lovely task of working out how to handle large files being loaded into our application's script editor (it's like VBA for our internal product for quick macros). Most files are about 300-400 KB, which is fine loading. But when they go beyond 100 MB the process has a hard time (as you'd expect).

What happens is that the file is read and shoved into a RichTextBox which is then navigated. Don't worry too much about this part.

The developer who wrote the initial code is simply using a StreamReader and doing

[Reader].ReadToEnd()

which could take quite a while to complete.

My task is to break this bit of code up, read it in chunks into a buffer and show a progress bar with an option to cancel it.

Some assumptions:

- Most files will be 30-40 MB.
- The contents of the file is text (not binary); some are Unix format, some are DOS.
- Once the contents is retrieved we work out what terminator is used.
- No-one's concerned, once it's loaded, about the time it takes to render in the RichTextBox. It's just the initial load of the text.

Now for the questions:

- Can I simply use StreamReader, then check the Length property (so ProgressMax), issue a Read for a set buffer size and iterate through in a while loop WHILST inside a background worker, so it doesn't block the main UI thread? Then return the StringBuilder to the main thread once it's completed.
- The contents will be going to a StringBuilder. Can I initialise the StringBuilder with the size of the stream if the length is available?

Are these (in your professional opinions) good ideas? I've had a few issues in the past with reading content from Streams, because it will always miss the last few bytes or something, but I'll ask another question if this is the case.

User · Answer

An iterator might be perfect for this type of work:

public static IEnumerable<int> LoadFileWithProgress(string filename, StringBuilder stringData)
{
    const int charBufferSize = 4096;
    using (FileStream fs = File.OpenRead(filename))
    {
        using (BinaryReader br = new BinaryReader(fs))
        {
            long length = fs.Length;
            int numberOfChunks = Convert.ToInt32((length / charBufferSize)) + 1;
            double iter = 100 / Convert.ToDouble(numberOfChunks);
            double currentIter = 0;
            yield return Convert.ToInt32(currentIter);
            while (true)
            {
                char[] buffer = br.ReadChars(charBufferSize);
                if (buffer.Length == 0) break;
                stringData.Append(buffer);
                currentIter += iter;
                yield return Convert.ToInt32(currentIter);
            }
        }
    }
}

You can call it using the following:

string filename = "C:\\myfile.txt";
StringBuilder sb = new StringBuilder();
foreach (int progress in LoadFileWithProgress(filename, sb))
{
    // Update your progress counter here!
}
string fileData = sb.ToString();

As the file is loaded, the iterator will return the progress number from 0 to 100, which you can use to update your progress bar. Once the loop has finished, the StringBuilder will contain the contents of the text file.

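Also, because you want text, we can use BinaryReader to read in characters rather than bytes, which keeps your buffers aligned on character boundaries when the file contains multi-byte characters. Note that BinaryReader decodes as UTF-8 unless a different Encoding is passed to its constructor, so a UTF-16 file needs new BinaryReader(fs, Encoding.Unicode).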

This is all done without using background tasks, threads, or complex custom state machines.
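For the progress-and-cancel requirement in the question, a minimal sketch along the same lines is shown here, assuming StreamReader is used instead of BinaryReader so that a byte order mark can be detected. ChunkedTextLoader and Load are hypothetical names, the 64M-character capacity cap is an arbitrary choice, and the percentage is only approximate because the underlying stream position runs slightly ahead of the characters actually returned.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading;

public static class ChunkedTextLoader
{
    // Hypothetical helper (not from the original answers): reads a text file in
    // fixed-size character chunks, reports approximate progress from 0 to 100,
    // and stops with OperationCanceledException when cancellation is requested.
    public static string Load(string path, IProgress<int> progress, CancellationToken token)
    {
        const int charBufferSize = 4096;

        // UTF-8 is assumed unless a byte order mark says otherwise.
        using (var reader = new StreamReader(path, Encoding.UTF8, true))
        {
            long totalBytes = reader.BaseStream.Length;

            // The byte length is an upper bound on the character count for UTF-8.
            // It is only a capacity hint, capped at an arbitrary 64M characters.
            var sb = new StringBuilder((int)Math.Min(totalBytes, 64 * 1024 * 1024));
            var buffer = new char[charBufferSize];
            int lastPercent = -1;
            int charsRead;

            while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                token.ThrowIfCancellationRequested();

                // Append only the characters actually read, so the final
                // (usually shorter) chunk does not drag in stale data.
                sb.Append(buffer, 0, charsRead);

                // Position is measured in bytes and includes the reader's own
                // read-ahead, so this is an estimate rather than an exact figure.
                int percent = (int)(reader.BaseStream.Position * 100 / totalBytes);
                if (percent != lastPercent)
                {
                    lastPercent = percent;
                    progress?.Report(percent);
                }
            }

            if (lastPercent != 100) progress?.Report(100);
            return sb.ToString();
        }
    }
}
```

From a UI, this could be called inside Task.Run (or a BackgroundWorker) with a Progress<int> created on the UI thread and a CancellationTokenSource wired to the Cancel button.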

User · Answer

If you read the performance and benchmark stats on this website, you'll see that the fastest way to read a text file (because reading, writing, and processing are all different) is the following snippet of code:

using (StreamReader sr = File.OpenText(fileName))
{
    string s = String.Empty;
    while ((s = sr.ReadLine()) != null)
    {
        // do your stuff here
    }
}

All up, about 9 different methods were benchmarked, but that one seems to come out ahead the majority of the time, even outperforming the buffered reader that other answers have mentioned.

User · Answer

Have a look at the following code snippet. You have mentioned that most files will be 30-40 MB. This claims to read 180 MB in 1.4 seconds on an Intel Quad Core:

private int _bufferSize = 16384;

private void ReadFile(string filename)
{
    StringBuilder stringBuilder = new StringBuilder();
    FileStream fileStream = new FileStream(filename, FileMode.Open, FileAccess.Read);

    // Disposing the StreamReader also closes the underlying FileStream.
    using (StreamReader streamReader = new StreamReader(fileStream))
    {
        char[] fileContents = new char[_bufferSize];
        int charsRead = streamReader.Read(fileContents, 0, _bufferSize);

        // Can't do much with 0 bytes
        if (charsRead == 0)
            throw new Exception("File is 0 bytes");

        while (charsRead > 0)
        {
            // Append only the characters actually read; appending the whole
            // buffer would repeat stale data after the last, shorter chunk.
            stringBuilder.Append(fileContents, 0, charsRead);
            charsRead = streamReader.Read(fileContents, 0, _bufferSize);
        }
    }
}

Original Article

User · Answer

Use a background worker and read only a limited number of lines. Read more only when the user scrolls.

And try to never use ReadToEnd(). It's one of those functions that makes you think "why did they make it?". It's a script kiddies' helper that works fine for small things, but as you see, it sucks for large files.

Those guys telling you to use StringBuilder need to read the MSDN more often:

Performance Considerations

"The Concat and AppendFormat methods both concatenate new data to an existing String or StringBuilder object. A String object concatenation operation always creates a new object from the existing string and the new data. A StringBuilder object maintains a buffer to accommodate the concatenation of new data. New data is appended to the end of the buffer if room is available; otherwise, a new, larger buffer is allocated, data from the original buffer is copied to the new buffer, then the new data is appended to the new buffer. The performance of a concatenation operation for a String or StringBuilder object depends on how often a memory allocation occurs.

A String concatenation operation always allocates memory, whereas a StringBuilder concatenation operation only allocates memory if the StringBuilder object buffer is too small to accommodate the new data. Consequently, the String class is preferable for a concatenation operation if a fixed number of String objects are concatenated. In that case, the individual concatenation operations might even be combined into a single operation by the compiler. A StringBuilder object is preferable for a concatenation operation if an arbitrary number of strings are concatenated; for example, if a loop concatenates a random number of strings of user input."

That means a huge allocation of memory, which turns into heavy use of the swap file, which makes sections of your hard disk drive act like RAM, and a hard disk drive is very slow.

The StringBuilder option looks fine for someone who uses the system as a single user, but when you have two or more users reading large files at the same time, you have a problem.
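The "read only a limited number of lines, and more when the user scrolls" idea could look roughly like the sketch below. LinePager and ReadPage are hypothetical names, and the approach is deliberately naive: File.ReadLines is lazy, so nothing past the requested page is read, but the skipped lines are still scanned on every call.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LinePager
{
    // Hypothetical helper (not from the original answer): returns up to
    // pageSize lines starting at the zero-based line index firstLine.
    public static List<string> ReadPage(string path, int firstLine, int pageSize)
    {
        // File.ReadLines streams the file lazily; Take stops the enumeration
        // as soon as the page is full, and ToList closes the file afterwards.
        return File.ReadLines(path)
                   .Skip(firstLine)
                   .Take(pageSize)
                   .ToList();
    }
}
```

A scroll handler would call ReadPage with the first visible line and the number of visible rows. For very large files, caching the byte offset of every Nth line would avoid rescanning from the start, at the cost of more bookkeeping.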

User · Answer

My file is over 13 GB.

The link below contains code that reads a piece of the file easily:

Read a large text file

More information

User · Answer

All excellent answers! However, for someone looking for an answer, these appear to be somewhat incomplete.

As a standard String can only be of size X (2 GB to 4 GB depending on your configuration), these answers do not really fulfil the OP's question. One method is to work with a List of strings:

List<string> Words = new List<string>();

using (StreamReader sr = new StreamReader(@"C:\Temp\file.txt"))
{
    string line = string.Empty;
    while ((line = sr.ReadLine()) != null)
    {
        Words.Add(line);
    }
}

Some may want to tokenise and split the line when processing. The string list can now contain very large volumes of text.

User · Answer

You might be better off using memory-mapped file handling here. Memory-mapped file support will be around in .NET 4 (I think; I heard that through someone else talking about it), hence this wrapper, which uses P/Invoke to do the same job.

Edit: See here on the MSDN for how it works; here's the blog entry indicating how it is done in the upcoming .NET 4 when it comes out as a release. The link I gave earlier is a wrapper around the P/Invoke calls to achieve this. You can map the entire file into memory, and view it like a sliding window when scrolling through the file.
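With the System.IO.MemoryMappedFiles namespace that shipped in .NET 4, the sliding-window idea might be sketched as follows. MappedWindow and ReadWindow are hypothetical names, and the sketch assumes UTF-8 text; a window that starts or ends in the middle of a multi-byte character will decode with replacement characters at its edges.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

public static class MappedWindow
{
    // Hypothetical helper (not from the original answer): maps the file and
    // decodes up to `count` bytes starting at byte `offset` as UTF-8 text.
    public static string ReadWindow(string path, long offset, int count)
    {
        long fileLength = new FileInfo(path).Length;
        if (offset < 0 || offset >= fileLength) return string.Empty;

        // Clamp the window so it never extends past the end of the file.
        int size = (int)Math.Min(count, fileLength - offset);

        // Capacity 0 means "use the current size of the file on disk".
        using (var mmf = MemoryMappedFile.CreateFromFile(
                   path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var view = mmf.CreateViewStream(offset, size, MemoryMappedFileAccess.Read))
        {
            var bytes = new byte[size];
            int read = 0;
            while (read < size)
            {
                int n = view.Read(bytes, read, size - read);
                if (n == 0) break;
                read += n;
            }
            return Encoding.UTF8.GetString(bytes, 0, read);
        }
    }
}
```

A viewer would call ReadWindow again with a new offset as the user scrolls. Creating the MemoryMappedFile once and reusing it across calls would be the obvious next refinement.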

User · Answer

For binary files, the fastest way of reading them I have found is this:

MemoryMappedFile mmf = MemoryMappedFile.CreateFromFile(file);
MemoryMappedViewStream mms = mmf.CreateViewStream();
using (BinaryReader b = new BinaryReader(mms))
{
}

In my tests it's hundreds of times faster.

User · Answer

You say you have been asked to show a progress bar while a large file is loading. Is that because the users genuinely want to see the exact % of file loading, or just because they want visual feedback that something is happening?

If the latter is true, then the solution becomes much simpler. Just do reader.ReadToEnd() on a background thread, and display a marquee-type progress bar instead of a proper one.

I raise this point because in my experience this is often the case. When you are writing a data processing program, users will definitely be interested in a % complete figure, but for simple-but-slow UI updates, they are more likely to just want to know that the computer hasn't crashed. :-)
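A minimal sketch of that simpler approach, kept free of any UI framework, is below. BackgroundLoader, LoadAsync and the onBusyChanged callback are hypothetical names; in a WinForms application the callback is where a progress bar could be switched to ProgressBarStyle.Marquee and shown or hidden.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class BackgroundLoader
{
    // Hypothetical helper (not from the original answer): runs ReadToEnd on a
    // thread-pool thread. onBusyChanged(true) is raised before loading starts
    // and onBusyChanged(false) once it has finished or failed.
    public static async Task<string> LoadAsync(string path, Action<bool> onBusyChanged)
    {
        onBusyChanged(true);
        try
        {
            return await Task.Run(() =>
            {
                using (var reader = new StreamReader(path))
                {
                    return reader.ReadToEnd();
                }
            });
        }
        finally
        {
            // Runs on success and on failure, so the indicator never sticks.
            onBusyChanged(false);
        }
    }
}
```

When awaited from a UI event handler, the code after the await resumes on the UI thread, so the callback can touch controls directly. The trade-off is that there is no percentage and no way to cancel part-way through.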

User · Answer

You can improve read speed by using a BufferedStream, like this:

using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (BufferedStream bs = new BufferedStream(fs))
using (StreamReader sr = new StreamReader(bs))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
    }
}

March 2013 UPDATE

I recently wrote code for reading and processing (searching for text in) 1 GB-ish text files (much larger than the files involved here) and achieved a significant performance gain by using a producer/consumer pattern. The producer task read in lines of text using the BufferedStream and handed them off to a separate consumer task that did the searching.

I used this as an opportunity to learn TPL Dataflow, which is very well suited for quickly coding this pattern.

Why BufferedStream is faster

"A buffer is a block of bytes in memory used to cache data, thereby reducing the number of calls to the operating system. Buffers improve read and write performance. A buffer can be used for either reading or writing, but never both simultaneously. The Read and Write methods of BufferedStream automatically maintain the buffer."

December 2014 UPDATE: Your Mileage May Vary

Based on the comments, FileStream should be using a BufferedStream internally. At the time this answer was first provided, I measured a significant performance boost by adding a BufferedStream. At the time I was targeting .NET 3.x on a 32-bit platform. Today, targeting .NET 4.5 on a 64-bit platform, I do not see any improvement.

Related

I came across a case where streaming a large, generated CSV file to the Response stream from an ASP.NET MVC action was very slow. Adding a BufferedStream improved performance by 100x in this instance. For more, see Unbuffered Output Very Slow.
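The producer/consumer arrangement described in the March 2013 update can be sketched without TPL Dataflow. The version below swaps in BlockingCollection from the base class library, so it is not the author's implementation. LineSearch and CountMatches are hypothetical names, and the bounded capacity of 10,000 lines is an arbitrary choice that keeps the reader from racing far ahead of the searcher.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public static class LineSearch
{
    // Hypothetical helper (not from the original answer): one task reads lines
    // while another searches them, and the method returns how many lines
    // contain `needle`.
    public static int CountMatches(string path, string needle)
    {
        using (var queue = new BlockingCollection<string>(boundedCapacity: 10000))
        {
            // Producer: reads lines lazily and blocks when the queue is full.
            Task producer = Task.Run(() =>
            {
                try
                {
                    foreach (string line in File.ReadLines(path))
                        queue.Add(line);
                }
                finally
                {
                    // Always signal completion so the consumer cannot hang.
                    queue.CompleteAdding();
                }
            });

            // Consumer: drains the queue until the producer signals completion.
            Task<int> consumer = Task.Run(() =>
            {
                int matches = 0;
                foreach (string line in queue.GetConsumingEnumerable())
                {
                    if (line.IndexOf(needle, StringComparison.Ordinal) >= 0)
                        matches++;
                }
                return matches;
            });

            // Wait for both tasks before the queue is disposed.
            Task.WaitAll(producer, consumer);
            return consumer.Result;
        }
    }
}
```

Whether this beats a plain ReadLine loop depends on how expensive the per-line work is; for a cheap substring test the queue overhead may well dominate.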

User · Answer

Whilst the most upvoted answer is correct, it makes no use of multi-core processing. In my case, having 12 cores, I use Parallel.ForEach from the Task Parallel Library:

Parallel.ForEach(
    File.ReadLines(filename), // returns IEnumerable<string>: lazy-loading
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    (line, state, index) =>
    {
        // process line value
    }
);

Worth mentioning: I got this as an interview question asking to return the top 10 most frequent occurrences:

var result = new ConcurrentDictionary<string, int>(StringComparer.InvariantCultureIgnoreCase);
Parallel.ForEach(
    File.ReadLines(filename),
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    (line, state, index) =>
    {
        result.AddOrUpdate(line, 1, (key, val) => val + 1);
    }
);

return result
    .OrderByDescending(x => x.Value)
    .Take(10)
    .Select(x => x.Value); // the counts; use x.Key for the lines themselves

Benchmarking:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
  [Host]     : .NET Framework 4.8 (4.8.4250.0), X64 RyuJIT
  DefaultJob : .NET Framework 4.8 (4.8.4250.0), X64 RyuJIT

Method               Mean      Error     StdDev    Gen 0     Gen 1    Gen 2   Allocated
GetTopWordsSync      33.03 s   0.175 s   0.155 s   1194000   314000   7000    7.06 GB
GetTopWordsParallel  10.89 s   0.121 s   0.113 s   1225000   354000   8000    7.18 GB

As you can see, the parallel version is about 3x faster (roughly 67% less time).

User · Answer

This should be enough to get you started.

class Program
{
    static void Main(String[] args)
    {
        const int bufferSize = 1024;

        var sb = new StringBuilder();
        var buffer = new Char[bufferSize];
        var length = 0L;
        var totalRead = 0L;
        var count = bufferSize;

        using (var sr = new StreamReader(@"C:\Temp\file.txt"))
        {
            length = sr.BaseStream.Length;
            while (count > 0)
            {
                count = sr.Read(buffer, 0, bufferSize);
                sb.Append(buffer, 0, count);
                totalRead += count;
            }
        }

        Console.ReadKey();
    }
}
