[c#] How can I detect the encoding/codepage of a text file

In our application, we receive text files (.txt, .csv, etc.) from diverse sources. When read, these files sometimes contain garbage, because they were created in a different/unknown codepage.

Is there a way to (automatically) detect the codepage of a text file?

The detectEncodingFromByteOrderMarks parameter of the StreamReader constructor works for UTF-8 and other Unicode files that carry a byte order mark, but I'm looking for a way to detect code pages like ibm850 or windows1252.


Thanks for your answers, this is what I've done.

The files we receive are from end-users, who do not have a clue about codepages. The receivers are also end-users, and by now this is what they know about codepages: codepages exist, and are annoying.

Solution:

  • Open the received file in Notepad, look at a garbled piece of text. If somebody is called Fran├žois or something, with your human intelligence you can guess this.
  • I've created a small app that the user can open the file with, and enter a piece of text that the user knows will appear in the file when the correct codepage is used.
  • Loop through all codepages, and display the ones that produce the user-provided text.
  • If more than one codepage pops up, ask the user to specify more text.
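
The loop over all codepages described above could be sketched roughly as follows. This is only a minimal illustration, not the actual app: the names CodepageFinder and FindCandidates are made up for this example, and it assumes the whole file fits in memory as a byte array.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CodepageFinder
{
    // Hypothetical helper: returns the names of all encodings under which
    // 'data' decodes to text that contains 'expected'.
    public static List<string> FindCandidates(byte[] data, string expected)
    {
        var matches = new List<string>();
        foreach (EncodingInfo info in Encoding.GetEncodings())
        {
            try
            {
                // Invalid bytes are replaced rather than thrown on by default,
                // so a wrong codepage simply yields text without the expected string.
                string decoded = info.GetEncoding().GetString(data);
                if (decoded.Contains(expected))
                    matches.Add(info.Name);
            }
            catch (Exception ex) when (ex is ArgumentException || ex is NotSupportedException)
            {
                // Skip encodings this platform cannot instantiate.
            }
        }
        return matches;
    }
}
```

On .NET Core and later, legacy codepages such as ibm850 are probably only listed after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) from the System.Text.Encoding.CodePages package; on .NET Framework they are available by default.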


You can't detect the codepage, you need to be told it. You can analyse the bytes and guess it, but that can give some bizarre (sometimes amusing) results. I can't find it now, but I'm sure Notepad can be tricked into displaying English text in Chinese.

Anyway, this is what you need to read: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Specifically Joel says:

The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII. There Ain't No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.


If you're looking to detect non-UTF encodings (i.e. no BOM), you're basically down to heuristics and statistical analysis of the text. You might want to take a look at the Mozilla paper on universal charset detection (same link, with better formatting via Wayback Machine).


Have you tried the C# port of the Mozilla Universal Charset Detector?

Example from http://code.google.com/p/ude/

public static void Main(String[] args)
{
    string filename = args[0];
    using (FileStream fs = File.OpenRead(filename)) {
        Ude.CharsetDetector cdet = new Ude.CharsetDetector();
        cdet.Feed(fs);
        cdet.DataEnd();
        if (cdet.Charset != null) {
            Console.WriteLine("Charset: {0}, confidence: {1}", 
                 cdet.Charset, cdet.Confidence);
        } else {
            Console.WriteLine("Detection failed.");
        }
    }
}    

You can't detect the codepage

This is clearly false. Every web browser has some kind of universal charset detector to deal with pages which have no indication whatsoever of an encoding. Firefox has one. You can download the code and see how it does it. See some documentation here. Basically, it is a heuristic, but one that works really well.

Given a reasonable amount of text, it is even possible to detect the language.

Here's another one I just found using Google:


I know it's very late for this question and this solution won't appeal to some (because of its English-centric bias and its lack of statistical/empirical testing), but it has worked very well for me, especially for processing uploaded CSV data:

http://www.architectshack.com/TextFileEncodingDetector.ashx

Advantages:

  • BOM detection built-in
  • Default/fallback encoding customizable
  • Pretty reliable (in my experience) for Western-European-based files containing some exotic data (e.g. French names) with a mixture of UTF-8 and Latin-1-style files; basically the bulk of US and Western European environments.

Note: I'm the one who wrote this class, so obviously take it with a grain of salt! :)


If someone is looking for a 93.9% solution, this works for me:

public static class StreamExtension
{
    /// <summary>
    /// Convert the content to a string.
    /// </summary>
    /// <param name="stream">The stream.</param>
    /// <returns></returns>
    public static string ReadAsString(this Stream stream)
    {
        var startPosition = stream.Position;
        try
        {
            // 1. Check for a BOM
            // 2. or try with UTF-8. The most (86.3%) used encoding. Visit: http://w3techs.com/technologies/overview/character_encoding/all/
            var streamReader = new StreamReader(stream, new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true), detectEncodingFromByteOrderMarks: true);
            return streamReader.ReadToEnd();
        }
        catch (DecoderFallbackException)
        {
            stream.Position = startPosition;

            // 3. The second most (6.7%) used encoding is ISO-8859-1. So use Windows-1252 (0.9%, also known as ANSI), which is a superset of ISO-8859-1.
            var streamReader = new StreamReader(stream, Encoding.GetEncoding(1252));
            return streamReader.ReadToEnd();
        }
    }
}

Notepad++ has this feature out-of-the-box. It also supports changing it.


Looking for a different solution, I found that

https://code.google.com/p/ude/

is kinda heavy.

I needed some basic encoding detection, based on the first 4 bytes and possibly XML charset detection, so I took some sample source code from the internet and added a slightly modified version of

http://lists.w3.org/Archives/Public/www-validator/2002Aug/0084.html

written for Java.

    public static Encoding DetectEncoding(byte[] fileContent)
    {
        if (fileContent == null)
            throw new ArgumentNullException(nameof(fileContent));

        if (fileContent.Length < 2)
            return Encoding.ASCII;      // Default fallback

        if (fileContent[0] == 0xff
            && fileContent[1] == 0xfe
            && (fileContent.Length < 4
                || fileContent[2] != 0
                || fileContent[3] != 0
                )
            )
            return Encoding.Unicode;

        if (fileContent[0] == 0xfe
            && fileContent[1] == 0xff
            )
            return Encoding.BigEndianUnicode;

        if (fileContent.Length < 3)
            return Encoding.ASCII;      // Too short for any remaining BOM: default fallback

        if (fileContent[0] == 0xef && fileContent[1] == 0xbb && fileContent[2] == 0xbf)
            return Encoding.UTF8;

        if (fileContent[0] == 0x2b && fileContent[1] == 0x2f && fileContent[2] == 0x76)
            return Encoding.UTF7;

        if (fileContent.Length < 4)
            return Encoding.ASCII;      // Too short for a UTF-32 BOM: default fallback

        if (fileContent[0] == 0xff && fileContent[1] == 0xfe && fileContent[2] == 0 && fileContent[3] == 0)
            return Encoding.UTF32;

        if (fileContent[0] == 0 && fileContent[1] == 0 && fileContent[2] == 0xfe && fileContent[3] == 0xff)
            return Encoding.GetEncoding(12001);

        String probe;
        int len = fileContent.Length;

        if( fileContent.Length >= 128 ) len = 128;
        probe = Encoding.ASCII.GetString(fileContent, 0, len);

        MatchCollection mc = Regex.Matches(probe, "^<\\?xml[^<>]*encoding[ \\t\\n\\r]?=[\\t\\n\\r]?['\"]([A-Za-z]([A-Za-z0-9._]|-)*)", RegexOptions.Singleline);
        // Add '[0].Groups[1].Value' to the end to test regex

        if( mc.Count == 1 && mc[0].Groups.Count >= 2 )
        {
            // Typically picks up 'UTF-8' string
            Encoding enc = null;

            try {
                enc = Encoding.GetEncoding( mc[0].Groups[1].Value );
            } catch (ArgumentException) { /* Unknown encoding name: fall through to the default */ }

            if( enc != null )
                return enc;
        }

        return Encoding.ASCII;      // Default fallback
    }

It's probably enough to read the first 1024 bytes of the file, but I'm loading the whole file.


I've done something similar in Python. Basically, you need lots of sample data from various encodings, which are broken down by a sliding two-byte window and stored in a dictionary (hash), keyed on byte-pairs providing values of lists of encodings.

Given that dictionary (hash), you take your input text and:

  • if it starts with any BOM character ('\xfe\xff' for UTF-16-BE, '\xff\xfe' for UTF-16-LE, '\xef\xbb\xbf' for UTF-8 etc), I treat it as suggested
  • if not, then take a large enough sample of the text, take all byte-pairs of the sample and choose the encoding that is the least common suggested from the dictionary.

If you've also sampled UTF encoded texts that do not start with any BOM, the second step will cover those that slipped from the first step.

So far, it works for me (the sample data and subsequent input data are subtitles in various languages) with diminishing error rates.


The tool "uchardet" does this well using character frequency distribution models for each charset. Larger files and more "typical" files have more confidence (obviously).

On Ubuntu, you can just run apt-get install uchardet.

On other systems, get the source, usage & docs here: https://github.com/BYVoid/uchardet


The StreamReader class's constructor takes a 'detect encoding' parameter.


If you can link to a C library, you can use libenca. See http://cihar.com/software/enca/. From the man page:

Enca reads given text files, or standard input when none are given, and uses knowledge about their language (must be supported by you) and a mixture of parsing, statistical analysis, guessing and black magic to determine their encodings.

It's GPL v2.


Thanks @Erik Aronesty for mentioning uchardet.

Meanwhile the (same?) tool exists for Linux: chardet.
Or, on Cygwin, you may want to use: chardetect.

See: chardet man page: https://www.commandlinux.com/man-page/man1/chardetect.1.html

This will heuristically detect (guess) the character encoding for each given file and will report the name and confidence level for each file's detected character encoding.


I was actually looking for a generic, non-programmatic way of detecting the file encoding, but I haven't found one yet. What I did find by testing with different encodings was that my text was UTF-7.

So where I first was doing: StreamReader file = File.OpenText(fullfilename);

I had to change it to: StreamReader file = new StreamReader(fullfilename, System.Text.Encoding.UTF7);

OpenText assumes it's UTF-8.

You can also create the StreamReader like this: new StreamReader(fullfilename, true). The second parameter means that it should try to detect the encoding from the byte order mark of the file, but that didn't work in my case.
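
To see which encoding StreamReader actually settled on, something like the following minimal sketch can help. The names BomProbe and DetectByBom are made up for this example. It only recognises files that start with a byte order mark; anything else comes back as the fallback encoding that was passed in.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomProbe
{
    // Hypothetical helper: reports the encoding StreamReader picks for the stream.
    // Without a BOM, the result is simply 'fallback'.
    public static Encoding DetectByBom(Stream stream, Encoding fallback)
    {
        using (var reader = new StreamReader(stream, fallback, detectEncodingFromByteOrderMarks: true))
        {
            // CurrentEncoding is only updated once the reader has looked at the data.
            reader.Peek();
            return reader.CurrentEncoding;
        }
    }
}
```

Note that disposing the reader also closes the stream, so reopen the file before reading it with the detected encoding.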


I have the same problem but haven't found a good solution yet for detecting it automatically. For now I'm using PsPad (www.pspad.com) for that ;) Works fine.


As an add-on to ITmeze's post, I've used this function to convert the output of the C# port of the Mozilla Universal Charset Detector:

    private Encoding GetEncodingFromString(string codePageName)
    {
        try
        {
            return Encoding.GetEncoding(codePageName);
        }
        catch
        {
            return Encoding.ASCII;
        }
    }

MSDN


Since it basically comes down to heuristics, it may help to use the encoding of previously received files from the same source as a first hint.

Most people (or applications) do stuff in pretty much the same order every time, often on the same machine, so it's quite likely that when Bob creates a .csv file and sends it to Mary, it'll always be using Windows-1252 or whatever his machine defaults to.

Where possible a bit of customer training never hurts either :-)
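
As a rough sketch of the "previous files from the same source" hint: the class SourceEncodingHints and the idea of a string source id (for example the sender's e-mail address) are assumptions made for this illustration, and the hints are only kept in memory.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: remembers the last encoding that worked for each sender.
public class SourceEncodingHints
{
    private readonly Dictionary<string, Encoding> _lastKnown = new Dictionary<string, Encoding>();

    // Call this once a file from 'sourceId' has been read successfully.
    public void Remember(string sourceId, Encoding encoding) => _lastKnown[sourceId] = encoding;

    // Returns the remembered encoding, or 'fallback' for a sender not seen before.
    public Encoding HintFor(string sourceId, Encoding fallback) =>
        _lastKnown.TryGetValue(sourceId, out Encoding known) ? known : fallback;
}
```

The hint would be tried first, with the heuristic detection (or the user) as the next step if the decoded text still looks wrong.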


Open the file in AkelPad (or just copy/paste the garbled text), go to Edit -> Selection -> Recode... -> check "Autodetect".


I use this code to detect Unicode and the Windows default ANSI codepage when reading a file. For other encodings, a check of the content is necessary, manually or by programming. This can be used to save the text with the same encoding as when it was opened. (I use VB.NET)

'Works for Default and unicode (auto detect)
Dim mystreamreader As New StreamReader(LocalFileName, Encoding.Default) 
MyEditTextBox.Text = mystreamreader.ReadToEnd()
Debug.Print(mystreamreader.CurrentEncoding.CodePage) 'Autodetected encoding
mystreamreader.Close()

Ten years (!) have passed since this was asked, and I still see no mention of MS's good, non-GPL'ed solution: the IMultiLanguage2 API.

Most libraries already mentioned are based on Mozilla's UDE, and it seems reasonable that browsers have already tackled similar problems. I don't know what Chrome's solution is, but MS has shipped theirs since IE 5.0, and it is:

  1. Free of GPL-and-the-like licensing issues,
  2. Backed and maintained probably forever,
  3. Gives rich output - all valid candidates for encoding/codepages along with confidence scores,
  4. Surprisingly easy to use (it is a single function call).

It is a native COM call, but here's some very nice work by Carsten Zeumer, that handles the interop mess for .net usage. There are some others around, but by and large this library doesn't get the attention it deserves.

