Effective way to find any file's Encoding

Question

Yes, this is a frequently asked question, but the matter is still vague to me since I don't know much about it. I would like a very precise way to find a file's encoding, as precise as Notepad++ is.

User · Answer

.NET is not very helpful here, but you can try the following algorithm:

1. Try to find the encoding from the BOM (byte order mark). It is very likely not to be found.
2. Try parsing the file with different encodings.

Here is the call:

    var encoding = FileHelper.GetEncoding(filePath);
    if (encoding == null)
        throw new Exception("The file encoding is not supported. Please choose one of the following encodings: UTF8/UTF7/iso-8859-1");

Here is the code:

    using System;
    using System.IO;
    using System.Text;

    public class FileHelper
    {
        /// <summary>
        /// Determines a text file's encoding by analyzing its byte order mark (BOM)
        /// and, if none is found, by trying to parse the file with different encodings.
        /// Returns null when no encoding can be determined.
        /// </summary>
        /// <param name="filename">The text file to analyze.</param>
        /// <returns>The detected encoding or null.</returns>
        public static Encoding GetEncoding(string filename)
        {
            var encodingByBOM = GetEncodingByBOM(filename);
            if (encodingByBOM != null)
                return encodingByBOM;

            // BOM not found :(, so try to parse characters into several encodings
            var encodingByParsingUTF8 = GetEncodingByParsing(filename, Encoding.UTF8);
            if (encodingByParsingUTF8 != null)
                return encodingByParsingUTF8;

            // Note: every byte sequence is valid ISO-8859-1, so this check
            // succeeds for any readable file and the UTF-7 step is rarely reached.
            var encodingByParsingLatin1 = GetEncodingByParsing(filename, Encoding.GetEncoding("iso-8859-1"));
            if (encodingByParsingLatin1 != null)
                return encodingByParsingLatin1;

            var encodingByParsingUTF7 = GetEncodingByParsing(filename, Encoding.UTF7);
            if (encodingByParsingUTF7 != null)
                return encodingByParsingUTF7;

            return null;   // no encoding found
        }

        /// <summary>
        /// Determines a text file's encoding by analyzing its byte order mark (BOM).
        /// </summary>
        /// <param name="filename">The text file to analyze.</param>
        /// <returns>The detected encoding, or null if there is no BOM.</returns>
        private static Encoding GetEncodingByBOM(string filename)
        {
            // Read the BOM
            var byteOrderMark = new byte[4];
            using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
            {
                file.Read(byteOrderMark, 0, 4);
            }

            // Analyze the BOM
            if (byteOrderMark[0] == 0x2b && byteOrderMark[1] == 0x2f && byteOrderMark[2] == 0x76) return Encoding.UTF7;
            if (byteOrderMark[0] == 0xef && byteOrderMark[1] == 0xbb && byteOrderMark[2] == 0xbf) return Encoding.UTF8;
            // UTF-32LE must be tested before UTF-16LE: its BOM starts with FF FE as well
            if (byteOrderMark[0] == 0xff && byteOrderMark[1] == 0xfe && byteOrderMark[2] == 0 && byteOrderMark[3] == 0) return Encoding.UTF32; // UTF-32LE
            if (byteOrderMark[0] == 0xff && byteOrderMark[1] == 0xfe) return Encoding.Unicode; // UTF-16LE
            if (byteOrderMark[0] == 0xfe && byteOrderMark[1] == 0xff) return Encoding.BigEndianUnicode; // UTF-16BE
            // Encoding.UTF32 is little-endian, so the big-endian BOM needs an explicit UTF32Encoding
            if (byteOrderMark[0] == 0 && byteOrderMark[1] == 0 && byteOrderMark[2] == 0xfe && byteOrderMark[3] == 0xff) return new UTF32Encoding(true, true); // UTF-32BE

            return null;    // no BOM found
        }

        private static Encoding GetEncodingByParsing(string filename, Encoding encoding)
        {
            // Same encoding, but configured to throw on invalid bytes instead of replacing them
            var encodingVerifier = Encoding.GetEncoding(encoding.BodyName, new EncoderExceptionFallback(), new DecoderExceptionFallback());

            try
            {
                using (var textReader = new StreamReader(filename, encodingVerifier, detectEncodingFromByteOrderMarks: true))
                {
                    while (!textReader.EndOfStream)
                    {
                        textReader.ReadLine();   // in order to increment the stream position
                    }

                    // all text parsed ok
                    return textReader.CurrentEncoding;
                }
            }
            catch (Exception)
            {
                // decoding failed: the file is not valid in this encoding
            }

            return null;
        }
    }

User · Answer

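It may be useful:

    string path = @"address/to/the/file.extension";

    using (StreamReader sr = new StreamReader(path))
    {
        // CurrentEncoding is only updated after the first read, so peek first
        sr.Peek();
        Console.WriteLine(sr.CurrentEncoding);
    }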

User · Answer

The following is my PowerShell code to determine whether some cpp, h or ml files are encoded in ISO-8859-1 (Latin-1) or UTF-8 without BOM; if neither, it supposes them to be GB18030. I am Chinese and work in France. MSVC saves as Latin-1 on a French computer and as GB on a Chinese computer, so this helps me avoid encoding problems when exchanging source files between my system and my colleagues'.

The approach is simple: if all characters are in x00-x7E (ASCII), UTF-8 and Latin-1 are identical. But if I read a non-ASCII, non-UTF-8 file as UTF-8, the replacement character (U+FFFD) shows up, so I then try to read it as Latin-1. In Latin-1, the range \x7F-\xAF is essentially unused in source text (\x7F-\x9F are control characters and \xA0-\xAF are rarely typed symbols), while GB uses the full range x00-xFF. So if I find any character in that range, the file is not Latin-1.

The code is written in PowerShell but uses .NET, so it is easy to translate into C# or F#.

    $Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding($False)
    foreach ($i in Get-ChildItem .\ -Recurse -Include *.cpp,*.h,*.ml) {
        # Read as UTF-8 and count replacement characters (U+FFFD)
        $openUTF = New-Object System.IO.StreamReader -ArgumentList ($i.FullName, [Text.Encoding]::UTF8)
        $contentUTF = $openUTF.ReadToEnd()
        [regex]$regex = '\uFFFD'
        $c = $regex.Matches($contentUTF).Count
        $openUTF.Close()
        if ($c -ne 0) {
            # Not valid UTF-8: read as Latin-1 and look for the unused range
            $openLatin1 = New-Object System.IO.StreamReader -ArgumentList ($i.FullName, [Text.Encoding]::GetEncoding('ISO-8859-1'))
            $contentLatin1 = $openLatin1.ReadToEnd()
            $openLatin1.Close()
            [regex]$regex = '[\x7F-\xAF]'
            $c = $regex.Matches($contentLatin1).Count
            if ($c -eq 0) {
                # Latin-1: rewrite as UTF-8 without BOM
                [System.IO.File]::WriteAllLines($i.FullName, $contentLatin1, $Utf8NoBomEncoding)
                $i.FullName
            }
            else {
                # Assume GB18030: rewrite as UTF-8 without BOM
                $openGB = New-Object System.IO.StreamReader -ArgumentList ($i.FullName, [Text.Encoding]::GetEncoding('GB18030'))
                $contentGB = $openGB.ReadToEnd()
                $openGB.Close()
                [System.IO.File]::WriteAllLines($i.FullName, $contentGB, $Utf8NoBomEncoding)
                $i.FullName
            }
        }
    }
    Write-Host -NoNewLine 'Press any key to continue...'
    $null = $Host.UI.RawUI.ReadKey('NoEcho,IncludeKeyDown')
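Since the answer mentions that the script translates easily to C#, here is a minimal sketch of just the classification step, working on a byte array instead of files. The class and method names (`SourceEncodingGuesser`, `Guess`) are made up for illustration and are not part of the original script. It also checks the raw bytes directly rather than running a regex on decoded text, which should be equivalent because Latin-1 maps every byte to the code point with the same value.

```csharp
using System;
using System.Text;

public static class SourceEncodingGuesser
{
    // Strict UTF-8 decoder: throws on invalid bytes instead of
    // silently substituting the replacement character U+FFFD.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Returns "utf-8", "iso-8859-1" or "gb18030" following the heuristic above.
    // This is a guess, not a proof: other encodings are never considered.
    public static string Guess(byte[] content)
    {
        try
        {
            StrictUtf8.GetString(content);   // pure ASCII also passes here
            return "utf-8";
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8: fall through to the Latin-1 / GB test.
        }

        // Any byte in 0x7F-0xAF is taken as evidence against Latin-1.
        foreach (byte b in content)
        {
            if (b >= 0x7F && b <= 0xAF)
                return "gb18030";
        }
        return "iso-8859-1";
    }
}
```

For example, `SourceEncodingGuesser.Guess(File.ReadAllBytes(path))` could replace the three `StreamReader` passes in the script. Note that the heuristic can misclassify a GB file whose bytes all happen to fall outside 0x7F-0xAF.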

User · Answer

Check this: UDE. This is a port of Mozilla Universal Charset Detector, and you can use it like this:

    public static void Main(String[] args)
    {
        string filename = args[0];
        using (FileStream fs = File.OpenRead(filename))
        {
            Ude.CharsetDetector cdet = new Ude.CharsetDetector();
            cdet.Feed(fs);
            cdet.DataEnd();
            if (cdet.Charset != null)
            {
                Console.WriteLine("Charset: {0}, confidence: {1}",
                    cdet.Charset, cdet.Confidence);
            }
            else
            {
                Console.WriteLine("Detection failed.");
            }
        }
    }

User · Answer

Look here for C#: https://msdn.microsoft.com/en-us/library/system.io.streamreader.currentencoding%28v=vs.110%29.aspx

    string path = @"path\to\your\file.ext";

    using (StreamReader sr = new StreamReader(path, true))
    {
        while (sr.Peek() >= 0)
        {
            Console.Write((char)sr.Read());
        }

        // Test for the encoding after reading, or at least
        // after the first read.
        Console.WriteLine("The encoding used was {0}.", sr.CurrentEncoding);
        Console.ReadLine();
        Console.WriteLine();
    }

User · Answer

The StreamReader.CurrentEncoding property rarely returns the correct text file encoding for me. I've had greater success determining a file's endianness by analyzing its byte order mark (BOM). If the file does not have a BOM, this cannot determine the file's encoding.

*UPDATED 4/08/2020 to include UTF-32LE detection and return correct encoding for UTF-32BE*

    /// <summary>
    /// Determines a text file's encoding by analyzing its byte order mark (BOM).
    /// Defaults to ASCII when detection of the text file's endianness fails.
    /// </summary>
    /// <param name="filename">The text file to analyze.</param>
    /// <returns>The detected encoding.</returns>
    public static Encoding GetEncoding(string filename)
    {
        // Read the BOM
        var bom = new byte[4];
        using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
        {
            file.Read(bom, 0, 4);
        }

        // Analyze the BOM
        if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
        if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
        if (bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; // UTF-32LE
        if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; // UTF-16LE
        if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; // UTF-16BE
        if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true); // UTF-32BE

        // We actually have no idea what the encoding is if we reach this point, so
        // you may wish to return null instead of defaulting to ASCII
        return Encoding.ASCII;
    }
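As a variation on the same idea, the BOM bytes do not have to be hard-coded: each .NET encoding can report its own BOM through `Encoding.GetPreamble()`. The following is only a sketch under that assumption. The names `BomSniffer` and `DetectFromBom` are hypothetical, the method works on a byte buffer rather than a file name, and it leaves out UTF-7 (which has no preamble in .NET).

```csharp
using System;
using System.Linq;
using System.Text;

public static class BomSniffer
{
    // Candidates are ordered so that longer BOMs are tested first:
    // the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
    private static readonly Encoding[] Candidates =
    {
        new UTF32Encoding(bigEndian: true, byteOrderMark: true),    // 00 00 FE FF
        new UTF32Encoding(bigEndian: false, byteOrderMark: true),   // FF FE 00 00
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: true),    // EF BB BF
        new UnicodeEncoding(bigEndian: true, byteOrderMark: true),  // FE FF
        new UnicodeEncoding(bigEndian: false, byteOrderMark: true)  // FF FE
    };

    // Returns the first candidate whose preamble matches the start of
    // 'head', or null when no known BOM is present.
    public static Encoding DetectFromBom(byte[] head)
    {
        foreach (var candidate in Candidates)
        {
            byte[] bom = candidate.GetPreamble();
            if (head.Length >= bom.Length && head.Take(bom.Length).SequenceEqual(bom))
                return candidate;
        }
        return null;
    }
}
```

Passing the first four bytes of a file is enough. Returning null rather than ASCII keeps "no BOM" distinguishable from a real detection, as the comment in the answer's code suggests.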

User · Answer

I'd try the following steps:

1) Check if there is a Byte Order Mark
2) Check if the file is valid UTF8
3) Use the local "ANSI" codepage (ANSI as Microsoft defines it)

Step 2 works because most non-ASCII sequences in codepages other than UTF8 are not valid UTF8.
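To illustrate why step 2 works, here is a small sketch of a strict UTF-8 validity check. The helper name `Utf8Check.IsValidUtf8` is made up for this example. It relies on the `UTF8Encoding` constructor flag `throwOnInvalidBytes`, which makes decoding fail instead of substituting replacement characters.

```csharp
using System;
using System.Text;

public static class Utf8Check
{
    // Returns true when 'bytes' decodes as UTF-8 without any invalid sequence.
    public static bool IsValidUtf8(byte[] bytes)
    {
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                      throwOnInvalidBytes: true);
        try
        {
            strict.GetCharCount(bytes);   // walks the whole buffer, throws on bad input
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

Take the word "café": in Latin-1 or Windows-1252 it is stored as `63 61 66 E9`, and a lone `E9` is the start of a three-byte UTF-8 sequence with no continuation bytes, so the check fails. Stored as UTF-8 it is `63 61 66 C3 A9`, which passes. Pure ASCII passes too, which is harmless because ASCII reads the same in UTF-8 and in the usual ANSI code pages.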

User · Answer

Providing the implementation details for the steps proposed by @CodesInChaos:

1) Check if there is a Byte Order Mark
2) Check if the file is valid UTF8
3) Use the local "ANSI" codepage (ANSI as Microsoft defines it)

Step 2 works because most non-ASCII sequences in codepages other than UTF8 are not valid UTF8. https://stackoverflow.com/a/4522251/867248 explains the tactic in more detail.

    using System;
    using System.IO;
    using System.Text;

    // Using encoding from BOM or UTF8 if no BOM found,
    // check if the file is valid, by reading all lines.
    // If decoding fails, use the local "ANSI" codepage.
    public string DetectFileEncoding(Stream fileStream)
    {
        var Utf8EncodingVerifier = Encoding.GetEncoding("utf-8", new EncoderExceptionFallback(), new DecoderExceptionFallback());
        using (var reader = new StreamReader(fileStream, Utf8EncodingVerifier,
               detectEncodingFromByteOrderMarks: true, leaveOpen: true, bufferSize: 1024))
        {
            string detectedEncoding;
            try
            {
                while (!reader.EndOfStream)
                {
                    var line = reader.ReadLine();
                }
                detectedEncoding = reader.CurrentEncoding.BodyName;
            }
            catch (Exception e)
            {
                // Failed to decode the file using the BOM/UTF8.
                // Assume it's local ANSI
                detectedEncoding = "ISO-8859-1";
            }
            // Rewind the stream
            fileStream.Seek(0, SeekOrigin.Begin);
            return detectedEncoding;
        }
    }

    [Test]
    public void Test1()
    {
        Stream fs = File.OpenRead(@".\TestData\TextFile_ansi.csv");
        var detectedEncoding = DetectFileEncoding(fs);

        using (var reader = new StreamReader(fs, Encoding.GetEncoding(detectedEncoding)))
        {
            // Consume your file
            var line = reader.ReadLine();
            ...

User · Answer

The following code works fine for me, using the StreamReader class:

    using (var reader = new StreamReader(fileName, defaultEncodingIfNoBom, true))
    {
        reader.Peek(); // you need this!
        var encoding = reader.CurrentEncoding;
    }

The trick is to use the Peek call; otherwise .NET has not done anything yet (it hasn't read the preamble, the BOM). Of course, if you use any other ReadXXX call before checking the encoding, it works too.

If the file has no BOM, then the defaultEncodingIfNoBom encoding will be used. There is also a StreamReader constructor overload without this parameter (in that case, a built-in default encoding will be used as defaultEncodingIfNoBom), but I recommend defining what you consider the default encoding in your context.

I have tested this successfully with files with a BOM for UTF8, UTF16/Unicode (LE & BE) and UTF32 (LE & BE). It does not work for UTF7.
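The effect of the Peek call can be seen with a small in-memory experiment. This is only a sketch: `PeekDemo.Probe` is a hypothetical helper, and it reads from a `MemoryStream` rather than a file so that it can be run without any test data.

```csharp
using System;
using System.IO;
using System.Text;

public static class PeekDemo
{
    // Returns the reader's CurrentEncoding before and after the first Peek().
    public static (Encoding Before, Encoding After) Probe(byte[] content, Encoding defaultIfNoBom)
    {
        using (var stream = new MemoryStream(content))
        using (var reader = new StreamReader(stream, defaultIfNoBom,
                                             detectEncodingFromByteOrderMarks: true))
        {
            Encoding before = reader.CurrentEncoding;  // nothing has been read yet
            reader.Peek();                             // forces the first buffer fill
            Encoding after = reader.CurrentEncoding;   // now reflects the BOM, if any
            return (before, after);
        }
    }
}
```

Calling `PeekDemo.Probe(new byte[] { 0xFF, 0xFE, 0x41, 0x00 }, Encoding.ASCII)` should report the ASCII default before the peek and UTF-16LE afterwards. With content that has no BOM, both values stay at the default passed in.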
