Determine a string s encoding in C

Question

Is there any way to determine a string s encoding in C    Say  I have a filename string  but I don t know if it is encoded in Unicode UTF-16 or the system-default encoding  how do I find out

User · Answer

The SimpleHelpers FileEncoding Nuget package wraps a C  port of the Mozilla Universal Charset Detector into a dead-simple API   var encoding   FileEncoding DetectFileEncoding txtFile

User · Answer

My finally working approach is to try potential candidates of expected encodings by detecting invalid characters in the strings created from the byte array by the encodings  If I don t encounter invalid characters  I suppose the tested encoding works fine for the tested data  For me  having only Latin and German special characters to consider  in order to determine the proper encoding for a byte array  I try to detect invalid characters in a string with this method           lt summary gt          detect invalid characters in string  use to detect improper encoding          lt  summary gt           lt param name  quot s quot  gt  lt  param gt           lt returns gt  lt  returns gt      public static bool DetectInvalidChars string s                const string specialChars    quot  r n t     -    quot          amp             lt  gt                       quot           return s Any ch   gt                 specialChars Contains ch                  ch  gt    0   amp  amp  ch  lt    9                   ch  gt    a   amp  amp  ch  lt    z                   ch  gt    A   amp  amp  ch  lt    Z              NB  if you have other Latin-based languages to consider  you might want to adapt the specialChars const string in the code  Then I use it like this  I only expect UTF-8 or Default encoding              determine encoding by detecting invalid characters in string         var invoiceXmlText   Encoding UTF8 GetString invoiceXmlBytes      try utf-8 first         if  StringFuncs DetectInvalidChars invoiceXmlText               invoiceXmlText   Encoding Default GetString invoiceXmlBytes      fallback to default

User · Answer

I know this is a bit late - but to be clear   A string doesn t really have encoding    in  NET the a string is a collection of char objects  Essentially  if it is a string  it has already been decoded   However if you are reading the contents of a file  which is made of bytes  and wish to convert that to a string  then the file s encoding must be used    NET includes encoding and decoding classes for  ASCII  UTF7  UTF8  UTF32 and more   Most of these encodings contain certain byte-order marks that can be used to distinguish which encoding type was used   The  NET class System IO StreamReader is able to determine the encoding used within a stream  by reading those byte-order marks   Here is an example            lt summary gt          return the detected encoding and the contents of the file           lt  summary gt           lt param name  fileName  gt  lt  param gt           lt param name  contents  gt  lt  param gt           lt returns gt  lt  returns gt      public static Encoding DetectEncoding String fileName  out String contents                   open the file with the stream-reader          using  StreamReader reader   new StreamReader fileName  true                            read the contents of the file into a string             contents   reader ReadToEnd                    return the encoding              return reader CurrentEncoding

User · Answer

I found new library on GitHub  CharsetDetector UTF-unknown     Charset detector build in C  -  NET Core 2-3   NET standard 1-2  amp   NET 4    it s also a port of the Mozilla Universal Charset Detector based on other repositories   CharsetDetector UTF-unknown have a class named CharsetDetector   CharsetDetector contains some static encoding detect methods    CharsetDetector DetectFromFile   CharsetDetector DetectFromStream   CharsetDetector DetectFromBytes     detected result is in class DetectionResult has attribute Detected which is instance of class DetectionDetail with below attributes    EncodingName Encoding Confidence   below is an example to show usage      Program cs using System  using System Text  using UtfUnknown   namespace ConsoleExample       public class Program               public static void Main string   args                        string filename     E  new-file txt               DetectDemo filename                           lt summary gt              Command line example  detect the encoding of the given file               lt  summary gt               lt param name  filename  gt a filename lt  param gt          public static void DetectDemo string filename                           Detect from File             DetectionResult result   CharsetDetector DetectFromFile filename                  Get the best Detection             DetectionDetail resultDetected   result Detected                  detected result may be null              if  resultDetected    null                                   Get the alias of the found encoding                 string encodingName   resultDetected EncodingName                     Get the System Text Encoding of the found encoding  can be null if not available                  Encoding encoding   resultDetected Encoding                     Get the confidence of the found encoding  between 0 and 1                  float confidence   resultDetected Confidence                   if  encoding    null                                        Console WriteLine   Detection completed   filename                         Console WriteLine   EncodingWebName   encoding WebName  Environment NewLine Confidence   confidence                                       else                                       Console WriteLine   Detection completed   filename                         Console WriteLine    Encoding is null  Environment NewLine EncodingName   encodingName  Environment NewLine Confidence   confidence                                                 else                               Console WriteLine   Detection failed   filename                                       example result screenshot

User · Answer

Check out Utf8Checker it is simple class that does exactly this in pure managed code  http   utf8checker codeplex com  Notice  as already pointed out  determine encoding  makes sense only for byte streams  If you have a string it is already encoded from someone along the way who already knew or guessed the encoding to get the string in the first place

User · Answer

The code below has the following features    Detection or attempted detection of UTF-7  UTF-8 16 32  bom  no bom  little  amp  big endian  Falls back to the local default codepage if no Unicode encoding was found  Detects  with high probability  unicode files with the BOM signature missing Searches for charset xyz and encoding xyz inside file to help determine encoding  To save processing  you can  taste  the file  definable number of bytes   The encoding and decoded text file is returned  Purely byte-based solution for efficiency   As others have said  no solution can be perfect  and certainly one can t easily differentiate between the various 8-bit extended ASCII encodings in use worldwide   but we can get  good enough  especially if the developer also presents to the user a list of alternative encodings as shown here  What is the most common encoding of each language   A full list of Encodings can be found using Encoding GetEncodings        Function to detect the encoding for UTF-7  UTF-8 16 32  bom  no bom  little     amp  big endian   and local default codepage  and potentially other codepages      taster    number of bytes to check of the file  to save processing   Higher    value is slower  but more reliable  especially UTF-8 with special characters    later on may appear to be ASCII initially   If taster   0  then taster    becomes the length of the file  for maximum reliability    text  is simply    the string with the discovered encoding applied to the file  public Encoding detectTextEncoding string filename  out String text  int taster   1000        byte   b   File ReadAllBytes filename                         First check the low hanging fruit by checking if a                      BOM signature exists  sourced from http   www unicode org faq utf bom html bom4      if  b Length  gt   4  amp  amp  b 0     0x00  amp  amp  b 1     0x00  amp  amp  b 2     0xFE  amp  amp  b 3     0xFF    text   Encoding GetEncoding  utf-32BE   GetString b  4  b Length - 4   return Encoding GetEncoding  utf-32BE          UTF-32  big-endian      else if  b Length  gt   4  amp  amp  b 0     0xFF  amp  amp  b 1     0xFE  amp  amp  b 2     0x00  amp  amp  b 3     0x00    text   Encoding UTF32 GetString b  4  b Length - 4   return Encoding UTF32          UTF-32  little-endian     else if  b Length  gt   2  amp  amp  b 0     0xFE  amp  amp  b 1     0xFF    text   Encoding BigEndianUnicode GetString b  2  b Length - 2   return Encoding BigEndianUnicode           UTF-16  big-endian     else if  b Length  gt   2  amp  amp  b 0     0xFF  amp  amp  b 1     0xFE    text   Encoding Unicode GetString b  2  b Length - 2   return Encoding Unicode                    UTF-16  little-endian     else if  b Length  gt   3  amp  amp  b 0     0xEF  amp  amp  b 1     0xBB  amp  amp  b 2     0xBF    text   Encoding UTF8 GetString b  3  b Length - 3   return Encoding UTF8       UTF-8     else if  b Length  gt   3  amp  amp  b 0     0x2b  amp  amp  b 1     0x2f  amp  amp  b 2     0x76    text   Encoding UTF7 GetString b 3 b Length-3   return Encoding UTF7       UTF-7                    If the code reaches here  no BOM signature was found  so now                  we need to  taste  the file to see if can manually discover                  the encoding  A high taster value is desired for UTF-8     if  taster    0    taster  gt  b Length  taster   b Length        Taster size can t be bigger than the filesize obviously           Some text files are encoded in UTF8  but have no BOM signature  Hence        the below manually checks for a UTF8 pattern  This code is based off        the top answer at  https   stackoverflow com questions 6555015 check-for-invalid-utf8        For our purposes  an unnecessarily strict  and terser slower         implementation is shown at  https   stackoverflow com questions 1031645 how-to-detect-utf-8-in-plain-c        For the below  false positives should be exceedingly rare  and would        be either slightly malformed UTF-8  which would suit our purposes        anyway  or 8-bit extended ASCII UTF-16 32 at a vanishingly long shot       int i   0      bool utf8   false      while  i  lt  taster - 4                if  b i   lt   0x7F    i    1  continue           If all characters are below 0x80  then it is valid UTF8  but UTF8 is not  required   and therefore the text is more desirable to be treated as the default codepage of the computer   Hence  there s no  utf8   true   code unlike the next three checks          if  b i   gt   0xC2  amp  amp  b i   lt   0xDF  amp  amp  b i   1   gt   0x80  amp  amp  b i   1   lt  0xC0    i    2  utf8   true  continue            if  b i   gt   0xE0  amp  amp  b i   lt   0xF0  amp  amp  b i   1   gt   0x80  amp  amp  b i   1   lt  0xC0  amp  amp  b i   2   gt   0x80  amp  amp  b i   2   lt  0xC0    i    3  utf8   true  continue            if  b i   gt   0xF0  amp  amp  b i   lt   0xF4  amp  amp  b i   1   gt   0x80  amp  amp  b i   1   lt  0xC0  amp  amp  b i   2   gt   0x80  amp  amp  b i   2   lt  0xC0  amp  amp  b i   3   gt   0x80  amp  amp  b i   3   lt  0xC0    i    4  utf8   true  continue            utf8   false  break            if  utf8    true            text   Encoding UTF8 GetString b           return Encoding UTF8                 The next check is a heuristic attempt to detect UTF-16 without a BOM         We simply look for zeroes in odd or even byte places  and if a certain        threshold is reached  the code is  probably  UF-16                double threshold   0 1     proportion of chars step 2 which must be zeroed to be diagnosed as utf-16  0 1   10      int count   0      for  int n   0  n  lt  taster  n    2  if  b n     0  count        if    double count    taster  gt  threshold    text   Encoding BigEndianUnicode GetString b   return Encoding BigEndianUnicode        count   0      for  int n   1  n  lt  taster  n    2  if  b n     0  count        if    double count    taster  gt  threshold    text   Encoding Unicode GetString b   return Encoding Unicode        little-endian           Finally  a long shot - let s see if we can find  charset xyz  or         encoding xyz  to identify the encoding      for  int n   0  n  lt  taster-9  n                  if                 b n   0      c     b n   0      C    amp  amp   b n   1      h     b n   1      H    amp  amp   b n   2      a     b n   2      A    amp  amp   b n   3      r     b n   3      R    amp  amp   b n   4      s     b n   4      S    amp  amp   b n   5      e     b n   5      E    amp  amp   b n   6      t     b n   6      T    amp  amp   b n   7                            b n   0      e     b n   0      E    amp  amp   b n   1      n     b n   1      N    amp  amp   b n   2      c     b n   2      C    amp  amp   b n   3      o     b n   3      O    amp  amp   b n   4      d     b n   4      D    amp  amp   b n   5      i     b n   5      I    amp  amp   b n   6      n     b n   6      N    amp  amp   b n   7      g     b n   7      G    amp  amp   b n   8                                               if  b n   0      c     b n   0      C   n    8  else n    9              if  b n            b n           n                int oldn   n              while  n  lt  taster  amp  amp   b n            b n      -      b n   gt    0   amp  amp  b n   lt    9       b n   gt    a   amp  amp  b n   lt    z       b n   gt    A   amp  amp  b n   lt    Z                   n                  byte   nb   new byte n-oldn               Array Copy b  oldn  nb  0  n-oldn               try                   string internalEnc   Encoding ASCII GetString nb                   text   Encoding GetEncoding internalEnc  GetString b                   return Encoding GetEncoding internalEnc                             catch   break          If C  doesn t recognize the name of the encoding  break                           If all else fails  the encoding is probably  though certainly not        definitely  the user s local codepage  One might present to the user a        list of alternative encodings as shown here  https   stackoverflow com questions 8509339 what-is-the-most-common-encoding-of-each-language        A full list can be found using Encoding GetEncodings        text   Encoding Default GetString b       return Encoding Default

User · Answer

It depends where the string  came from    A  NET string is Unicode  UTF-16    The only way it could be different if you  say  read the data from a database into a byte array   This CodeProject article might be of interest  Detect Encoding for in- and outgoing text  Jon Skeet s Strings in C  and  NET is an excellent explanation of  NET strings

User · Answer

Another option  very late in coming  sorry   http   www architectshack com TextFileEncodingDetector ashx  This small C -only class uses BOMS if present  tries to auto-detect possible unicode encodings otherwise  and falls back if none of the Unicode encodings is possible or likely   It sounds like UTF8Checker referenced above does something similar  but I think this is slightly broader in scope - instead of just UTF8  it also checks for other possible Unicode encodings  UTF-16 LE or BE  that might be missing a BOM   Hope this helps someone

User · Answer

My solution is to use built-in stuffs with some fallbacks   I picked the strategy from an answer to another similar question on stackoverflow but I can t find it now   It checks the BOM first using the built-in logic in StreamReader  if there s BOM  the encoding will be something other than Encoding Default  and we should trust that result   If not  it checks whether the bytes sequence is valid UTF-8 sequence  if it is  it will guess UTF-8 as the encoding  and if not  again  the default ASCII encoding will be the result   static Encoding getEncoding string path        var stream   new FileStream path  FileMode Open       var reader   new StreamReader stream  Encoding Default  true       reader Read         if  reader CurrentEncoding    Encoding Default            reader Close            return reader CurrentEncoding             stream Position   0       reader   new StreamReader stream  new UTF8Encoding false  true        try           reader ReadToEnd            reader Close            return Encoding UTF8            catch  Exception            reader Close            return Encoding Default

User · Answer

Note  this was an experiment to see how UTF-8 encoding worked internally  The solution offered by vilicvane  to use a UTF8Encoding object that is initialised to throw an exception on decoding failure  is much simpler  and basically does the same thing     I wrote this piece of code to differentiate between UTF-8 and Windows-1252  It shouldn t be used for gigantic text files though  since it loads the entire thing into memory and scans it completely  I used it for  srt subtitle files  just to be able to save them back in the encoding in which they were loaded   The encoding given to the function as ref should be the 8-bit fallback encoding to use in case the file is detected as not being valid UTF-8  generally  on Windows systems  this will be Windows-1252  This doesn t do anything fancy like checking actual valid ascii ranges though  and doesn t detect UTF-16 even on byte order mark   The theory behind the bitwise detection can be found here  https   ianthehenry com 2015 1 17 decoding-utf-8   Basically  the bit range of the first byte determines how many after it are part of the UTF-8 entity  These bytes after it are always in the same bit range        lt summary gt      Reads a text file  and detects whether its encoding is valid UTF-8 or ascii      If not  decodes the text using the given fallback encoding      Bit-wise mechanism for detecting valid UTF-8 based on     https   ianthehenry com 2015 1 17 decoding-utf-8       lt  summary gt       lt param name  docBytes  gt The bytes read from the file  lt  param gt       lt param name  encoding  gt The default encoding to use as fallback if the text is detected not to be pure ascii or UTF-8 compliant  This ref parameter is changed to the detected encoding  lt  param gt       lt returns gt The contents of the read file  as String  lt  returns gt  public static String ReadFileAndGetEncoding Byte   docBytes  ref Encoding encoding        if  encoding    null          encoding   Encoding GetEncoding 1252       Int32 len   docBytes Length         byte order mark for utf-8  Easiest way of detecting encoding      if  len  gt  3  amp  amp  docBytes 0     0xEF  amp  amp  docBytes 1     0xBB  amp  amp  docBytes 2     0xBF                encoding   new UTF8Encoding true              Note that even when initialising an encoding to have            a BOM  it does not cut it off the front of the input          return encoding GetString docBytes  3  len - 3             Boolean isPureAscii   true      Boolean isUtf8Valid   true      for  Int32 i   0  i  lt  len    i                Int32 skip   TestUtf8 docBytes  i           if  skip    0              continue          if  isPureAscii              isPureAscii   false          if  skip  lt  0                        isUtf8Valid   false                 if invalid utf8 is detected  there s no sense in going on              break                    i    skip            if  isPureAscii          encoding   new ASCIIEncoding       pure 7-bit ascii      else if  isUtf8Valid          encoding   new UTF8Encoding false          else  retain given encoding  This should be an 8-bit encoding like Windows-1252      return encoding GetString docBytes           lt summary gt      Tests if the bytes following the given offset are UTF-8 valid  and     returns the amount of bytes to skip ahead to do the next read if it is      If the text is not UTF-8 valid it returns -1       lt  summary gt       lt param name  binFile  gt Byte array to test lt  param gt       lt param name  offset  gt Offset in the byte array to test  lt  param gt       lt returns gt The amount of bytes to skip ahead for the next read  or -1 if the byte sequence wasn t valid UTF-8 lt  returns gt  public static Int32 TestUtf8 Byte   binFile  Int32 offset           7 bytes  so 6 added bytes  is the maximum the UTF-8 design could support         but in reality it only goes up to 3  meaning the full amount is 4      const Int32 maxUtf8Length   4      Byte current   binFile offset       if   current  amp  0x80     0          return 0     valid 7-bit ascii  Added length is 0 bytes      Int32 len   binFile Length      for  Int32 addedlength   1  addedlength  lt  maxUtf8Length    addedlength                Int32 fullmask   0x80          Int32 testmask   0             This code adds shifted bits to get the desired full mask             If the full mask is  111 0 0000  then test mask will be  110 0 0000  Since this is            effectively always the previous step in the iteration I just store it each time          for  Int32 i   0  i  lt   addedlength    i                        testmask   fullmask              fullmask     0x80  gt  gt   i 1                         figure out bit masks from level         if   current  amp  fullmask     testmask                        if  offset   addedlength  gt   len                  return -1                 Lookahead  Pattern of any following bytes is always 10xxxxxx             for  Int32 i   1  i  lt   addedlength    i                                if   binFile offset   i   amp  0xC0     0x80                      return -1                            return addedlength                         Value is greater than the maximum allowed for utf8  Deemed invalid      return -1

[c#] Determine a string's encoding in C#

Examples related to c#

Examples related to string

Examples related to encoding