How to detect the character encoding of a text file

Question

I am trying to detect which character encoding is used in my file. I tried this code to get the standard encoding:

    public static Encoding GetFileEncoding(string srcFile)
    {
        // *** Use Default of Encoding.Default (Ansi CodePage)
        Encoding enc = Encoding.Default;

        // *** Detect byte order mark if any - otherwise assume default
        byte[] buffer = new byte[5];
        FileStream file = new FileStream(srcFile, FileMode.Open);
        file.Read(buffer, 0, 5);
        file.Close();

        if (buffer[0] == 0xef && buffer[1] == 0xbb && buffer[2] == 0xbf)
            enc = Encoding.UTF8;
        else if (buffer[0] == 0xfe && buffer[1] == 0xff)
            enc = Encoding.Unicode;
        else if (buffer[0] == 0 && buffer[1] == 0 && buffer[2] == 0xfe && buffer[3] == 0xff)
            enc = Encoding.UTF32;
        else if (buffer[0] == 0x2b && buffer[1] == 0x2f && buffer[2] == 0x76)
            enc = Encoding.UTF7;
        else if (buffer[0] == 0xFE && buffer[1] == 0xFF)
            // 1201 unicodeFFFE Unicode (Big-Endian)
            enc = Encoding.GetEncoding(1201);
        else if (buffer[0] == 0xFF && buffer[1] == 0xFE)
            // 1200 utf-16 Unicode
            enc = Encoding.GetEncoding(1200);

        return enc;
    }

My first five bytes are 60, 118, 56, 46 and 49. Is there a chart that shows which encoding matches those first five bytes?

User · Answer

You should read this: How can I detect the encoding/codepage of a text file?

User · Answer

You can't depend on the file having a BOM. UTF-8 doesn't require it, and non-Unicode encodings don't even have a BOM. There are, however, other ways to detect the encoding.

UTF-32

BOM is 00 00 FE FF (for BE) or FF FE 00 00 (for LE).

But UTF-32 is easy to detect even without a BOM. This is because the Unicode code point range is restricted to U+10FFFF, and thus UTF-32 units always have the pattern 00 {00-10} xx xx (for BE) or xx xx {00-10} 00 (for LE). If the data has a length that's a multiple of 4 and follows one of these patterns, you can safely assume it's UTF-32. False positives are nearly impossible due to the rarity of 00 bytes in byte-oriented encodings.

US-ASCII

No BOM, but you don't need one. ASCII can be easily identified by the lack of bytes in the 80-FF range.

UTF-8

BOM is EF BB BF. But you can't rely on this. Lots of UTF-8 files don't have a BOM, especially if they originated on non-Windows systems.

But you can safely assume that if a file validates as UTF-8, it is UTF-8. False positives are rare.

Specifically, given that the data is not ASCII, the false positive rate for a 2-byte sequence is only 3.9% (1920/49152). For a 7-byte sequence, it's less than 1%. For a 12-byte sequence, it's less than 0.1%. For a 24-byte sequence, it's less than 1 in a million.

UTF-16

BOM is FE FF (for BE) or FF FE (for LE). Note that the UTF-16LE BOM is found at the start of the UTF-32LE BOM, so check UTF-32 first.

If you happen to have a file that consists mainly of ISO-8859-1 characters, having half of the file's bytes be 00 would also be a strong indicator of UTF-16.

Otherwise, the only reliable way to recognize UTF-16 without a BOM is to look for surrogate pairs (D[8-B]xx D[C-F]xx), but non-BMP characters are too rarely used to make this approach practical.

XML

If your file starts with the bytes 3C 3F 78 6D 6C (i.e., the ASCII characters "<?xml"), then look for an encoding= declaration. If present, use that encoding. If absent, assume UTF-8, which is the default XML encoding.

If you need to support EBCDIC, also look for the equivalent sequence 4C 6F A7 94 93.

In general, if you have a file format that contains an encoding declaration, then look for that declaration rather than trying to guess the encoding.

None of the above

There are hundreds of other encodings, which require more effort to detect. I recommend trying Mozilla's charset detector or a .NET port of it.

A reasonable default

If you've ruled out the UTF encodings, and don't have an encoding declaration or statistical detection that points to a different encoding, assume ISO-8859-1 or the closely related Windows-1252. (Note that the latest HTML standard requires an "ISO-8859-1" declaration to be interpreted as Windows-1252.) Being Windows' default code page for English (and other popular languages like Spanish, Portuguese, German, and French), it's the most commonly encountered encoding other than UTF-8.
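The BOM-less checks described above can be sketched roughly as follows. This is only an illustration, assuming the whole file has already been read into a byte array; the class and method names (BomlessChecks, IsAscii, LooksLikeUtf32, IsValidUtf8) are hypothetical and not part of any library.

```csharp
using System;
using System.Text;

public static class BomlessChecks
{
    // US-ASCII: no byte in the 80-FF range.
    public static bool IsAscii(byte[] data)
    {
        foreach (byte b in data)
            if (b >= 0x80) return false;
        return true;
    }

    // UTF-32: length is a non-zero multiple of 4 and every 4-byte unit matches
    // 00 {00-10} xx xx (big-endian) or xx xx {00-10} 00 (little-endian).
    public static bool LooksLikeUtf32(byte[] data, bool bigEndian)
    {
        if (data.Length == 0 || data.Length % 4 != 0) return false;
        for (int i = 0; i < data.Length; i += 4)
        {
            byte top = bigEndian ? data[i] : data[i + 3];
            byte plane = bigEndian ? data[i + 1] : data[i + 2];
            if (top != 0x00 || plane > 0x10) return false;
        }
        return true;
    }

    // UTF-8: a strict decoder (throwOnInvalidBytes = true) throws on any
    // malformed sequence, so "decodes without throwing" means "validates".
    public static bool IsValidUtf8(byte[] data)
    {
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

A caller would typically try LooksLikeUtf32 first, then IsAscii, then IsValidUtf8, and only fall back to a default such as Windows-1252 when all of them fail. Note that a sample cut off in the middle of a multi-byte character will fail the strict UTF-8 check, so this sketch is best applied to complete data.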

User · Answer

Several answers are here, but nobody has posted useful code.

Here is my code that detects all encodings that Microsoft detects in Framework 4 in the StreamReader class.

Obviously you must call this function immediately after opening the stream, before reading anything else from it, because the BOM is the first bytes in the stream.

This function requires a Stream that can seek (for example a FileStream). If you have a Stream that cannot seek, you must write more complicated code that returns a Byte buffer with the bytes that have already been read but that are not part of the BOM.

    /// <summary>
    /// UTF8    : EF BB BF
    /// UTF16 BE: FE FF
    /// UTF16 LE: FF FE
    /// UTF32 BE: 00 00 FE FF
    /// UTF32 LE: FF FE 00 00
    /// </summary>
    public static Encoding DetectEncoding(Stream i_Stream)
    {
        if (!i_Stream.CanSeek || !i_Stream.CanRead)
            throw new Exception("DetectEncoding() requires a seekable and readable Stream");

        // Try to read 4 bytes. If the stream is shorter, fewer bytes will be read.
        Byte[] u8_Buf = new Byte[4];
        int s32_Count = i_Stream.Read(u8_Buf, 0, 4);
        if (s32_Count >= 2)
        {
            if (u8_Buf[0] == 0xFE && u8_Buf[1] == 0xFF)
            {
                i_Stream.Position = 2;
                return new UnicodeEncoding(true, true);
            }

            if (u8_Buf[0] == 0xFF && u8_Buf[1] == 0xFE)
            {
                if (s32_Count >= 4 && u8_Buf[2] == 0 && u8_Buf[3] == 0)
                {
                    i_Stream.Position = 4;
                    return new UTF32Encoding(false, true);
                }
                else
                {
                    i_Stream.Position = 2;
                    return new UnicodeEncoding(false, true);
                }
            }

            if (s32_Count >= 3 && u8_Buf[0] == 0xEF && u8_Buf[1] == 0xBB && u8_Buf[2] == 0xBF)
            {
                i_Stream.Position = 3;
                return Encoding.UTF8;
            }

            if (s32_Count >= 4 && u8_Buf[0] == 0 && u8_Buf[1] == 0 && u8_Buf[2] == 0xFE && u8_Buf[3] == 0xFF)
            {
                i_Stream.Position = 4;
                return new UTF32Encoding(true, true);
            }
        }

        i_Stream.Position = 0;
        return Encoding.Default;
    }
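For the non-seekable case mentioned above, one possible shape is sketched below. It is an assumption about how such a helper could look, not the answer author's code: the names BomSniffer, DetectBom and leftover are hypothetical. It reads at most four bytes, reports the encoding implied by the BOM (or null when there is none), and hands back the bytes that were consumed but do not belong to the BOM.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomSniffer
{
    // Returns the encoding indicated by a BOM, or null when no BOM is found.
    // 'leftover' receives the already-read bytes that are NOT part of the BOM,
    // so the caller can prepend them to whatever it reads next.
    public static Encoding DetectBom(Stream stream, out byte[] leftover)
    {
        byte[] buf = new byte[4];
        int count = 0;
        // Stream.Read may return fewer bytes than requested, so loop until
        // four bytes have arrived or the stream ends.
        while (count < buf.Length)
        {
            int n = stream.Read(buf, count, buf.Length - count);
            if (n == 0) break;
            count += n;
        }

        Encoding enc = null;
        int bomLength = 0;
        // UTF-32 LE is tested before UTF-16 LE because FF FE is a prefix of FF FE 00 00.
        if (count >= 4 && buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0 && buf[3] == 0)
        { enc = new UTF32Encoding(false, true); bomLength = 4; }
        else if (count >= 4 && buf[0] == 0 && buf[1] == 0 && buf[2] == 0xFE && buf[3] == 0xFF)
        { enc = new UTF32Encoding(true, true); bomLength = 4; }
        else if (count >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        { enc = new UTF8Encoding(true); bomLength = 3; }
        else if (count >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        { enc = new UnicodeEncoding(true, true); bomLength = 2; }
        else if (count >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        { enc = new UnicodeEncoding(false, true); bomLength = 2; }

        leftover = new byte[count - bomLength];
        Array.Copy(buf, bomLength, leftover, 0, leftover.Length);
        return enc;
    }
}
```

The caller decides what to do with a null result (for example, fall back to Encoding.Default as the function above does) and must decode the leftover bytes before continuing with the rest of the stream.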

User · Answer

If your file starts with the bytes 60, 118, 56, 46 and 49, then you have an ambiguous case. It could be UTF-8 (without BOM) or any of the single-byte encodings like ASCII, ANSI, ISO-8859-1, etc.
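To see why these bytes are ambiguous, the small sketch below (the class name FirstBytesDemo is hypothetical) decodes them with three different encodings. All five values are below 0x80, so every ASCII-compatible encoding yields the same text.

```csharp
using System;
using System.Text;

public static class FirstBytesDemo
{
    // Decodes the five bytes from the question with three ASCII-compatible encodings.
    public static string[] DecodeAll()
    {
        byte[] firstBytes = { 60, 118, 56, 46, 49 };
        return new[]
        {
            Encoding.ASCII.GetString(firstBytes),                     // "<v8.1"
            Encoding.UTF8.GetString(firstBytes),                      // "<v8.1"
            Encoding.GetEncoding("iso-8859-1").GetString(firstBytes)  // "<v8.1"
        };
    }
}
```

Since the results are identical, the first five bytes alone cannot tell these encodings apart; only bytes in the 80-FF range later in the file would.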

User · Answer

Yes, there is one here: http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding

User · Answer

If you want to pursue a "simple" solution, you might find this class I put together useful:

http://www.architectshack.com/TextFileEncodingDetector.ashx

It does the BOM detection automatically first, and then tries to differentiate between Unicode encodings without BOM, vs some other default encoding (generally Windows-1252, incorrectly labelled as Encoding.ASCII in .Net).

As noted above, a "heavier" solution involving NCharDet or MLang may be more appropriate, and as I note on the overview page of this class, the best is to provide some form of interactivity with the user if at all possible, because there simply is no 100% detection rate possible!

Snippet in case the site is offline:

    using System;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.IO;

    namespace KlerksSoft
    {
        public static class TextFileEncodingDetector
        {
            /*
             * Simple class to handle text file encoding woes (in a primarily English-speaking tech
             *      world).
             *
             *  - This code is fully managed, no shady calls to MLang (the unmanaged codepage
             *      detection library originally developed for Internet Explorer).
             *
             *  - This class does NOT try to detect arbitrary codepages/charsets, it really only
             *      aims to differentiate between some of the most common variants of Unicode
             *      encoding, and a "default" (western / ascii-based) encoding alternative provided
             *      by the caller.
             *
             *  - As there is no "Reliable" way to distinguish between UTF-8 (without BOM) and
             *      Windows-1252 (in .Net, also incorrectly called "ASCII") encodings, we use a
             *      heuristic - so the more of the file we can sample the better the guess. If you
             *      are going to read the whole file into memory at some point, then best to pass
             *      in the whole byte array directly. Otherwise, decide how to trade off
             *      reliability against performance / memory usage.
             *
             *  - The UTF-8 detection heuristic only works for western text, as it relies on
             *      the presence of UTF-8 encoded accented and other characters found in the upper
             *      ranges of the Latin-1 and (particularly) Windows-1252 codepages.
             *
             *  - For more general detection routines, see existing projects / resources:
             *    - MLang - Microsoft library originally for IE6, available in Windows XP and later APIs now (I think?)
             *      - MLang .Net bindings: http://www.codeproject.com/KB/recipes/DetectEncoding.aspx
             *    - CharDet - Mozilla browser's detection routines
             *      - Ported to Java then .Net: http://www.conceptdevelopment.net/Localization/NCharDet/
             *      - Ported straight to .Net: http://code.google.com/p/chardetsharp/source/browse
             *
             * Copyright Tao Klerks, 2010-2012, tao@klerks.biz
             * Licensed under the modified BSD license:
             *
    Redistribution and use in source and binary forms, with or without modification, are
    permitted provided that the following conditions are met:
     - Redistributions of source code must retain the above copyright notice, this list of
    conditions and the following disclaimer.
     - Redistributions in binary form must reproduce the above copyright notice, this list
    of conditions and the following disclaimer in the documentation and/or other materials
    provided with the distribution.
     - The name of the author may not be used to endorse or promote products derived from
    this software without specific prior written permission.
    THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,
    INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY
    DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
    BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
    PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
    WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
    ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
    OF SUCH DAMAGE.
             *
             * CHANGELOG:
             *  - 2012-02-03:
             *    - Simpler methods, removing the silly "DefaultEncoding" parameter (with "??" operator, saves no typing)
             *    - More complete methods
             *      - Optionally return indication of whether BOM was found in "Detect" methods
             *      - Provide straight-to-string method for byte arrays (GetStringFromByteArray)
             */

            const long _defaultHeuristicSampleSize = 0x10000; //completely arbitrary - inappropriate for high numbers of files / high speed requirements

            public static Encoding DetectTextFileEncoding(string InputFilename)
            {
                using (FileStream textfileStream = File.OpenRead(InputFilename))
                {
                    return DetectTextFileEncoding(textfileStream, _defaultHeuristicSampleSize);
                }
            }

            public static Encoding DetectTextFileEncoding(FileStream InputFileStream, long HeuristicSampleSize)
            {
                bool uselessBool = false;
                // Fixed: pass the caller's sample size through instead of always using the default.
                return DetectTextFileEncoding(InputFileStream, HeuristicSampleSize, out uselessBool);
            }

            public static Encoding DetectTextFileEncoding(FileStream InputFileStream, long HeuristicSampleSize, out bool HasBOM)
            {
                // Fixed: ArgumentNullException takes (paramName, message), in that order.
                if (InputFileStream == null)
                    throw new ArgumentNullException("InputFileStream", "Must provide a valid Filestream!");

                if (!InputFileStream.CanRead)
                    throw new ArgumentException("Provided file stream is not readable!", "InputFileStream");

                if (!InputFileStream.CanSeek)
                    throw new ArgumentException("Provided file stream cannot seek!", "InputFileStream");

                Encoding encodingFound = null;

                long originalPos = InputFileStream.Position;

                InputFileStream.Position = 0;

                //First read only what we need for BOM detection
                byte[] bomBytes = new byte[InputFileStream.Length > 4 ? 4 : InputFileStream.Length];
                InputFileStream.Read(bomBytes, 0, bomBytes.Length);

                encodingFound = DetectBOMBytes(bomBytes);

                if (encodingFound != null)
                {
                    InputFileStream.Position = originalPos;
                    HasBOM = true;
                    return encodingFound;
                }

                //BOM Detection failed, going for heuristics now.
                //  create sample byte array and populate it
                byte[] sampleBytes = new byte[HeuristicSampleSize > InputFileStream.Length ? InputFileStream.Length : HeuristicSampleSize];
                Array.Copy(bomBytes, sampleBytes, bomBytes.Length);
                if (InputFileStream.Length > bomBytes.Length)
                    InputFileStream.Read(sampleBytes, bomBytes.Length, sampleBytes.Length - bomBytes.Length);
                InputFileStream.Position = originalPos;

                //test byte array content
                encodingFound = DetectUnicodeInByteSampleByHeuristics(sampleBytes);

                HasBOM = false;
                return encodingFound;
            }

            public static Encoding DetectTextByteArrayEncoding(byte[] TextData)
            {
                bool uselessBool = false;
                return DetectTextByteArrayEncoding(TextData, out uselessBool);
            }

            public static Encoding DetectTextByteArrayEncoding(byte[] TextData, out bool HasBOM)
            {
                if (TextData == null)
                    throw new ArgumentNullException("TextData", "Must provide a valid text data byte array!");

                Encoding encodingFound = null;

                encodingFound = DetectBOMBytes(TextData);

                if (encodingFound != null)
                {
                    HasBOM = true;
                    return encodingFound;
                }
                else
                {
                    //test byte array content
                    encodingFound = DetectUnicodeInByteSampleByHeuristics(TextData);

                    HasBOM = false;
                    return encodingFound;
                }
            }

            public static string GetStringFromByteArray(byte[] TextData, Encoding DefaultEncoding)
            {
                return GetStringFromByteArray(TextData, DefaultEncoding, _defaultHeuristicSampleSize);
            }

            public static string GetStringFromByteArray(byte[] TextData, Encoding DefaultEncoding, long MaxHeuristicSampleSize)
            {
                if (TextData == null)
                    throw new ArgumentNullException("TextData", "Must provide a valid text data byte array!");

                Encoding encodingFound = null;

                encodingFound = DetectBOMBytes(TextData);

                if (encodingFound != null)
                {
                    //For some reason, the default encodings don't detect/swallow their own preambles!!
                    return encodingFound.GetString(TextData, encodingFound.GetPreamble().Length, TextData.Length - encodingFound.GetPreamble().Length);
                }
                else
                {
                    byte[] heuristicSample = null;
                    if (TextData.Length > MaxHeuristicSampleSize)
                    {
                        heuristicSample = new byte[MaxHeuristicSampleSize];
                        Array.Copy(TextData, heuristicSample, MaxHeuristicSampleSize);
                    }
                    else
                    {
                        heuristicSample = TextData;
                    }

                    // Fixed: run the heuristic on the (possibly truncated) sample, not the full data.
                    encodingFound = DetectUnicodeInByteSampleByHeuristics(heuristicSample) ?? DefaultEncoding;
                    return encodingFound.GetString(TextData);
                }
            }

            public static Encoding DetectBOMBytes(byte[] BOMBytes)
            {
                if (BOMBytes == null)
                    throw new ArgumentNullException("BOMBytes", "Must provide a valid BOM byte array!");

                if (BOMBytes.Length < 2)
                    return null;

                if (BOMBytes[0] == 0xff
                    && BOMBytes[1] == 0xfe
                    && (BOMBytes.Length < 4
                        || BOMBytes[2] != 0
                        || BOMBytes[3] != 0
                        )
                    )
                    return Encoding.Unicode;

                if (BOMBytes[0] == 0xfe
                    && BOMBytes[1] == 0xff
                    )
                    return Encoding.BigEndianUnicode;

                if (BOMBytes.Length < 3)
                    return null;

                if (BOMBytes[0] == 0xef && BOMBytes[1] == 0xbb && BOMBytes[2] == 0xbf)
                    return Encoding.UTF8;

                if (BOMBytes[0] == 0x2b && BOMBytes[1] == 0x2f && BOMBytes[2] == 0x76)
                    return Encoding.UTF7;

                if (BOMBytes.Length < 4)
                    return null;

                if (BOMBytes[0] == 0xff && BOMBytes[1] == 0xfe && BOMBytes[2] == 0 && BOMBytes[3] == 0)
                    return Encoding.UTF32;

                if (BOMBytes[0] == 0 && BOMBytes[1] == 0 && BOMBytes[2] == 0xfe && BOMBytes[3] == 0xff)
                    return Encoding.GetEncoding(12001);

                return null;
            }

            public static Encoding DetectUnicodeInByteSampleByHeuristics(byte[] SampleBytes)
            {
                long oddBinaryNullsInSample = 0;
                long evenBinaryNullsInSample = 0;
                long suspiciousUTF8SequenceCount = 0;
                long suspiciousUTF8BytesTotal = 0;
                long likelyUSASCIIBytesInSample = 0;

                //Cycle through, keeping count of binary null positions, possible UTF-8
                //  sequences from upper ranges of Windows-1252, and probable US-ASCII
                //  character counts.

                long currentPos = 0;
                int skipUTF8Bytes = 0;

                while (currentPos < SampleBytes.Length)
                {
                    //binary null distribution
                    if (SampleBytes[currentPos] == 0)
                    {
                        if (currentPos % 2 == 0)
                            evenBinaryNullsInSample++;
                        else
                            oddBinaryNullsInSample++;
                    }

                    //likely US-ASCII characters
                    if (IsCommonUSASCIIByte(SampleBytes[currentPos]))
                        likelyUSASCIIBytesInSample++;

                    //suspicious sequences (look like UTF-8)
                    if (skipUTF8Bytes == 0)
                    {
                        int lengthFound = DetectSuspiciousUTF8SequenceLength(SampleBytes, currentPos);

                        if (lengthFound > 0)
                        {
                            suspiciousUTF8SequenceCount++;
                            suspiciousUTF8BytesTotal += lengthFound;
                            skipUTF8Bytes = lengthFound - 1;
                        }
                    }
                    else
                    {
                        skipUTF8Bytes--;
                    }

                    currentPos++;
                }

                //1: UTF-16 LE - in english / european environments, this is usually characterized by a
                //  high proportion of odd binary nulls (starting at 0), with (as this is text) a low
                //  proportion of even binary nulls.
                //  The thresholds here used (less than 20% nulls where you expect non-nulls, and more than
                //  60% nulls where you do expect nulls) are completely arbitrary.

                if (((evenBinaryNullsInSample * 2.0) / SampleBytes.Length) < 0.2
                    && ((oddBinaryNullsInSample * 2.0) / SampleBytes.Length) > 0.6
                    )
                    return Encoding.Unicode;

                //2: UTF-16 BE - in english / european environments, this is usually characterized by a
                //  high proportion of even binary nulls (starting at 0), with (as this is text) a low
                //  proportion of odd binary nulls.
                //  The thresholds here used (less than 20% nulls where you expect non-nulls, and more than
                //  60% nulls where you do expect nulls) are completely arbitrary.

                if (((oddBinaryNullsInSample * 2.0) / SampleBytes.Length) < 0.2
                    && ((evenBinaryNullsInSample * 2.0) / SampleBytes.Length) > 0.6
                    )
                    return Encoding.BigEndianUnicode;

                //3: UTF-8 - Martin Dürst outlines a method for detecting whether something CAN be UTF-8 content
                //  using regexp, in his w3c.org unicode FAQ entry:
                //  http://www.w3.org/International/questions/qa-forms-utf-8
                //  adapted here for C#.
                //  Fixed: decode as ISO-8859-1 so each byte maps to the char with the same value.
                //  Encoding.ASCII would turn every byte >= 0x80 into '?', and the regex would then
                //  accept any input.
                string potentiallyMangledString = Encoding.GetEncoding("iso-8859-1").GetString(SampleBytes);
                Regex UTF8Validator = new Regex(@"\A("
                    + @"[\x09\x0A\x0D\x20-\x7E]"
                    + @"|[\xC2-\xDF][\x80-\xBF]"
                    + @"|\xE0[\xA0-\xBF][\x80-\xBF]"
                    + @"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"
                    + @"|\xED[\x80-\x9F][\x80-\xBF]"
                    + @"|\xF0[\x90-\xBF][\x80-\xBF]{2}"
                    + @"|[\xF1-\xF3][\x80-\xBF]{3}"
                    + @"|\xF4[\x80-\x8F][\x80-\xBF]{2}"
                    + @")*\z");
                if (UTF8Validator.IsMatch(potentiallyMangledString))
                {
                    //Unfortunately, just the fact that it CAN be UTF-8 doesn't tell you much about probabilities.
                    //If all the characters are in the 0-127 range, no harm done, most western charsets are same as UTF-8 in these ranges.
                    //If some of the characters were in the upper range (western accented characters), however, they would likely be mangled to 2-byte by the UTF-8 encoding process.
                    // So, we need to play stats.

                    // The "Random" likelihood of any pair of randomly generated characters being one
                    //   of these "suspicious" character sequences is:
                    //     128 / (256 * 256) = 0.2%.
                    //
                    // In western text data, that is SIGNIFICANTLY reduced - most text data stays in the <127
                    //   character range, so we assume that more than 1 in 500,000 of these character
                    //   sequences indicates UTF-8. The number 500,000 is completely arbitrary - so sue me.
                    //
                    // We can only assume these character sequences will be rare if we ALSO assume that this
                    //   IS in fact western text - in which case the bulk of the UTF-8 encoded data (that is
                    //   not already suspicious sequences) should be plain US-ASCII bytes. This, I
                    //   arbitrarily decided, should be 80% (a random distribution, eg binary data, would yield
                    //   approx 40%, so the chances of hitting this threshold by accident in random data are
                    //   VERY low).

                    if ((suspiciousUTF8SequenceCount * 500000.0 / SampleBytes.Length >= 1) //suspicious sequences
                        && (
                               //all suspicious, so cannot evaluate proportion of US-Ascii
                               SampleBytes.Length - suspiciousUTF8BytesTotal == 0
                               ||
                               likelyUSASCIIBytesInSample * 1.0 / (SampleBytes.Length - suspiciousUTF8BytesTotal) >= 0.8
                           )
                        )
                        return Encoding.UTF8;
                }

                return null;
            }

            private static bool IsCommonUSASCIIByte(byte testByte)
            {
                if (testByte == 0x0A //lf
                    || testByte == 0x0D //cr
                    || testByte == 0x09 //tab
                    || (testByte >= 0x20 && testByte <= 0x2F) //common punctuation
                    || (testByte >= 0x30 && testByte <= 0x39) //digits
                    || (testByte >= 0x3A && testByte <= 0x40) //common punctuation
                    || (testByte >= 0x41 && testByte <= 0x5A) //capital letters
                    || (testByte >= 0x5B && testByte <= 0x60) //common punctuation
                    || (testByte >= 0x61 && testByte <= 0x7A) //lowercase letters
                    || (testByte >= 0x7B && testByte <= 0x7E) //common punctuation
                    )
                    return true;
                else
                    return false;
            }

            private static int DetectSuspiciousUTF8SequenceLength(byte[] SampleBytes, long currentPos)
            {
                int lengthFound = 0;

                // Fixed: the length checks use ">" rather than ">=" so that reading
                // currentPos + 1 (or + 2) can never run past the end of the sample.
                if (SampleBytes.Length > currentPos + 1
                    && SampleBytes[currentPos] == 0xC2
                    )
                {
                    if (SampleBytes[currentPos + 1] == 0x81
                        || SampleBytes[currentPos + 1] == 0x8D
                        || SampleBytes[currentPos + 1] == 0x8F
                        )
                        lengthFound = 2;
                    else if (SampleBytes[currentPos + 1] == 0x90
                        || SampleBytes[currentPos + 1] == 0x9D
                        )
                        lengthFound = 2;
                    else if (SampleBytes[currentPos + 1] >= 0xA0
                        && SampleBytes[currentPos + 1] <= 0xBF
                        )
                        lengthFound = 2;
                }
                else if (SampleBytes.Length > currentPos + 1
                    && SampleBytes[currentPos] == 0xC3
                    )
                {
                    if (SampleBytes[currentPos + 1] >= 0x80
                        && SampleBytes[currentPos + 1] <= 0xBF
                        )
                        lengthFound = 2;
                }
                else if (SampleBytes.Length > currentPos + 1
                    && SampleBytes[currentPos] == 0xC5
                    )
                {
                    if (SampleBytes[currentPos + 1] == 0x92
                        || SampleBytes[currentPos + 1] == 0x93
                        )
                        lengthFound = 2;
                    else if (SampleBytes[currentPos + 1] == 0xA0
                        || SampleBytes[currentPos + 1] == 0xA1
                        )
                        lengthFound = 2;
                    else if (SampleBytes[currentPos + 1] == 0xB8
                        || SampleBytes[currentPos + 1] == 0xBD
                        || SampleBytes[currentPos + 1] == 0xBE
                        )
                        lengthFound = 2;
                }
                else if (SampleBytes.Length > currentPos + 1
                    && SampleBytes[currentPos] == 0xC6
                    )
                {
                    if (SampleBytes[currentPos + 1] == 0x92)
                        lengthFound = 2;
                }
                else if (SampleBytes.Length > currentPos + 1
                    && SampleBytes[currentPos] == 0xCB
                    )
                {
                    if (SampleBytes[currentPos + 1] == 0x86
                        || SampleBytes[currentPos + 1] == 0x9C
                        )
                        lengthFound = 2;
                }
                else if (SampleBytes.Length > currentPos + 2
                    && SampleBytes[currentPos] == 0xE2
                    )
                {
                    if (SampleBytes[currentPos + 1] == 0x80)
                    {
                        if (SampleBytes[currentPos + 2] == 0x93
                            || SampleBytes[currentPos + 2] == 0x94
                            )
                            lengthFound = 3;
                        if (SampleBytes[currentPos + 2] == 0x98
                            || SampleBytes[currentPos + 2] == 0x99
                            || SampleBytes[currentPos + 2] == 0x9A
                            )
                            lengthFound = 3;
                        if (SampleBytes[currentPos + 2] == 0x9C
                            || SampleBytes[currentPos + 2] == 0x9D
                            || SampleBytes[currentPos + 2] == 0x9E
                            )
                            lengthFound = 3;
                        if (SampleBytes[currentPos + 2] == 0xA0
                            || SampleBytes[currentPos + 2] == 0xA1
                            || SampleBytes[currentPos + 2] == 0xA2
                            )
                            lengthFound = 3;
                        if (SampleBytes[currentPos + 2] == 0xA6)
                            lengthFound = 3;
                        if (SampleBytes[currentPos + 2] == 0xB0)
                            lengthFound = 3;
                        if (SampleBytes[currentPos + 2] == 0xB9
                            || SampleBytes[currentPos + 2] == 0xBA
                            )
                            lengthFound = 3;
                    }
                    else if (SampleBytes[currentPos + 1] == 0x82
                        && SampleBytes[currentPos + 2] == 0xAC
                        )
                        lengthFound = 3;
                    else if (SampleBytes[currentPos + 1] == 0x84
                        && SampleBytes[currentPos + 2] == 0xA2
                        )
                        lengthFound = 3;
                }

                return lengthFound;
            }
        }
    }

User · Answer

I use Ude, which is a C# port of Mozilla Universal Charset Detector. It is easy to use and gives some really good results.

User · Answer

Use StreamReader and direct it to detect the encoding for you:

    using (var reader = new System.IO.StreamReader(path, true))
    {
        reader.Peek(); // CurrentEncoding is only updated after the first read
        var currentEncoding = reader.CurrentEncoding;
    }

And use Code Page Identifiers (https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx) in order to switch logic depending on it.
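As a slightly fuller sketch of the same idea, the helper below (the names StreamReaderDetection, Detect and fallback are hypothetical) wraps the StreamReader call and makes the fallback explicit. StreamReader only recognises byte order marks, so for a file without a BOM it simply reports the encoding it was constructed with.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamReaderDetection
{
    // Returns the encoding StreamReader infers from the file's BOM,
    // or 'fallback' when the file has no BOM.
    public static Encoding Detect(string path, Encoding fallback)
    {
        using (var reader = new StreamReader(path, fallback, true))
        {
            // Detection happens on the first read; Peek triggers it
            // without consuming any characters.
            reader.Peek();
            return reader.CurrentEncoding;
        }
    }
}
```

The caller can then branch on the result, for example via Encoding.CodePage or Encoding.WebName, and should treat a returned fallback as "unknown" rather than as a positive identification.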
