How to convert a UTF-8 string into Unicode

Question

I have a string that displays UTF-8 encoded characters, and I want to convert it back to Unicode.

For now, my implementation is the following:

    public static string DecodeFromUtf8(this string utf8String)
    {
        // read the string as UTF-8 bytes.
        byte[] encodedBytes = Encoding.UTF8.GetBytes(utf8String);

        // convert them into unicode bytes.
        byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, encodedBytes);

        // builds the converted string.
        return Encoding.Unicode.GetString(encodedBytes);
    }

I am playing with the word "déjà". I have converted it into UTF-8 through this online tool, and so I started to test my method with the string "dÃ©jÃ ".

Unfortunately, with this implementation the string just remains the same.

Where am I wrong?

User · Answer

If you have a UTF-8 string where every byte is correct ('Ö' -> [195, 0], [150, 0]), you can use the following:

    // requires: using System.Collections.Generic; using System.Text;
    public static string Utf8ToUtf16(string utf8String)
    {
        /*
         * Every .NET string stores text with the UTF-16 encoding
         * (known as Encoding.Unicode). Other encodings may exist as a
         * byte array, or incorrectly stored with the UTF-16 encoding.
         *
         * UTF-8 (1 to 4 bytes per char):
         *   [100]            for the ANSI 'd'
         *   [206] and [186]  for the Greek 'κ'
         *
         * UTF-16 (2 bytes per char here):
         *   [100, 0]  for the ANSI 'd'
         *   [186, 3]  for the Greek 'κ'
         *
         * UTF-8 inside UTF-16:
         *   [100, 0]               for the ANSI 'd'
         *   [206, 0] and [186, 0]  for the Greek 'κ'
         *
         * First we need to get the UTF-8 byte array and remove all
         * 0 bytes (binary 0) while doing so.
         *
         * Binary 0 means end of string in a null-terminated UTF-8 string,
         * while in UTF-16 a single binary 0 byte does not end the string.
         * Only two consecutive binary 0 bytes end a UTF-16 string.
         * Because of .NET we don't have to handle this.
         *
         * After removing the binary 0 bytes and receiving the byte array,
         * we can use the UTF-8 encoding's GetString method to get a
         * UTF-16 string.
         */

        // Get UTF-8 bytes and remove binary 0 bytes (filler)
        List<byte> utf8Bytes = new List<byte>(utf8String.Length);
        foreach (byte utf8Byte in utf8String)   // each char is cast to byte
        {
            // Remove binary 0 bytes (filler)
            if (utf8Byte > 0)
            {
                utf8Bytes.Add(utf8Byte);
            }
        }

        // Convert UTF-8 bytes to UTF-16 string
        return Encoding.UTF8.GetString(utf8Bytes.ToArray());
    }

In my case the DLL result is a UTF-8 string too, but unfortunately the UTF-8 string is interpreted with UTF-16 encoding ('Ö' -> [195, 0], [19, 32]). So the ANSI '–', which is 150, was converted to the UTF-16 '–', which is 8211. If you have this case too, you can use the following instead:

    public static string Utf8ToUtf16(string utf8String)
    {
        // Get UTF-8 bytes by reading each byte with ANSI encoding
        // (Encoding.Default is the ANSI code page on .NET Framework only;
        // on .NET Core and later it is always UTF-8)
        byte[] utf8Bytes = Encoding.Default.GetBytes(utf8String);

        // Convert UTF-8 bytes to UTF-16 bytes
        byte[] utf16Bytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);

        // Return UTF-16 bytes as UTF-16 string
        return Encoding.Unicode.GetString(utf16Bytes);
    }

Or the native method:

    // requires: using System.Runtime.InteropServices; using System.Text;
    [DllImport("kernel32.dll")]
    private static extern Int32 MultiByteToWideChar(
        UInt32 CodePage,
        UInt32 dwFlags,
        [MarshalAs(UnmanagedType.LPStr)] String lpMultiByteStr,
        Int32 cbMultiByte,
        [Out, MarshalAs(UnmanagedType.LPWStr)] StringBuilder lpWideCharStr,
        Int32 cchWideChar);

    public static string Utf8ToUtf16(string utf8String)
    {
        // First call: ask for the required length (includes the terminator)
        Int32 iNewDataLen = MultiByteToWideChar(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf8String, -1, null, 0);
        if (iNewDataLen > 1)
        {
            // Second call: do the actual conversion into the buffer
            StringBuilder utf16String = new StringBuilder(iNewDataLen);
            MultiByteToWideChar(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf8String, -1, utf16String, utf16String.Capacity);

            return utf16String.ToString();
        }
        else
        {
            return String.Empty;
        }
    }

If you need it the other way around, see Utf16ToUtf8. Hope I could be of help.
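To tell which of the two cases above applies, it can help to look at the numeric value of each UTF-16 code unit in the mangled string. The following is only a small diagnostic sketch; the class and method names (`CodeUnitDump`, `Dump`) are made up for illustration and are not part of the answer above.

```csharp
using System;

static class CodeUnitDump
{
    // Hypothetical helper: lists the UTF-16 code unit values of a string.
    public static string Dump(string s)
    {
        var parts = new string[s.Length];
        for (int i = 0; i < s.Length; i++)
        {
            parts[i] = ((int)s[i]).ToString();
        }
        return string.Join(" ", parts);
    }

    static void Main()
    {
        // Case 1: every UTF-8 byte of 'Ö' was copied into its own char.
        Console.WriteLine(Dump("\u00C3\u0096")); // 195 150

        // Case 2: the UTF-8 bytes were decoded as ANSI (code page 1252),
        // so byte 150 became the en dash U+2013.
        Console.WriteLine(Dump("\u00C3\u2013")); // 195 8211
    }
}
```

If every value is 255 or lower, the byte-copy approach is likely sufficient; values above 255 suggest that an ANSI code page was involved.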

User · Answer

What you have seems to be a string incorrectly decoded from another encoding, likely code page 1252, which is the US Windows default. Here's how to reverse it, assuming no other loss. One loss not immediately apparent is the non-breaking space (U+00A0) at the end of your string, which is not displayed. Of course it would be better to read the data source correctly in the first place, but perhaps the data source was stored incorrectly to begin with.

    using System;
    using System.Text;

    class Program
    {
        static void Main(string[] args)
        {
            string junk = "dÃ©jÃ\xa0";  // Bad Unicode string

            // Turn string back to bytes using the original, incorrect encoding.
            byte[] bytes = Encoding.GetEncoding(1252).GetBytes(junk);

            // Use the correct encoding this time to convert back to a string.
            string good = Encoding.UTF8.GetString(bytes);
            Console.WriteLine(good);
        }
    }

Result:

    déjà
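One caveat when trying this outside the .NET Framework: on .NET Core and later, code page 1252 is not available until the code pages provider has been registered. The sketch below assumes such a runtime, where `CodePagesEncodingProvider` either ships with the runtime or comes from the System.Text.Encoding.CodePages package. The names `Cp1252Demo` and `Repair` are illustrative only.

```csharp
using System;
using System.Text;

static class Cp1252Demo
{
    public static string Repair(string junk)
    {
        // Assumption: .NET Core / .NET 5+, where Encoding.GetEncoding(1252)
        // throws unless the code pages provider is registered first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Undo the wrong decoding, then decode the bytes as UTF-8.
        byte[] bytes = Encoding.GetEncoding(1252).GetBytes(junk);
        return Encoding.UTF8.GetString(bytes);
    }

    static void Main()
    {
        string good = Repair("d\u00C3\u00A9j\u00C3\u00A0");
        Console.WriteLine(good == "d\u00E9j\u00E0"); // True
    }
}
```

The comparison is printed instead of the string itself so that the result does not depend on the console's output encoding.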

User · Answer

"I have string that displays UTF-8 encoded characters"

There is no such thing in .NET. The string class can only store strings in UTF-16 encoding. A UTF-8 encoded string can only exist as a byte[]. Trying to store bytes into a string will not come to a good end; UTF-8 uses byte values that don't have a valid Unicode codepoint. The content will be destroyed when the string is normalized. So it is already too late to recover the string by the time your DecodeFromUtf8() starts running.

Only handle UTF-8 encoded text with byte[]. And use UTF8Encoding.GetString() to convert it.
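As a rough illustration of that advice, the sketch below keeps the UTF-8 data in a byte[] and decodes it exactly once. The byte values are the UTF-8 encoding of "déjà"; the class and method names are placeholders chosen for this example.

```csharp
using System;
using System.Text;

static class Utf8BytesDemo
{
    // Decode UTF-8 data once, straight from the byte array.
    public static string Decode(byte[] utf8Bytes)
    {
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    static void Main()
    {
        // 'd' = 64, 'é' = C3 A9, 'j' = 6A, 'à' = C3 A0
        byte[] utf8Bytes = { 0x64, 0xC3, 0xA9, 0x6A, 0xC3, 0xA0 };

        string text = Decode(utf8Bytes);
        Console.WriteLine(text.Length);             // 4
        Console.WriteLine(text == "d\u00E9j\u00E0"); // True
    }
}
```

Six bytes become four UTF-16 code units, and no intermediate "UTF-8 string" ever exists.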

User · Answer

So the issue is that UTF-8 code unit values have been stored as a sequence of 16-bit code units in a C# string. You simply need to verify that each code unit is within the range of a byte, copy those values into bytes, and then convert the new UTF-8 byte sequence into UTF-16.

    // requires: using System.Diagnostics; using System.Text;
    public static string DecodeFromUtf8(this string utf8String)
    {
        // copy the string as UTF-8 bytes.
        byte[] utf8Bytes = new byte[utf8String.Length];
        for (int i = 0; i < utf8String.Length; ++i)
        {
            Debug.Assert(utf8String[i] <= 255, "the char must be in byte's range");
            utf8Bytes[i] = (byte)utf8String[i];
        }

        return Encoding.UTF8.GetString(utf8Bytes, 0, utf8Bytes.Length);
    }

    DecodeFromUtf8("d\u00C3\u00A9j\u00C3\u00A0"); // déjà

This is easy; however, it would be best to find the root cause: the location where someone is copying UTF-8 code units into 16-bit code units. The likely culprit is somebody converting bytes into a C# string using the wrong encoding, e.g. Encoding.Default.GetString(utf8Bytes, 0, utf8Bytes.Length).

Alternatively, if you're sure you know the incorrect encoding that was used to produce the string, and that incorrect encoding transformation was lossless (usually the case if the incorrect encoding is a single-byte encoding), then you can simply do the inverse encoding step to get the original UTF-8 data, and then do the correct conversion from UTF-8 bytes:

    public static string UndoEncodingMistake(string mangledString, Encoding mistake, Encoding correction)
    {
        // the inverse of mistake.GetString(originalBytes)
        byte[] originalBytes = mistake.GetBytes(mangledString);
        return correction.GetString(originalBytes);
    }

    UndoEncodingMistake("d\u00C3\u00A9j\u00C3\u00A0", Encoding.GetEncoding(1252), Encoding.UTF8);
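To make the "inverse step" idea concrete, here is a minimal round-trip sketch that first reproduces the mistake and then undoes it. It uses ISO-8859-1 (code page 28591) as the wrong encoding, on the assumption that the mangled characters all lie in the range U+0000 to U+00FF; for this particular string it gives the same result as code page 1252. `Mangle` and `Unmangle` are hypothetical helper names, not taken from the answer.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Reproduces the bug: UTF-8 bytes decoded with the wrong encoding.
    public static string Mangle(string original, Encoding mistake)
    {
        return mistake.GetString(Encoding.UTF8.GetBytes(original));
    }

    // Undoes it: re-encode with the wrong encoding, then decode as UTF-8.
    public static string Unmangle(string mangled, Encoding mistake)
    {
        return Encoding.UTF8.GetString(mistake.GetBytes(mangled));
    }

    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding(28591); // ISO-8859-1

        string mangled = Mangle("d\u00E9j\u00E0", latin1);
        Console.WriteLine(mangled == "d\u00C3\u00A9j\u00C3\u00A0");       // True
        Console.WriteLine(Unmangle(mangled, latin1) == "d\u00E9j\u00E0"); // True
    }
}
```

Because ISO-8859-1 maps every byte value to the code point with the same number, re-encoding with it is equivalent to the byte-by-byte cast in DecodeFromUtf8, as long as no char exceeds 255.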
