[javascript] Decode UTF-8 with Javascript

I have Javascript in an XHTML web page that is passing UTF-8 encoded strings. It needs to continue to pass the UTF-8 version, as well as decode it. How is it possible to decode a UTF-8 string for display?

<script type="text/javascript">
// <![CDATA[
function updateUser(usernameSent){
    var usernameReceived = usernameSent; // Current value: GrÃ¶ÃŸe (the UTF-8 bytes shown as Latin-1)
    var usernameDecoded = usernameReceived;  // Should decode to: Größe
    var html2id = '';
    html2id += 'Encoded: ' + usernameReceived + '<br />Decoded: ' + usernameDecoded;
    document.getElementById('userId').innerHTML = html2id;
}
// ]]>
</script>

This question is related to: javascript, unicode, utf8-decode, xhtml-transitional

The answers are


This updates @Albert's answer by adding a case for 4-byte sequences (emoji).

function Utf8ArrayToStr(array) {
    var out, i, len, c;
    var char2, char3, char4;

    out = "";
    len = array.length;
    i = 0;
    while(i < len) {
    c = array[i++];
    switch(c >> 4)
    { 
      case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7:
        // 0xxxxxxx
        out += String.fromCharCode(c);
        break;
      case 12: case 13:
        // 110x xxxx   10xx xxxx
        char2 = array[i++];
        out += String.fromCharCode(((c & 0x1F) << 6) | (char2 & 0x3F));
        break;
      case 14:
        // 1110 xxxx  10xx xxxx  10xx xxxx
        char2 = array[i++];
        char3 = array[i++];
        out += String.fromCharCode(((c & 0x0F) << 12) |
                       ((char2 & 0x3F) << 6) |
                       ((char3 & 0x3F) << 0));
        break;
     case 15:
        // 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
        char2 = array[i++];
        char3 = array[i++];
        char4 = array[i++];
        out += String.fromCodePoint(((c & 0x07) << 18) | ((char2 & 0x3F) << 12) | ((char3 & 0x3F) << 6) | (char4 & 0x3F));

        break;
    }
    }

    return out;
}
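
For example, a 4-byte emoji sequence and the question's "Größe" both decode with this function:

Utf8ArrayToStr([0xF0, 0x9F, 0x98, 0x80]);           // "😀" (U+1F600)
Utf8ArrayToStr([71, 114, 195, 182, 195, 159, 101]); // "Größe"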

You can use decodeURI() for this.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/decodeURI

As simple as this:

decodeURI('https://developer.mozilla.org/ru/docs/JavaScript_%D1%88%D0%B5%D0%BB%D0%BB%D1%8B');
// "https://developer.mozilla.org/ru/docs/JavaScript_?????"

Consider using it inside a try/catch block so you don't miss a URIError.

It also has full browser support.
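
A minimal sketch of the try/catch idea, using a hypothetical safeDecodeURI() wrapper:

function safeDecodeURI(value) {
  // Fall back to the raw value if the percent-encoding is malformed.
  try {
    return decodeURI(value);
  } catch (e) {
    if (e instanceof URIError) return value;
    throw e;
  }
}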


This is what I found after a more specific Google search than just UTF-8 encode/decode. So for those looking for a library to convert between encodings, here you go.

https://github.com/inexorabletash/text-encoding

var uint8array = new TextEncoder().encode(str);
var str = new TextDecoder(encoding).decode(uint8array);

Pasted from the repo readme:

All encodings from the Encoding specification are supported:

utf-8 ibm866 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-8-i iso-8859-10 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 koi8-r koi8-u macintosh windows-874 windows-1250 windows-1251 windows-1252 windows-1253 windows-1254 windows-1255 windows-1256 windows-1257 windows-1258 x-mac-cyrillic gb18030 hz-gb-2312 big5 euc-jp iso-2022-jp shift_jis euc-kr replacement utf-16be utf-16le x-user-defined

(Some encodings may be supported under other names, e.g. ascii, iso-8859-1, etc. See Encoding for additional labels for each encoding.)
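
For the question's value, a round trip with this API (polyfilled or native) might look like:

var bytes = new TextEncoder().encode('Größe');       // Uint8Array of UTF-8 bytes
var text  = new TextDecoder('utf-8').decode(bytes);  // "Größe"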


I reckon the easiest way would be to use the built-in JS functions decodeURI() / encodeURI().

function updateUser(usernameSent) {
  var usernameEncoded = usernameSent; // Current value: utf8
  var usernameDecoded = decodeURI(usernameEncoded);  // Decoded
  // do stuff
}

Using my 1.6KB library, you can do

ToString(FromUTF8(Array.from(usernameReceived)))

This is a solution with extensive error reporting.

It takes a UTF-8 encoded byte array (where the byte array is represented as an array of numbers, each an integer between 0 and 255 inclusive) and produces a JavaScript string of Unicode characters.

function getNextByte(value, startByteIndex, startBitsStr, 
                     additional, index) 
{
    if (index >= value.length) {
        var startByte = value[startByteIndex];
        throw new Error("Invalid UTF-8 sequence. Byte " + startByteIndex 
            + " with value " + startByte + " (" + String.fromCharCode(startByte) 
            + "; binary: " + toBinary(startByte)
            + ") starts with " + startBitsStr + " in binary and thus requires " 
            + additional + " bytes after it, but we only have " 
            + (value.length - startByteIndex) + ".");
    }
    var byteValue = value[index];
    checkNextByteFormat(value, startByteIndex, startBitsStr, additional, index);
    return byteValue;
}

function checkNextByteFormat(value, startByteIndex, startBitsStr, 
                             additional, index) 
{
    if ((value[index] & 0xC0) != 0x80) {
        var startByte = value[startByteIndex];
        var wrongByte = value[index];
        throw new Error("Invalid UTF-8 byte sequence. Byte " + startByteIndex 
             + " with value " + startByte + " (" +String.fromCharCode(startByte) 
             + "; binary: " + toBinary(startByte) + ") starts with " 
             + startBitsStr + " in binary and thus requires " + additional 
             + " additional bytes, each of which shouls start with 10 in binary."
             + " However byte " + (index - startByteIndex) 
             + " after it with value " + wrongByte + " (" 
             + String.fromCharCode(wrongByte) + "; binary: " + toBinary(wrongByte)
             +") does not start with 10 in binary.");
    }
}

function fromUtf8 (str) {
        var value = [];
        var destIndex = 0;
        for (var index = 0; index < str.length; index++) {
            var code = str.charCodeAt(index);
            if (code <= 0x7F) {
                value[destIndex++] = code;
            } else if (code <= 0x7FF) {
                value[destIndex++] = ((code >> 6 ) & 0x1F) | 0xC0;
                value[destIndex++] = ((code >> 0 ) & 0x3F) | 0x80;
            } else if (code <= 0xFFFF) {
                value[destIndex++] = ((code >> 12) & 0x0F) | 0xE0;
                value[destIndex++] = ((code >> 6 ) & 0x3F) | 0x80;
                value[destIndex++] = ((code >> 0 ) & 0x3F) | 0x80;
            } else if (code <= 0x1FFFFF) {
                value[destIndex++] = ((code >> 18) & 0x07) | 0xF0;
                value[destIndex++] = ((code >> 12) & 0x3F) | 0x80;
                value[destIndex++] = ((code >> 6 ) & 0x3F) | 0x80;
                value[destIndex++] = ((code >> 0 ) & 0x3F) | 0x80;
            } else if (code <= 0x03FFFFFF) {
                value[destIndex++] = ((code >> 24) & 0x03) | 0xF0;
                value[destIndex++] = ((code >> 18) & 0x3F) | 0x80;
                value[destIndex++] = ((code >> 12) & 0x3F) | 0x80;
                value[destIndex++] = ((code >> 6 ) & 0x3F) | 0x80;
                value[destIndex++] = ((code >> 0 ) & 0x3F) | 0x80;
            } else if (code <= 0x7FFFFFFF) {
                value[destIndex++] = ((code >> 30) & 0x01) | 0xFC;
                value[destIndex++] = ((code >> 24) & 0x3F) | 0x80;
                value[destIndex++] = ((code >> 18) & 0x3F) | 0x80;
                value[destIndex++] = ((code >> 12) & 0x3F) | 0x80;
                value[destIndex++] = ((code >> 6 ) & 0x3F) | 0x80;
                value[destIndex++] = ((code >> 0 ) & 0x3F) | 0x80;
            } else {
                throw new Error("Unsupported Unicode character \"" 
                    + str.charAt(index) + "\" with code " + code + " (binary: " 
                    + toBinary(code) + ") at index " + index
                    + ". Cannot represent it as UTF-8 byte sequence.");
            }
        }
        return value;
}
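
Note that these snippets call a toBinary() helper that is not included in the answer; a minimal version, assuming it just renders a byte as an 8-digit binary string, could be:

function toBinary(byteValue) {
    // Assumed helper: format a byte as an 8-digit binary string, e.g. 0xC3 -> "11000011".
    var bits = byteValue.toString(2);
    while (bits.length < 8) bits = "0" + bits;
    return bits;
}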

I searched for a simple solution and this works well for me:

//input data
view = new Uint8Array(data);

//output string
serialString = ua2text(view);

//convert UTF8 to string
function ua2text(ua) {
    s = "";
    for (var i = 0; i < ua.length; i++) {
        s += String.fromCharCode(ua[i]);
    }
    return s;               
}

The only issue I have is that sometimes I get one character at a time. This might be by design with my source of the ArrayBuffer. I'm using https://github.com/xseignard/cordovarduino to read serial data on an Android device.


Here is a solution handling all Unicode code points, including upper (4-byte) values, and supported by all modern browsers (IE and others > 5.5). It uses decodeURIComponent(), but NOT the deprecated escape/unescape functions:

function utf8_to_str(a) {
    for(var i=0, s=''; i<a.length; i++) {
        var h = a[i].toString(16)
        if(h.length < 2) h = '0' + h
        s += '%' + h
    }
    return decodeURIComponent(s)
}

Tested and available on GitHub

To create UTF-8 from a string:

function utf8_from_str(s) {
    for(var i=0, enc = encodeURIComponent(s), a = []; i < enc.length;) {
        if(enc[i] === '%') {
            a.push(parseInt(enc.substr(i+1, 2), 16))
            i += 3
        } else {
            a.push(enc.charCodeAt(i++))
        }
    }
    return a
}

Tested and available on GitHub
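
For example, round-tripping the question's value:

var bytes = utf8_from_str('Größe'); // [71, 114, 195, 182, 195, 159, 101]
utf8_to_str(bytes);                 // "Größe"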


// String to Utf8 ByteBuffer

function strToUTF8(str){
  return Uint8Array.from(encodeURIComponent(str).replace(/%(..)/g,(m,v)=>{return String.fromCodePoint(parseInt(v,16))}), c=>c.codePointAt(0))
}

// Utf8 ByteArray to string

function UTF8toStr(ba){
  return decodeURIComponent(ba.reduce((p,c)=>{return p+'%'+c.toString(16).padStart(2,'0')},''))
}
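
For example, a round trip with these two helpers:

var buf = strToUTF8('Größe');  // Uint8Array [71, 114, 195, 182, 195, 159, 101]
UTF8toStr(buf);                // "Größe"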

To answer the original question: here is how you decode utf-8 in javascript:

http://ecmanaut.blogspot.ca/2006/07/encoding-decoding-utf8-in-javascript.html

Specifically,

function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}

function decode_utf8(s) {
  return decodeURIComponent(escape(s));
}
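
For example, given the raw UTF-8 bytes of the question's value held in a "binary" string (one character per byte):

var raw = String.fromCharCode(0x47, 0x72, 0xC3, 0xB6, 0xC3, 0x9F, 0x65); // UTF-8 bytes of "Größe"
decode_utf8(raw);      // "Größe"
encode_utf8('Größe');  // the same byte-per-character string back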

We have been using this in our production code for 6 years, and it has worked flawlessly.

Note, however, that escape() and unescape() are deprecated. See this.


Perhaps using TextDecoder will be sufficient.

Not supported in IE though.

var decoder = new TextDecoder('utf-8'),
    decodedMessage;

decodedMessage = decoder.decode(message.data);

Handling non-UTF8 text

In this example, we decode the Russian text "Привет, мир!", which means "Hello, world." In our TextDecoder() constructor, we specify the Windows-1251 character encoding, which is appropriate for Cyrillic script.

    let win1251decoder = new TextDecoder('windows-1251');
    let bytes = new Uint8Array([207, 240, 232, 226, 229, 242, 44, 32, 236, 232, 240, 33]);
    console.log(win1251decoder.decode(bytes)); // Привет, мир!

The interface for the TextDecoder is described here.

Retrieving a byte array from a string is equally simple:

const decoder = new TextDecoder();
const encoder = new TextEncoder();

const byteArray = encoder.encode('Größe');
// converted it to a byte array

// now we can decode it back to a string if desired
console.log(decoder.decode(byteArray));

If you have the text in a different encoding, you must compensate for that when decoding: the parameter in the constructor for the TextDecoder can be any one of the valid encoding labels listed here (TextEncoder itself always produces UTF-8).
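
For instance, passing a different label to TextDecoder (illustrative only, not tied to the question's data):

// Decode three Latin-1/Windows-1252 bytes
const latin1 = new TextDecoder('iso-8859-1');
console.log(latin1.decode(new Uint8Array([0xC4, 0xD6, 0xDC]))); // "ÄÖÜ"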


This should work:

// http://www.onicos.com/staff/iz/amuse/javascript/expert/utf.txt

/* utf.js - UTF-8 <=> UTF-16 conversion
 *
 * Copyright (C) 1999 Masanao Izumo <[email protected]>
 * Version: 1.0
 * LastModified: Dec 25 1999
 * This library is free.  You can redistribute it and/or modify it.
 */

function Utf8ArrayToStr(array) {
    var out, i, len, c;
    var char2, char3;

    out = "";
    len = array.length;
    i = 0;
    while(i < len) {
    c = array[i++];
    switch(c >> 4)
    { 
      case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7:
        // 0xxxxxxx
        out += String.fromCharCode(c);
        break;
      case 12: case 13:
        // 110x xxxx   10xx xxxx
        char2 = array[i++];
        out += String.fromCharCode(((c & 0x1F) << 6) | (char2 & 0x3F));
        break;
      case 14:
        // 1110 xxxx  10xx xxxx  10xx xxxx
        char2 = array[i++];
        char3 = array[i++];
        out += String.fromCharCode(((c & 0x0F) << 12) |
                       ((char2 & 0x3F) << 6) |
                       ((char3 & 0x3F) << 0));
        break;
    }
    }

    return out;
}
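
For example, for the question's value:

Utf8ArrayToStr([71, 114, 195, 182, 195, 159, 101]); // "Größe"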

Check out the JSFiddle demo.

Also see the related questions: here and here


@Albert's solution was the closest, I think, but it can only parse up to 3-byte UTF-8 characters.

function utf8ArrayToStr(array) {
  var out, i, len, c;
  var char2, char3;

  out = "";
  len = array.length;
  i = 0;

  // XXX: Invalid bytes are ignored
  while(i < len) {
    c = array[i++];
    if (c >> 7 == 0) {
      // 0xxx xxxx
      out += String.fromCharCode(c);
      continue;
    }

    // Invalid starting byte
    if (c >> 6 == 0x02) {
      continue;
    }

    // #### MULTIBYTE ####
    // How many bytes are left for this character?
    var extraLength = null;
    if (c >> 5 == 0x06) {
      extraLength = 1;
    } else if (c >> 4 == 0x0e) {
      extraLength = 2;
    } else if (c >> 3 == 0x1e) {
      extraLength = 3;
    } else if (c >> 2 == 0x3e) {
      extraLength = 4;
    } else if (c >> 1 == 0x7e) {
      extraLength = 5;
    } else {
      continue;
    }

    // Do we have enough bytes in our data?
    if (i+extraLength > len) {
      var leftovers = array.slice(i-1);

      // If there is an invalid byte in the leftovers we might want to
      // continue from there.
      for (; i < len; i++) if (array[i] >> 6 != 0x02) break;
      if (i != len) continue;

      // All leftover bytes are valid.
      return {result: out, leftovers: leftovers};
    }
    // Remove the UTF-8 prefix from the char (res)
    var mask = (1 << (8 - extraLength - 1)) - 1,
        res = c & mask, nextChar, count;

    for (count = 0; count < extraLength; count++) {
      nextChar = array[i++];

      // Is the char valid multibyte part?
      if (nextChar >> 6 != 0x02) {break;};
      res = (res << 6) | (nextChar & 0x3f);
    }

    if (count != extraLength) {
      i--;
      continue;
    }

    if (res <= 0xffff) {
      out += String.fromCharCode(res);
      continue;
    }

    res -= 0x10000;
    var high = ((res >> 10) & 0x3ff) + 0xd800,
        low = (res & 0x3ff) + 0xdc00;
    out += String.fromCharCode(high, low);
  }

  return {result: out, leftovers: []};
}

This returns {result: "parsed string", leftovers: [bytes of an incomplete sequence at the end]} in case you are parsing the string in chunks.
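
For example (hypothetical chunk boundaries), a multi-byte character split across two chunks:

var part1 = utf8ArrayToStr([72, 105, 32, 0xC3]);            // "ö" is split after its first byte
// part1 -> {result: "Hi ", leftovers: [0xC3]}
var part2 = utf8ArrayToStr(part1.leftovers.concat([0xB6])); // prepend leftovers to the next chunk
// part2 -> {result: "ö", leftovers: []}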

EDIT: fixed the issue that @unhammer found.