Using JavaScript's atob to decode base64 doesn't properly decode UTF-8 strings

Question

I'm using the JavaScript window.atob() function to decode a base64-encoded string (specifically the base64-encoded content from the GitHub API). Problem is I'm getting ASCII-encoded characters back (like â¢ instead of ™). How can I properly handle the incoming base64-encoded stream so that it's decoded as UTF-8?
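For context, here is a small sketch of what goes wrong. atob() returns a "binary string" in which every character stands for one decoded byte, so a multi-byte UTF-8 sequence comes back as several separate Latin-1 characters. The sample input is an assumption chosen for illustration: it is the base64 of "✓ à la mode" encoded as UTF-8.

```javascript
// atob() maps each decoded byte to one character (code units 0x00-0xFF).
const binary = atob('4pyTIMOgIGxhIG1vZGU=');

// The check mark (U+2713) is three bytes in UTF-8: E2 9C 93.
const firstThree = Array.from(binary.slice(0, 3), c => c.charCodeAt(0).toString(16));
console.log(firstThree);    // [ 'e2', '9c', '93' ]

// 14 bytes come back, although the original text has only 11 characters.
console.log(binary.length); // 14
```

The extra step every answer below performs, in one form or another, is to reinterpret those bytes as UTF-8.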

User · Answer

Here is the 2018 updated solution, as described in the Mozilla Developer Network (MDN) documentation:

TO ENCODE FROM UNICODE TO B64

function b64EncodeUnicode(str) {
    // first we use encodeURIComponent to get percent-encoded UTF-8,
    // then we convert the percent encodings into raw bytes which
    // can be fed into btoa.
    return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g,
        function toSolidBytes(match, p1) {
            return String.fromCharCode('0x' + p1);
    }));
}

b64EncodeUnicode('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
b64EncodeUnicode('\n'); // "Cg=="

TO DECODE FROM B64 TO UNICODE

function b64DecodeUnicode(str) {
    // Going backwards: from bytestream, to percent-encoding, to original string.
    return decodeURIComponent(atob(str).split('').map(function(c) {
        return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2);
    }).join(''));
}

b64DecodeUnicode('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"
b64DecodeUnicode('Cg=='); // "\n"
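
As a quick sanity check, the two functions can be run against each other. This is only a sketch: the functions are restated (with parseInt in place of the '0x' string trick) so the block is self-contained, and the sample strings are assumptions picked to cover ASCII, accented Latin, CJK, and an astral-plane emoji.

```javascript
function b64EncodeUnicode(str) {
    // Percent-encode as UTF-8, then turn each %XX escape into one raw byte character.
    return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g,
        (match, p1) => String.fromCharCode(parseInt(p1, 16))));
}

function b64DecodeUnicode(str) {
    // Percent-encode every byte, then let decodeURIComponent read them as UTF-8.
    return decodeURIComponent(atob(str).split('').map(
        c => '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2)).join(''));
}

// Hypothetical sample inputs.
const samples = ['plain ascii', '\u2713 \u00e0 la mode', '\u4f60\u597d', '\u{1F600}'];
const roundTripOk = samples.every(s => b64DecodeUnicode(b64EncodeUnicode(s)) === s);
console.log(roundTripOk); // true
```

One limitation worth knowing: encodeURIComponent throws a URIError for strings that contain a lone surrogate, so such input has to be sanitized first.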

User · Answer

In addition to the above solution, if you are still facing the issue, try the following. It covers the case where escape is not supported in TypeScript.

blob = new Blob(["\ufeff", csv_content]); // the BOM makes the symbols appear correctly in Excel

For csv_content you can try the following:

function b64DecodeUnicode(str: any) {
    return decodeURIComponent(atob(str).split('').map((c: any) => {
        return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2);
    }).join(''));
}
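
To illustrate what the "\ufeff" prefix does, here is a minimal sketch that uses TextEncoder instead of Blob so it can run outside a browser. The CSV text is a made-up example.

```javascript
const csvContent = 'name,price\ncaf\u00e9,3\u20ac\n'; // hypothetical CSV text
const bytes = new TextEncoder().encode('\ufeff' + csvContent);

// The BOM becomes the three bytes EF BB BF at the start of the file,
// which is the hint spreadsheet software uses to pick UTF-8.
const bomHex = Array.from(bytes.slice(0, 3), b => b.toString(16)).join(' ');
console.log(bomHex); // "ef bb bf"

// TextDecoder drops a leading BOM by default, so the text itself round-trips.
const roundTrips = new TextDecoder().decode(bytes) === csvContent;
console.log(roundTrips); // true
```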

User · Answer

Here's some future-proof code for browsers that may lack escape/unescape(). Note that IE 9 and older don't support atob/btoa(), so you'd need to use custom base64 functions for them.

// Polyfill for escape/unescape
if( !window.unescape ){
    window.unescape = function( s ){
        return s.replace( /%([0-9A-F]{2})/g, function( m, p ) {
            return String.fromCharCode( '0x' + p );
        } );
    };
}
if( !window.escape ){
    window.escape = function( s ){
        var chr, hex, i = 0, l = s.length, out = '';
        for( ; i < l; i ++ ){
            chr = s.charAt( i );
            if( chr.search( /[A-Za-z0-9\@\*\_\+\-\.\/]/ ) > -1 ){
                out += chr; continue; }
            hex = s.charCodeAt( i ).toString( 16 );
            out += '%' + ( hex.length % 2 != 0 ? '0' : '' ) + hex;
        }
        return out;
    };
}

// Base64 encoding of UTF-8 strings
var utf8ToB64 = function( s ){
    return btoa( unescape( encodeURIComponent( s ) ) );
};
var b64ToUtf8 = function( s ){
    return decodeURIComponent( escape( atob( s ) ) );
};

A more comprehensive example for UTF-8 encoding and decoding can be found here: http://jsfiddle.net/47zwb41o/

User · Answer

The complete article that works for me: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Base64_encoding_and_decoding

The part where we encode from Unicode/UTF-8 is:

function utf8_to_b64( str ) {
    return window.btoa(unescape(encodeURIComponent( str )));
}

function b64_to_utf8( str ) {
    return decodeURIComponent(escape(window.atob( str )));
}

// Usage:
utf8_to_b64('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
b64_to_utf8('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"

This is one of the most used methods nowadays.
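
The reason this trick works can be seen by running the encoding side one step at a time. The sketch below is an illustration, not MDN's code: it replaces unescape with an explicit regex so that the intermediate value is visible, and the input character is an arbitrary choice.

```javascript
const input = '\u2713'; // the check mark, U+2713

// Step 1: percent-encode the UTF-8 bytes of the string.
const percentEncoded = encodeURIComponent(input);
console.log(percentEncoded); // "%E2%9C%93"

// Step 2: turn each %XX escape into the character with that code (one char per byte).
// This is the part that unescape() performs in the functions above.
const binary = percentEncoded.replace(/%([0-9A-F]{2})/g,
    (match, hex) => String.fromCharCode(parseInt(hex, 16)));
console.log(binary.length); // 3

// Step 3: btoa() now only sees code units in the range 0x00-0xFF.
const base64 = btoa(binary);
console.log(base64); // "4pyT"
```

Decoding runs the same three steps in reverse.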

User · Answer

I would assume that one might want a solution that produces a widely usable base64 URI. Please visit data:text/plain;charset=utf-8;base64,4pi44pi54pi64pi74pi84pi+4pi/ to see a demonstration (copy the data URI, open a new tab, paste the data URI into the address bar, then press enter to go to the page). Despite the fact that this URI is base64-encoded, the browser is still able to recognize the high code points and decode them properly.

The minified encoder and decoder together come to 1058 bytes (589 bytes gzipped). The source code used to generate them is as follows.

var fromCharCode = String.fromCharCode;
var btoaUTF8 = (function(btoa, replacer){"use strict";
    return function(inputString, BOMit){
        return btoa((BOMit ? "\xEF\xBB\xBF" : "") + inputString.replace(
            /[\x80-\uD7ff\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]?/g, replacer
        ));
    }
})(btoa, function(nonAsciiChars){"use strict";
    // make the UTF string into a binary UTF-8 encoded string
    var point = nonAsciiChars.charCodeAt(0);
    if (point >= 0xD800 && point <= 0xDBFF) {
        var nextcode = nonAsciiChars.charCodeAt(1);
        if (nextcode !== nextcode) // NaN because string is 1 code point long
            return fromCharCode(0xef/*11101111*/, 0xbf/*10111111*/, 0xbd/*10111101*/);
        // https://mathiasbynens.be/notes/javascript-encoding#surrogate-formulae
        if (nextcode >= 0xDC00 && nextcode <= 0xDFFF) {
            point = (point - 0xD800) * 0x400 + nextcode - 0xDC00 + 0x10000;
            if (point > 0xffff)
                return fromCharCode(
                    (0x1e/*0b11110*/<<3) | (point>>>18),
                    (0x2/*0b10*/<<6) | ((point>>>12)&0x3f/*0b00111111*/),
                    (0x2/*0b10*/<<6) | ((point>>>6)&0x3f/*0b00111111*/),
                    (0x2/*0b10*/<<6) | (point&0x3f/*0b00111111*/)
                );
        } else return fromCharCode(0xef, 0xbf, 0xbd);
    }
    if (point <= 0x007f) return nonAsciiChars;
    else if (point <= 0x07ff) {
        return fromCharCode((0x6<<5)|(point>>>6), (0x2<<6)|(point&0x3f));
    } else return fromCharCode(
        (0xe/*0b1110*/<<4) | (point>>>12),
        (0x2/*0b10*/<<6) | ((point>>>6)&0x3f/*0b00111111*/),
        (0x2/*0b10*/<<6) | (point&0x3f/*0b00111111*/)
    );
});

Then, to decode the base64 data, either HTTP GET the data as a data URI or use the function below.

var clz32 = Math.clz32 || (function(log, LN2){"use strict";
    return function(x) {return 31 - log(x >>> 0) / LN2 | 0};
})(Math.log, Math.LN2);
var fromCharCode = String.fromCharCode;
var atobUTF8 = (function(atob, replacer){"use strict";
    return function(inputString, keepBOM){
        inputString = atob(inputString);
        if (!keepBOM && inputString.substring(0,3) === "\xEF\xBB\xBF")
            inputString = inputString.substring(3); // eradicate UTF-8 BOM
        // 0xc0 => 0b11000000; 0xff => 0b11111111; 0xc0-0xff => 0b11xxxxxx
        // 0x80 => 0b10000000; 0xbf => 0b10111111; 0x80-0xbf => 0b10xxxxxx
        return inputString.replace(/[\xc0-\xff][\x80-\xbf]*/g, replacer);
    }
})(atob, function(encoded){"use strict";
    var codePoint = encoded.charCodeAt(0) << 24;
    var leadingOnes = clz32(~codePoint);
    var endPos = 0, stringLen = encoded.length;
    var result = "";
    if (leadingOnes < 5 && stringLen >= leadingOnes) {
        codePoint = (codePoint<<leadingOnes)>>>(24+leadingOnes);
        for (endPos = 1; endPos < leadingOnes; ++endPos)
            codePoint = (codePoint<<6) | (encoded.charCodeAt(endPos)&0x3f/*0b00111111*/);
        if (codePoint <= 0xFFFF) { // BMP code point
            result += fromCharCode(codePoint);
        } else if (codePoint <= 0x10FFFF) {
            // https://mathiasbynens.be/notes/javascript-encoding#surrogate-formulae
            codePoint -= 0x10000;
            result += fromCharCode(
                (codePoint >> 10) + 0xD800,  // highSurrogate
                (codePoint & 0x3ff) + 0xDC00 // lowSurrogate
            );
        } else endPos = 0; // to fill it in with INVALIDs
    }
    for (; endPos < stringLen; ++endPos) result += "\ufffd"; // replacement character
    return result;
});

The advantage of being more standard is that this encoder and this decoder are more widely applicable, because their output can be used in a valid URL that displays correctly.

In addition to being very standardized, the above code snippets are also very fast. Instead of an indirect chain of succession where the data has to be converted several times between various forms (such as in Riccardo Galli's response), the above code is as direct as performantly possible. It uses only one simple, fast String.prototype.replace call to process the data when encoding, and only one to decode the data when decoding. Another plus is that (especially for big strings) String.prototype.replace allows the browser to automatically handle the underlying memory management of resizing the string, leading to a significant performance boost, especially in evergreen browsers like Chrome and Firefox that heavily optimize String.prototype.replace. Finally, the icing on the cake is that, for users of exclusively Latin script, strings which don't contain any code points above 0x7f are extra fast to process because the string remains unmodified by the replacement algorithm.

I have created a GitHub repository for this solution at https://github.com/anonyco/BestBase64EncoderDecoder/
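
For readers who only need the data URI idea, here is a much shorter sketch of it. It is not this answer's implementation: the helper name toDataUri is hypothetical, and it assumes TextEncoder is available in place of the hand-written UTF-8 encoder.

```javascript
// Hypothetical helper: wraps text in a base64 data URI with an explicit charset.
function toDataUri(text) {
    const bytes = new TextEncoder().encode(text); // UTF-8 bytes
    let binary = '';
    for (const byte of bytes) binary += String.fromCharCode(byte);
    return 'data:text/plain;charset=utf-8;base64,' + btoa(binary);
}

const uri = toDataUri('\u2638\u2639\u263a'); // three symbols from the Miscellaneous Symbols block
console.log(uri);
// data:text/plain;charset=utf-8;base64,4pi44pi54pi6
```

Because the payload is plain base64 of UTF-8 bytes and the charset is declared, a browser can render the URI without any custom decoding step.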

User · Answer

If treating strings as bytes is more your thing, you can use the following functions:

function u_atob(ascii) {
    return Uint8Array.from(atob(ascii), c => c.charCodeAt(0));
}

function u_btoa(buffer) {
    var binary = [];
    var bytes = new Uint8Array(buffer);
    for (var i = 0, il = bytes.byteLength; i < il; i++) {
        binary.push(String.fromCharCode(bytes[i]));
    }
    return btoa(binary.join(''));
}

// example, it works also with astral plane characters such as '𝒞'
var encodedString = new TextEncoder().encode('✓');
var base64String = u_btoa(encodedString);
console.log('✓' === new TextDecoder().decode(u_atob(base64String)))
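
A side benefit of working on bytes is that the payload does not have to be text at all. The sketch below restates the two helpers in compact form so that it runs on its own, and feeds them a made-up three-byte buffer that is not valid UTF-8.

```javascript
function u_btoa(buffer) {
    const bytes = new Uint8Array(buffer);
    let binary = '';
    for (let i = 0; i < bytes.byteLength; i++) binary += String.fromCharCode(bytes[i]);
    return btoa(binary);
}

function u_atob(ascii) {
    return Uint8Array.from(atob(ascii), c => c.charCodeAt(0));
}

const raw = new Uint8Array([0x00, 0xff, 0x80]); // arbitrary bytes, not valid UTF-8
const encoded = u_btoa(raw);
console.log(encoded);                       // "AP+A"
const decoded = Array.from(u_atob(encoded));
console.log(decoded);                       // [ 0, 255, 128 ]
```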

User · Answer

Decoding base64 to UTF8 String

Below is the current most-voted answer, by @brandonscript:

function b64DecodeUnicode(str) {
    // Going backwards: from bytestream, to percent-encoding, to original string.
    return decodeURIComponent(atob(str).split('').map(function(c) {
        return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2);
    }).join(''));
}

The above code works, but it's very slow. If your input is a very large base64 string, for example 30,000 characters for a base64-encoded HTML document, it will need lots of computation.

Here is my answer: use the built-in TextDecoder, which is nearly 10x faster than the above code for large input.

function decodeBase64(base64) {
    const text = atob(base64);
    const length = text.length;
    const bytes = new Uint8Array(length);
    for (let i = 0; i < length; i++) {
        bytes[i] = text.charCodeAt(i);
    }
    const decoder = new TextDecoder(); // default is utf-8
    return decoder.decode(bytes);
}
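
By default TextDecoder silently replaces malformed byte sequences with U+FFFD. If rejecting bad input is preferable, the standard fatal option can be used. The following is a sketch of that variant; the function name decodeBase64Strict is hypothetical.

```javascript
// Hypothetical strict variant: throws instead of emitting U+FFFD.
function decodeBase64Strict(base64) {
    const bytes = Uint8Array.from(atob(base64), c => c.charCodeAt(0));
    // fatal: true makes decode() throw a TypeError on malformed UTF-8.
    return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
}

const ok = decodeBase64Strict('4pyT');
console.log(ok); // "✓"

let rejected = false;
try {
    decodeBase64Strict('/w=='); // decodes to the single byte 0xFF, never valid in UTF-8
} catch (err) {
    rejected = true;
}
console.log(rejected); // true
```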

User · Answer

There's a great article on Mozilla's MDN docs that describes exactly this issue:

The "Unicode Problem"

Since DOMStrings are 16-bit-encoded strings, in most browsers calling window.btoa on a Unicode string will cause a Character Out Of Range exception if a character exceeds the range of an 8-bit byte (0x00~0xFF). There are two possible methods to solve this problem: the first one is to escape the whole string (with UTF-8, see encodeURIComponent) and then encode it; the second one is to convert the UTF-16 DOMString to a UTF-8 array of characters and then encode it.

A note on previous solutions: the MDN article originally suggested using unescape and escape to solve the Character Out Of Range exception problem, but they have since been deprecated. Some other answers here have suggested working around this with decodeURIComponent and encodeURIComponent; this has proven to be unreliable and unpredictable. The most recent update to this answer uses modern JavaScript functions to improve speed and modernize code.

If you're trying to save yourself some time, you could also consider using a library: js-base64 (NPM, great for Node.js) or base64-js.

Encoding UTF8 ⇢ base64

function b64EncodeUnicode(str) {
    // first we use encodeURIComponent to get percent-encoded UTF-8,
    // then we convert the percent encodings into raw bytes which
    // can be fed into btoa.
    return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g,
        function toSolidBytes(match, p1) {
            return String.fromCharCode('0x' + p1);
    }));
}

b64EncodeUnicode('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
b64EncodeUnicode('\n'); // "Cg=="

Decoding base64 ⇢ UTF8

function b64DecodeUnicode(str) {
    // Going backwards: from bytestream, to percent-encoding, to original string.
    return decodeURIComponent(atob(str).split('').map(function(c) {
        return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2);
    }).join(''));
}

b64DecodeUnicode('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"
b64DecodeUnicode('Cg=='); // "\n"

The pre-2018 solution (functional, and though likely better supported in older browsers, not up to date)

Here is the current recommendation, direct from MDN, with some additional TypeScript compatibility via @MA-Maddin:

// Encoding UTF8 ⇢ base64

function b64EncodeUnicode(str) {
    return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function(match, p1) {
        return String.fromCharCode(parseInt(p1, 16))
    }))
}

b64EncodeUnicode('✓ à la mode') // "4pyTIMOgIGxhIG1vZGU="
b64EncodeUnicode('\n') // "Cg=="

// Decoding base64 ⇢ UTF8

function b64DecodeUnicode(str) {
    return decodeURIComponent(Array.prototype.map.call(atob(str), function(c) {
        return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2)
    }).join(''))
}

b64DecodeUnicode('4pyTIMOgIGxhIG1vZGU=') // "✓ à la mode"
b64DecodeUnicode('Cg==') // "\n"

The original solution (deprecated)

This used escape and unescape (which are now deprecated, though this still works in all modern browsers):

function utf8_to_b64( str ) {
    return window.btoa(unescape(encodeURIComponent( str )));
}

function b64_to_utf8( str ) {
    return decodeURIComponent(escape(window.atob( str )));
}

// Usage:
utf8_to_b64('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
b64_to_utf8('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"

And one last thing: I first encountered this problem when calling the GitHub API. To get this to work on (Mobile) Safari properly, I actually had to strip all white space from the base64 source before I could even decode the source. Whether or not this is still relevant in 2017, I don't know:

function b64_to_utf8( str ) {
    str = str.replace(/\s/g, '');
    return decodeURIComponent(escape(window.atob( str )));
}
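
The whitespace stripping from the last snippet combines naturally with a byte-based decode. Below is a minimal sketch under two assumptions: the payload is base64 text wrapped with line breaks, and TextDecoder is available. The helper name and the sample payload are made up for illustration.

```javascript
// Hypothetical helper for a newline-wrapped base64 payload.
function decodeWrappedBase64(content) {
    const compact = content.replace(/\s/g, ''); // drop line breaks and spaces
    const bytes = Uint8Array.from(atob(compact), c => c.charCodeAt(0));
    return new TextDecoder().decode(bytes);     // interpret the bytes as UTF-8
}

const wrapped = '4pyTIMOg\nIGxhIG1v\nZGU=\n';   // sample payload split across lines
const text = decodeWrappedBase64(wrapped);
console.log(text); // "✓ à la mode"
```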

User · Answer

Things change. The escape/unescape methods have been deprecated.

You can URI-encode the string before you Base64-encode it. Note that this doesn't produce Base64-encoded UTF8, but rather Base64-encoded URL-encoded data. Both sides must agree on the same encoding.

See working example here: http://codepen.io/anon/pen/PZgbPW

// encode string
var base64 = window.btoa(encodeURIComponent('€ 你好 æøåÆØÅ'));
// decode string
var str = decodeURIComponent(window.atob(base64));
// str is now === '€ 你好 æøåÆØÅ'

For OP's problem, a third-party library such as js-base64 should solve the problem.
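
The difference between the two encodings is easy to see side by side. This sketch uses a single check mark as an arbitrary sample and assumes TextEncoder is available for the UTF-8 comparison.

```javascript
const text = '\u2713'; // check mark, U+2713

// Base64 of the percent-encoded text (the approach described above).
const viaUri = btoa(encodeURIComponent(text));

// Base64 of the raw UTF-8 bytes, for comparison.
const utf8Bytes = new TextEncoder().encode(text);
const viaUtf8 = btoa(String.fromCharCode(...utf8Bytes));

console.log(viaUri);  // "JUUyJTlDJTkz"
console.log(viaUtf8); // "4pyT"

// The URI variant still round-trips, as long as the reader reverses both steps.
const back = decodeURIComponent(atob(viaUri));
console.log(back === text); // true
```

The URI variant is noticeably longer for non-ASCII text and cannot be read by a consumer that expects plain Base64-encoded UTF-8, such as the GitHub API content in the question.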

User · Answer

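Small correction: unescape and escape are deprecated, so:

function utf8_to_b64( str ) {
    return window.btoa(decodeURIComponent(encodeURIComponent(str)));
}

function b64_to_utf8( str ) {
    return decodeURIComponent(encodeURIComponent(window.atob(str)));
}

function b64_to_utf8( str ) {
    str = str.replace(/\s/g, '');
    return decodeURIComponent(encodeURIComponent(window.atob(str)));
}

Caveat: decodeURIComponent(encodeURIComponent(str)) returns str unchanged, so these functions behave like plain btoa and atob. They still throw on characters outside the 0x00-0xFF range when encoding, and they do not reassemble UTF-8 when decoding.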
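
A replacement that avoids escape and unescape needs something that actually converts between UTF-16 and UTF-8 bytes. The sketch below is one option, assuming TextEncoder and TextDecoder are available; the function names are hypothetical, and the first line shows why the encodeURIComponent/decodeURIComponent pair alone is not enough.

```javascript
const sample = '\u2713 \u00e0 la mode';

// The two URI functions cancel each other out, so they cannot stand in for unescape().
const unchanged = decodeURIComponent(encodeURIComponent(sample)) === sample;
console.log(unchanged); // true

// Hypothetical names; TextEncoder/TextDecoder do the UTF-8 conversion.
function utf8ToB64(str) {
    const bytes = new TextEncoder().encode(str);
    let binary = '';
    for (const byte of bytes) binary += String.fromCharCode(byte);
    return btoa(binary);
}

function b64ToUtf8(b64) {
    const bytes = Uint8Array.from(atob(b64.replace(/\s/g, '')), c => c.charCodeAt(0));
    return new TextDecoder().decode(bytes);
}

console.log(utf8ToB64(sample));                            // "4pyTIMOgIGxhIG1vZGU="
console.log(b64ToUtf8('4pyTIMOgIGxhIG1vZGU=') === sample); // true
```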
