[javascript] How do I decode a string with escaped unicode?

Using JSON.decode for this comes with significant drawbacks that you must be aware of:

  • You must wrap the string in double quotes
  • Many characters are not supported and must be escaped themselves. For example, passing any of the following to JSON.decode (after wrapping them in double quotes) will error even though these are all valid: \\n, \n, \\0, a"a
  • It does not support hexadecimal escapes: \\x45
  • It does not support Unicode code point sequences: \\u{045}

There are other caveats as well. Essentially, using JSON.decode for this purpose is a hack and doesn't work the way you might always expect. You should stick with using the JSON library to handle JSON, not for string operations.


I recently ran into this issue myself and wanted a robust decoder, so I ended up writing one myself. It's complete and thoroughly tested and is available here: https://github.com/iansan5653/unraw. It mimics the JavaScript standard as closely as possible.

Explanation:

The source is about 250 lines so I won't include it all here, but essentially it uses the following Regex to find all escape sequences and then parses them using parseInt(string, 16) to decode the base-16 numbers and then String.fromCodePoint(number) to get the corresponding character:

/\\(?:(\\)|x([\s\S]{0,2})|u(\{[^}]*\}?)|u([\s\S]{4})\\u([^{][\s\S]{0,3})|u([\s\S]{0,4})|([0-3]?[0-7]{1,2})|([\s\S])|$)/g

Commented (NOTE: This regex matches all escape sequences, including invalid ones. If the string would throw an error in JS, it throws an error in my library [ie, '\x!!' will error]):

/
\\ # All escape sequences start with a backslash
(?: # Starts a group of 'or' statements
(\\) # If a second backslash is encountered, stop there (it's an escaped slash)
| # or
x([\s\S]{0,2}) # Match valid hexadecimal sequences
| # or
u(\{[^}]*\}?) # Match valid code point sequences
| # or
u([\s\S]{4})\\u([^{][\s\S]{0,3}) # Match surrogate code points which get parsed together
| # or
u([\s\S]{0,4}) # Match non-surrogate Unicode sequences
| # or
([0-3]?[0-7]{1,2}) # Match deprecated octal sequences
| # or
([\s\S]) # Match anything else ('.' doesn't match newlines)
| # or
$ # Match the end of the string
) # End the group of 'or' statements
/g # Match as many instances as there are

Example

Using that library:

import unraw from "unraw";

let step1 = unraw('http\\u00253A\\u00252F\\u00252Fexample.com');
// yields "http%3A%2F%2Fexample.com"
// Then you can use decodeURIComponent to further decode it:
let step2 = decodeURIComponent(step1);
// yields http://example.com