[javascript] Javascript: Unicode string to hex

I'm trying to convert a unicode string to a hexadecimal representation in javascript.

This is what I have:

function convertFromHex(hex) {
    var hex = hex.toString();//force conversion
    var str = '';
    for (var i = 0; i < hex.length; i += 2)
        str += String.fromCharCode(parseInt(hex.substr(i, 2), 16));
    return str;
}

function convertToHex(str) {
    var hex = '';
    for(var i=0;i<str.length;i++) {
        hex += ''+str.charCodeAt(i).toString(16);
    }
    return hex;
}

But if fails on unicode characters, like chinese;

Input: ??

Output: ªo"[W

Any ideas? Can this be done in javascript?

This question is related to javascript unicode utf-8 hex

The answer is


A more up to date solution, for encoding:

// This is the same for all of the below, and
// you probably won't need it except for debugging
// in most cases.
function bytesToHex(bytes) {
  return Array.from(
    bytes,
    byte => byte.toString(16).padStart(2, "0")
  ).join("");
}

// You almost certainly want UTF-8, which is
// now natively supported:
function stringToUTF8Bytes(string) {
  return new TextEncoder().encode(string);
}

// But you might want UTF-16 for some reason.
// .charCodeAt(index) will return the underlying
// UTF-16 code-units (not code-points!), so you
// just need to format them in whichever endian order you want.
function stringToUTF16Bytes(string, littleEndian) {
  const bytes = new Uint8Array(string.length * 2);
  // Using DataView is the only way to get a specific
  // endianness.
  const view = new DataView(bytes.buffer);
  for (let i = 0; i != string.length; i++) {
    view.setUint16(i, string.charCodeAt(i), littleEndian);
  }
  return bytes;
}

// And you might want UTF-32 in even weirder cases.
// Fortunately, iterating a string gives the code
// points, which are identical to the UTF-32 encoding,
// though you still have the endianess issue.
function stringToUTF32Bytes(string, littleEndian) {
  const codepoints = Array.from(string, c => c.codePointAt(0));
  const bytes = new Uint8Array(codepoints.length * 4);
  // Using DataView is the only way to get a specific
  // endianness.
  const view = new DataView(bytes.buffer);
  for (let i = 0; i != codepoints.length; i++) {
    view.setUint32(i, codepoints[i], littleEndian);
  }
  return bytes;
}

Examples:

bytesToHex(stringToUTF8Bytes("hello ?? "))
// "68656c6c6f20e6bca2e5ad9720f09f918d"
bytesToHex(stringToUTF16Bytes("hello ?? ", false))
// "00680065006c006c006f00206f225b570020d83ddc4d"
bytesToHex(stringToUTF16Bytes("hello ?? ", true))
// "680065006c006c006f002000226f575b20003dd84ddc"
bytesToHex(stringToUTF32Bytes("hello ?? ", false))
// "00000068000000650000006c0000006c0000006f0000002000006f2200005b57000000200001f44d"
bytesToHex(stringToUTF32Bytes("hello ?? ", true))
// "68000000650000006c0000006c0000006f00000020000000226f0000575b0000200000004df40100"

For decoding, it's generally a lot simpler, you just need:

function hexToBytes(hex) {
    const bytes = new Uint8Array(hex.length / 2);
    for (let i = 0; i !== bytes.length; i++) {
        bytes[i] = parseInt(hex.substr(i * 2, 2), 16);
    }
    return bytes;
}

then use the encoding parameter of TextDecoder:

// UTF-8 is default
new TextDecoder().decode(hexToBytes("68656c6c6f20e6bca2e5ad9720f09f918d"));
// but you can also use:
new TextDecoder("UTF-16LE").decode(hexToBytes("680065006c006c006f002000226f575b20003dd84ddc"))
new TextDecoder("UTF-16BE").decode(hexToBytes("00680065006c006c006f00206f225b570020d83ddc4d"));
// "hello ?? "

Here's the list of allowed encoding names: https://www.w3.org/TR/encoding/#names-and-labels

You might notice UTF-32 is not on that list, which is a pain, so:

function bytesToStringUTF32(bytes, littleEndian) {
  const view = new DataView(bytes.buffer);
  const codepoints = new Uint32Array(view.byteLength / 4);
  for (let i = 0; i !== codepoints.length; i++) {
    codepoints[i] = view.getUint32(i * 4, littleEndian);
  }
  return String.fromCodePoint(...codepoints);
}

Then:

bytesToStringUTF32(hexToBytes("00000068000000650000006c0000006c0000006f0000002000006f2200005b57000000200001f44d"), false)
bytesToStringUTF32(hexToBytes("68000000650000006c0000006c0000006f00000020000000226f0000575b0000200000004df40100"), true)
// "hello ?? "

Here is a tweak of McDowell's algorithm that doesn't pad the result:

  function toHex(str) {
    var result = '';
    for (var i=0; i<str.length; i++) {
      result += str.charCodeAt(i).toString(16);
    }
    return result;
  }

It depends on what encoding you use. If you want to convert utf-8 encoded hex to string, use this:

function fromHex(hex,str){
  try{
    str = decodeURIComponent(hex.replace(/(..)/g,'%$1'))
  }
  catch(e){
    str = hex
    console.log('invalid hex input: ' + hex)
  }
  return str
}

For the other direction use this:

function toHex(str,hex){
  try{
    hex = unescape(encodeURIComponent(str))
    .split('').map(function(v){
      return v.charCodeAt(0).toString(16)
    }).join('')
  }
  catch(e){
    hex = str
    console.log('invalid text input: ' + str)
  }
  return hex
}

how do you get "\u6f22\u5b57" from ?? in JavaScript?

These are JavaScript Unicode escape sequences e.g. \u12AB. To convert them, you could iterate over every code unit in the string, call .toString(16) on it, and go from there.

However, it is more efficient to also use hexadecimal escape sequences e.g. \xAA in the output wherever possible.

Also note that ASCII symbols such as A, b, and - probably don’t need to be escaped.

I’ve written a small JavaScript library that does all this for you, called jsesc. It has lots of options to control the output.

Here’s an online demo of the tool in action: http://mothereff.in/js-escapes#1%E6%BC%A2%E5%AD%97


Your question was tagged as utf-8. Reading the rest of your question, UTF-8 encoding/decoding didn’t seem to be what you wanted here, but in case you ever need it: use utf8.js (online demo).


Here is my take: these functions convert a UTF8 string to a proper HEX without the extra zeroes padding. A real UTF8 string has characters with 1, 2, 3 and 4 bytes length.

While working on this I found a couple key things that solved my problems:

  1. str.split('') doesn't handle multi-byte characters like emojis correctly. The proper/modern way to handle this is with Array.from(str)
  2. encodeURIComponent() and decodeURIComponent() are great tools to convert between string and hex. They are pretty standard, they handle UTF8 correctly.
  3. (Most) ASCII characters (codes 0 - 127) don't get URI encoded, so they need to handled separately. But c.charCodeAt(0).toString(16) works perfectly for those
    function utf8ToHex(str) {
      return Array.from(str).map(c => 
        c.charCodeAt(0) < 128 ? c.charCodeAt(0).toString(16) : 
        encodeURIComponent(c).replace(/\%/g,'').toLowerCase()
      ).join('');
    },
    function hexToUtf8: function(hex) {
      return decodeURIComponent('%' + hex.match(/.{1,2}/g).join('%'));
    }

Demo: https://jsfiddle.net/lyquix/k2tjbrvq/


Here you go. :D

"??".split("").reduce((hex,c)=>hex+=c.charCodeAt(0).toString(16).padStart(4,"0"),"")
"6f225b57"

for non unicode

"hi".split("").reduce((hex,c)=>hex+=c.charCodeAt(0).toString(16).padStart(2,"0"),"")
"6869"

ASCII (utf-8) binary HEX string to string

"68656c6c6f20776f726c6421".match(/.{1,2}/g).reduce((acc,char)=>acc+String.fromCharCode(parseInt(char, 16)),"")

String to ASCII (utf-8) binary HEX string

"hello world!".split("").reduce((hex,c)=>hex+=c.charCodeAt(0).toString(16).padStart(2,"0"),"")

--- unicode ---

String to UNICODE (utf-16) binary HEX string

"hello world!".split("").reduce((hex,c)=>hex+=c.charCodeAt(0).toString(16).padStart(4,"0"),"")

UNICODE (utf-16) binary HEX string to string

"00680065006c006c006f00200077006f0072006c00640021".match(/.{1,4}/g).reduce((acc,char)=>acc+String.fromCharCode(parseInt(char, 16)),"")

Examples related to javascript

need to add a class to an element How to make a variable accessible outside a function? Hide Signs that Meteor.js was Used How to create a showdown.js markdown extension Please help me convert this script to a simple image slider Highlight Anchor Links when user manually scrolls? Summing radio input values How to execute an action before close metro app WinJS javascript, for loop defines a dynamic variable name Getting all files in directory with ajax

Examples related to unicode

How to resolve TypeError: can only concatenate str (not "int") to str (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape UnicodeEncodeError: 'ascii' codec can't encode character at special name Python NLTK: SyntaxError: Non-ASCII character '\xc3' in file (Sentiment Analysis -NLP) HTML for the Pause symbol in audio and video control Javascript: Unicode string to hex Concrete Javascript Regex for Accented Characters (Diacritics) Replace non-ASCII characters with a single space UTF-8 in Windows 7 CMD NameError: global name 'unicode' is not defined - in Python 3

Examples related to utf-8

error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte Changing PowerShell's default output encoding to UTF-8 'Malformed UTF-8 characters, possibly incorrectly encoded' in Laravel Encoding Error in Panda read_csv Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings What is the difference between utf8mb4 and utf8 charsets in MySQL? what is <meta charset="utf-8">? Pandas df.to_csv("file.csv" encode="utf-8") still gives trash characters for minus sign UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128) Android Studio : unmappable character for encoding UTF-8

Examples related to hex

Transparent ARGB hex value How to convert a hex string to hex number Javascript: Unicode string to hex Converting Hexadecimal String to Decimal Integer Convert string to hex-string in C# Print a variable in hexadecimal in Python Convert ascii char[] to hexadecimal char[] in C Hex transparency in colors printf() formatting for hex Python Hexadecimal