[java] How to convert a string with Unicode encoding to a string of letters

I have a string with escaped Unicode characters, \uXXXX, and I want to convert it to regular Unicode letters. For example:

"\u0048\u0065\u006C\u006C\u006F World"

should become

"Hello World"

I know that when I print the first string it already shows Hello world. My problem is I read file names from a file, and then I search for them. The files names in the file are escaped with Unicode encoding, and when I search for the files, I can't find them, since it searches for a file with \uXXXX in its name.

This question is related to java unicode encoding

The answer is


Shorter version:

public static String unescapeJava(String escaped) {
    if(escaped.indexOf("\\u")==-1)
        return escaped;

    String processed="";

    int position=escaped.indexOf("\\u");
    while(position!=-1) {
        if(position!=0)
            processed+=escaped.substring(0,position);
        String token=escaped.substring(position+2,position+6);
        escaped=escaped.substring(position+6);
        processed+=(char)Integer.parseInt(token,16);
        position=escaped.indexOf("\\u");
    }
    processed+=escaped;

    return processed;
}

The Apache Commons Lang StringEscapeUtils.unescapeJava() can decode it properly.

import org.apache.commons.lang.StringEscapeUtils;

@Test
public void testUnescapeJava() {
    String sJava="\\u0048\\u0065\\u006C\\u006C\\u006F";
    System.out.println("StringEscapeUtils.unescapeJava(sJava):\n" + StringEscapeUtils.unescapeJava(sJava));
}


 output:
 StringEscapeUtils.unescapeJava(sJava):
 Hello

Just wanted to contribute my version, using regex:

private static final String UNICODE_REGEX = "\\\\u([0-9a-f]{4})";
private static final Pattern UNICODE_PATTERN = Pattern.compile(UNICODE_REGEX);
...
String message = "\u0048\u0065\u006C\u006C\u006F World";
Matcher matcher = UNICODE_PATTERN.matcher(message);
StringBuffer decodedMessage = new StringBuffer();
while (matcher.find()) {
  matcher.appendReplacement(
      decodedMessage, String.valueOf((char) Integer.parseInt(matcher.group(1), 16)));
}
matcher.appendTail(decodedMessage);
System.out.println(decodedMessage.toString());

Solution for Kotlin:

val sourceContent = File("test.txt").readText(Charset.forName("windows-1251"))
val result = String(sourceContent.toByteArray())

Kotlin uses UTF-8 everywhere as default encoding.

Method toByteArray() has default argument - Charsets.UTF_8.


I found that many of the answers did not address the issue of "Supplementary Characters". Here is the correct way to support it. No third-party libraries, pure Java implementation.

http://www.oracle.com/us/technologies/java/supplementary-142654.html

public static String fromUnicode(String unicode) {
    String str = unicode.replace("\\", "");
    String[] arr = str.split("u");
    StringBuffer text = new StringBuffer();
    for (int i = 1; i < arr.length; i++) {
        int hexVal = Integer.parseInt(arr[i], 16);
        text.append(Character.toChars(hexVal));
    }
    return text.toString();
}

public static String toUnicode(String text) {
    StringBuffer sb = new StringBuffer();
    for (int i = 0; i < text.length(); i++) {
        int codePoint = text.codePointAt(i);
        // Skip over the second char in a surrogate pair
        if (codePoint > 0xffff) {
            i++;
        }
        String hex = Integer.toHexString(codePoint);
        sb.append("\\u");
        for (int j = 0; j < 4 - hex.length(); j++) {
            sb.append("0");
        }
        sb.append(hex);
    }
    return sb.toString();
}

@Test
public void toUnicode() {
    System.out.println(toUnicode(""));
    System.out.println(toUnicode(""));
    System.out.println(toUnicode("Hello World"));
}
// output:
// \u1f60a
// \u1f970
// \u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064

@Test
public void fromUnicode() {
    System.out.println(fromUnicode("\\u1f60a"));
    System.out.println(fromUnicode("\\u1f970"));
    System.out.println(fromUnicode("\\u0048\\u0065\\u006c\\u006c\\u006f\\u0020\\u0057\\u006f\\u0072\\u006c\\u0064"));
}
// output:
// 
// 
// Hello World

An alternate way of accomplishing this could be to make use of chars() introduced with Java 9, this can be used to iterate over the characters making sure any char which maps to a surrogate code point is passed through uninterpreted. This can be used as:-

String myString = "\u0048\u0065\u006C\u006C\u006F World";
myString.chars().forEach(a -> System.out.print((char)a));
// would print "Hello World"

Fast

 fun unicodeDecode(unicode: String): String {
        val stringBuffer = StringBuilder()
        var i = 0
        while (i < unicode.length) {
            if (i + 1 < unicode.length)
                if (unicode[i].toString() + unicode[i + 1].toString() == "\\u") {
                    val symbol = unicode.substring(i + 2, i + 6)
                    val c = Integer.parseInt(symbol, 16)
                    stringBuffer.append(c.toChar())
                    i += 5
                } else stringBuffer.append(unicode[i])
            i++
        }
        return stringBuffer.toString()
    }

I wrote a performanced and error-proof solution:

public static final String decode(final String in) {
    int p1 = in.indexOf("\\u");
    if (p1 < 0)
        return in;
    StringBuilder sb = new StringBuilder();
    while (true) {
        int p2 = p1 + 6;
        if (p2 > in.length()) {
            sb.append(in.subSequence(p1, in.length()));
            break;
        }
        try {
            int c = Integer.parseInt(in.substring(p1 + 2, p1 + 6), 16);
            sb.append((char) c);
            p1 += 6;
        } catch (Exception e) {
            sb.append(in.subSequence(p1, p1 + 2));
            p1 += 2;
        }
        int p0 = in.indexOf("\\u", p1);
        if (p0 < 0) {
            sb.append(in.subSequence(p1, in.length()));
            break;
        } else {
            sb.append(in.subSequence(p1, p0));
            p1 = p0;
        }
    }
    return sb.toString();
}

try

private static final Charset UTF_8 = Charset.forName("UTF-8");
private String forceUtf8Coding(String input) {return new String(input.getBytes(UTF_8), UTF_8))}

Actually, I wrote an Open Source library that contains some utilities. One of them is converting a Unicode sequence to String and vise-versa. I found it very useful. Here is the quote from the article about this library about Unicode converter:

Class StringUnicodeEncoderDecoder has methods that can convert a String (in any language) into a sequence of Unicode characters and vise-versa. For example a String "Hello World" will be converted into

"\u0048\u0065\u006c\u006c\u006f\u0020 \u0057\u006f\u0072\u006c\u0064"

and may be restored back.

Here is the link to entire article that explains what Utilities the library has and how to get the library to use it. It is available as Maven artifact or as source from Github. It is very easy to use. Open Source Java library with stack trace filtering, Silent String parsing Unicode converter and Version comparison


UnicodeUnescaper from org.apache.commons:commons-text is also acceptable.

new UnicodeUnescaper().translate("\u0048\u0065\u006C\u006C\u006F World") returns "Hello World"


Here is my solution...

                String decodedName = JwtJson.substring(startOfName, endOfName);

                StringBuilder builtName = new StringBuilder();

                int i = 0;

                while ( i < decodedName.length() )
                {
                    if ( decodedName.substring(i).startsWith("\\u"))
                    {
                        i=i+2;
                        builtName.append(Character.toChars(Integer.parseInt(decodedName.substring(i,i+4), 16)));
                        i=i+4;
                    }
                    else
                    {
                        builtName.append(decodedName.charAt(i));
                        i = i+1;
                    }
                };

It's not totally clear from your question, but I'm assuming you saying that you have a file where each line of that file is a filename. And each filename is something like this:

\u0048\u0065\u006C\u006C\u006F

In other words, the characters in the file of filenames are \, u, 0, 0, 4, 8 and so on.

If so, what you're seeing is expected. Java only translates \uXXXX sequences in string literals in source code (and when reading in stored Properties objects). When you read the contents you file you will have a string consisting of the characters \, u, 0, 0, 4, 8 and so on and not the string Hello.

So you will need to parse that string to extract the 0048, 0065, etc. pieces and then convert them to chars and make a string from those chars and then pass that string to the routine that opens the file.


Updates regarding answers suggesting using The Apache Commons Lang's: StringEscapeUtils.unescapeJava() - it was deprecated,

Deprecated. as of 3.6, use commons-text StringEscapeUtils instead

The replacement is Apache Commons Text's StringEscapeUtils.unescapeJava()


StringEscapeUtils from org.apache.commons.lang3 library is deprecated as of 3.6.

So you can use their new commons-text library instead:

compile 'org.apache.commons:commons-text:1.9'

OR

<dependency>
   <groupId>org.apache.commons</groupId>
   <artifactId>commons-text</artifactId>
   <version>1.9</version>
</dependency>

Example code:

org.apache.commons.text.StringEscapeUtils.unescapeJava(escapedString);

one easy way i know using JsonObject:

try {
    JSONObject json = new JSONObject();
    json.put("string", myString);
    String converted = json.getString("string");

} catch (JSONException e) {
    e.printStackTrace();
}

@NominSim There may be other character, so I should detect it by length.

private String forceUtf8Coding(String str) {
    str = str.replace("\\","");
    String[] arr = str.split("u");
    StringBuilder text = new StringBuilder();
    for(int i = 1; i < arr.length; i++){
        String a = arr[i];
        String b = "";
        if (arr[i].length() > 4){
            a = arr[i].substring(0, 4);
            b = arr[i].substring(4);
        }
        int hexVal = Integer.parseInt(a, 16);
        text.append((char) hexVal).append(b);
    }
    return text.toString();
}

You can use StringEscapeUtils from Apache Commons Lang, i.e.:

String Title = StringEscapeUtils.unescapeJava("\\u0048\\u0065\\u006C\\u006C\\u006F");


For Java 9+, you can use the new replaceAll method of Matcher class.

private static final Pattern UNICODE_PATTERN = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

public static String unescapeUnicode(String unescaped) {
    return UNICODE_PATTERN.matcher(unescaped).replaceAll(r -> String.valueOf((char) Integer.parseInt(r.group(1), 16)));
}

public static void main(String[] args) {
    String originalMessage = "\\u0048\\u0065\\u006C\\u006C\\u006F World";
    String unescapedMessage = unescapeUnicode(originalMessage);
    System.out.println(unescapedMessage);
}

I believe the main advantage of this approach over unescapeJava by StringEscapeUtils (besides not using an extra library) is that you can convert only the unicode characters (if you wish), since the latter converts all escaped Java characters (like \n or \t). If you prefer to convert all escaped characters the library is really the best option.


This simple method will work for most cases, but would trip up over something like "u005Cu005C" which should decode to the string "\u0048" but would actually decode "H" as the first pass produces "\u0048" as the working string which then gets processed again by the while loop.

static final String decode(final String in)
{
    String working = in;
    int index;
    index = working.indexOf("\\u");
    while(index > -1)
    {
        int length = working.length();
        if(index > (length-6))break;
        int numStart = index + 2;
        int numFinish = numStart + 4;
        String substring = working.substring(numStart, numFinish);
        int number = Integer.parseInt(substring,16);
        String stringStart = working.substring(0, index);
        String stringEnd   = working.substring(numFinish);
        working = stringStart + ((char)number) + stringEnd;
        index = working.indexOf("\\u");
    }
    return working;
}

Examples related to java

Under what circumstances can I call findViewById with an Options Menu / Action Bar item? How much should a function trust another function How to implement a simple scenario the OO way Two constructors How do I get some variable from another class in Java? this in equals method How to split a string in two and store it in a field How to do perspective fixing? String index out of range: 4 My eclipse won't open, i download the bundle pack it keeps saying error log

Examples related to unicode

How to resolve TypeError: can only concatenate str (not "int") to str (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape UnicodeEncodeError: 'ascii' codec can't encode character at special name Python NLTK: SyntaxError: Non-ASCII character '\xc3' in file (Sentiment Analysis -NLP) HTML for the Pause symbol in audio and video control Javascript: Unicode string to hex Concrete Javascript Regex for Accented Characters (Diacritics) Replace non-ASCII characters with a single space UTF-8 in Windows 7 CMD NameError: global name 'unicode' is not defined - in Python 3

Examples related to encoding

How to check encoding of a CSV file UnicodeEncodeError: 'ascii' codec can't encode character at special name Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings What is the difference between utf8mb4 and utf8 charsets in MySQL? The character encoding of the plain text document was not declared - mootool script UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128) How to encode text to base64 in python UTF-8 output from PowerShell Set Encoding of File to UTF8 With BOM in Sublime Text 3 Replace non-ASCII characters with a single space