How to convert a string with Unicode encoding to a string of letters

Question

I have a string with escaped Unicode characters, \uXXXX, and I want to convert it to regular Unicode letters. For example, "\u0048\u0065\u006C\u006C\u006F World" should become "Hello World".

I know that when I print the first string it already shows "Hello World". My problem is that I read file names from a file and then search for them. The file names in the file are escaped with Unicode encoding, and when I search for the files I can't find them, since the search looks for a file with \uXXXX in its name.

User · Accepted Answer

Technically, writing

    String myString = "\u0048\u0065\u006C\u006C\u006F World";

automatically converts it to "Hello World", so I assume you are reading the string in from some file. To convert it to "Hello" you'll have to parse the text into the separate Unicode digits (take the \uXXXX and just get XXXX), then call Integer.parseInt(XXXX, 16) to get the numeric value, and then cast that to char to get the actual character.

Edit: Some code to accomplish this:

    String str = myString.split(" ")[0];
    str = str.replace("\\", "");
    String[] arr = str.split("u");
    String text = "";
    for (int i = 1; i < arr.length; i++) {
        int hexVal = Integer.parseInt(arr[i], 16);
        text += (char) hexVal;
    }
    // text will now hold "Hello"

User · Answer

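I wrote a performant and error-tolerant solution:

    public static final String decode(final String in) {
        int p1 = in.indexOf("\\u");
        if (p1 < 0)
            return in;
        StringBuilder sb = new StringBuilder();
        // Keep any text that precedes the first escape
        sb.append(in, 0, p1);
        while (true) {
            int p2 = p1 + 6;
            if (p2 > in.length()) {
                // Not enough characters left for a full escape
                sb.append(in.subSequence(p1, in.length()));
                break;
            }
            try {
                int c = Integer.parseInt(in.substring(p1 + 2, p1 + 6), 16);
                sb.append((char) c);
                p1 += 6;
            } catch (Exception e) {
                // Not valid hex: copy the two marker characters through unchanged
                sb.append(in.subSequence(p1, p1 + 2));
                p1 += 2;
            }
            int p0 = in.indexOf("\\u", p1);
            if (p0 < 0) {
                sb.append(in.subSequence(p1, in.length()));
                break;
            } else {
                sb.append(in.subSequence(p1, p0));
                p1 = p0;
            }
        }
        return sb.toString();
    }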

User · Answer

Actually, I wrote an open source library that contains some utilities. One of them converts a Unicode sequence to a String and vice versa. I found it very useful. Here is the quote about the Unicode converter from the article on this library:

"Class StringUnicodeEncoderDecoder has methods that can convert a String (in any language) into a sequence of Unicode characters and vice versa. For example, a String "Hello World" will be converted into "\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064" and may be restored back."

The article, "Open Source Java library with stack trace filtering, Silent String parsing, Unicode converter and Version comparison", explains what utilities the library has and how to get it. It is available as a Maven artifact or as source from GitHub, and it is very easy to use.
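For readers who only want to see the round trip described in that quote, here is a minimal plain-Java sketch. It does not use the library and says nothing about how StringUnicodeEncoderDecoder is implemented; the class and method names below are hypothetical, and decode assumes its input consists solely of six-character escapes.

```java
final class UnicodeEscapeSketch {
    /** Encodes every UTF-16 code unit as backslash, 'u', and four lowercase hex digits. */
    static String encode(String text) {
        StringBuilder sb = new StringBuilder(text.length() * 6);
        for (int i = 0; i < text.length(); i++) {
            sb.append(String.format("\\u%04x", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    /** Decodes a string made up only of such six-character escapes (assumption). */
    static String decode(String escaped) {
        StringBuilder sb = new StringBuilder(escaped.length() / 6);
        for (int i = 0; i + 6 <= escaped.length(); i += 6) {
            // Skip the two marker characters and parse the four hex digits
            sb.append((char) Integer.parseInt(escaped.substring(i + 2, i + 6), 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String escaped = encode("Hello World");
        System.out.println(escaped);
        System.out.println(decode(escaped)); // prints Hello World
    }
}
```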

User · Answer

Try:

    private static final Charset UTF_8 = Charset.forName("UTF-8");

    private String forceUtf8Coding(String input) {
        return new String(input.getBytes(UTF_8), UTF_8);
    }

User · Answer

An alternate way of accomplishing this could be to make use of chars(), introduced with Java 9. It can be used to iterate over the characters, making sure any char which maps to a surrogate code point is passed through uninterpreted. It can be used as:

    String myString = "\u0048\u0065\u006C\u006C\u006F World";
    myString.chars().forEach(a -> System.out.print((char) a));
    // would print "Hello World"

User · Answer

The Apache Commons Lang StringEscapeUtils.unescapeJava() can decode it properly.

    import org.apache.commons.lang.StringEscapeUtils;

    @Test
    public void testUnescapeJava() {
        String sJava = "\\u0048\\u0065\\u006C\\u006C\\u006F";
        System.out.println("StringEscapeUtils.unescapeJava(sJava):\n" + StringEscapeUtils.unescapeJava(sJava));
    }

Output:

    StringEscapeUtils.unescapeJava(sJava):
    Hello

User · Answer

It's not totally clear from your question, but I'm assuming you're saying that you have a file where each line is a filename, and each filename is something like this:

    \u0048\u0065\u006C\u006C\u006F

In other words, the characters in the file of filenames are \, u, 0, 0, 4, 8 and so on.

If so, what you're seeing is expected. Java only translates \uXXXX sequences in string literals in source code (and when reading in stored Properties objects). When you read the contents of your file, you will have a string consisting of the characters \, u, 0, 0, 4, 8 and so on, and not the string Hello.

So you will need to parse that string to extract the 0048, 0065, etc. pieces, convert them to chars, make a string from those chars, and then pass that string to the routine that opens the file.
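The steps described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the list file name filenames.txt, the class name, and the helper names are hypothetical, the list is assumed to be UTF-8 with one escaped name per line, and names are resolved against the current directory.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class EscapedFileNames {
    /** Replaces each backslash-u escape that has four hex digits; all other text is copied as-is. */
    static String unescape(String line) {
        StringBuilder out = new StringBuilder(line.length());
        int i = 0;
        while (i < line.length()) {
            if (line.startsWith("\\u", i) && i + 6 <= line.length() && isHex(line, i + 2)) {
                out.append((char) Integer.parseInt(line.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(line.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    /** True if the four characters starting at 'from' are all hex digits. */
    private static boolean isHex(String s, int from) {
        for (int k = from; k < from + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) return false;
        }
        return true;
    }

    /** Decodes every listed name and resolves it against the given directory. */
    static List<Path> resolveAll(List<String> escapedNames, Path directory) {
        List<Path> result = new ArrayList<>();
        for (String name : escapedNames) {
            result.add(directory.resolve(unescape(name.trim())));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical list file; pass a different path as the first argument if needed.
        Path list = Paths.get(args.length > 0 ? args[0] : "filenames.txt");
        if (!Files.exists(list)) {
            System.out.println("No list file at " + list);
            return;
        }
        for (Path p : resolveAll(Files.readAllLines(list), Paths.get("."))) {
            System.out.println(p + " exists: " + Files.exists(p));
        }
    }
}
```

Text that merely looks like an escape but lacks four hex digits is left untouched here, which seemed the safer choice for file names.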

User · Answer

UnicodeUnescaper from org.apache.commons:commons-text is also acceptable:

    new UnicodeUnescaper().translate("\\u0048\\u0065\\u006C\\u006C\\u006F World")

returns "Hello World".

User · Answer

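A fast version in Kotlin:

    fun unicodeDecode(unicode: String): String {
        val stringBuffer = StringBuilder()
        var i = 0
        while (i < unicode.length) {
            // An escape needs six characters: backslash, 'u', and four hex digits
            if (i + 5 < unicode.length && unicode[i] == '\\' && unicode[i + 1] == 'u') {
                val symbol = unicode.substring(i + 2, i + 6)
                val c = Integer.parseInt(symbol, 16)
                stringBuffer.append(c.toChar())
                i += 5
            } else {
                // Copy everything else through, including the final character
                stringBuffer.append(unicode[i])
            }
            i++
        }
        return stringBuffer.toString()
    }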

User · Answer

This simple method will work for most cases, but would trip up over something like "\u005Cu0048", which should decode to the string "\u0048" but would actually decode to "H", because the first pass produces "\u0048" as the working string, which then gets processed again by the while loop.

    static final String decode(final String in) {
        String working = in;
        int index;
        index = working.indexOf("\\u");
        while (index > -1) {
            int length = working.length();
            if (index > (length - 6)) break;
            int numStart = index + 2;
            int numFinish = numStart + 4;
            String substring = working.substring(numStart, numFinish);
            int number = Integer.parseInt(substring, 16);
            String stringStart = working.substring(0, index);
            String stringEnd = working.substring(numFinish);
            working = stringStart + ((char) number) + stringEnd;
            index = working.indexOf("\\u");
        }
        return working;
    }
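One way to avoid the double-decoding problem described in this answer is to write the output to a separate buffer and never scan it again. The following is a minimal sketch of that idea, not the answer author's code; the class name is hypothetical, and sequences without four valid hex digits are simply copied through.

```java
final class SinglePassDecoder {
    /** Decodes escapes in one left-to-right pass; decoded output is never re-scanned. */
    static String decode(String in) {
        StringBuilder out = new StringBuilder(in.length());
        int pos = 0;
        while (true) {
            int esc = in.indexOf("\\u", pos);
            if (esc < 0 || esc + 6 > in.length()) {
                // No further complete escape: copy the remainder and finish
                out.append(in, pos, in.length());
                return out.toString();
            }
            out.append(in, pos, esc);
            if (isHex(in, esc + 2)) {
                out.append((char) Integer.parseInt(in.substring(esc + 2, esc + 6), 16));
                pos = esc + 6;
            } else {
                // Not a valid escape: keep the two marker characters as literal text
                out.append("\\u");
                pos = esc + 2;
            }
        }
    }

    /** True if the four characters starting at 'from' are all hex digits. */
    private static boolean isHex(String s, int from) {
        for (int k = from; k < from + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // The input decodes to a backslash followed by the literal text u0048
        System.out.println(decode("\\u005Cu0048"));
        System.out.println(decode("\\u0048ello")); // prints Hello
    }
}
```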

User · Answer

StringEscapeUtils from the org.apache.commons.lang3 library is deprecated as of 3.6, so you can use the new commons-text library instead:

    compile 'org.apache.commons:commons-text:1.9'

OR

    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-text</artifactId>
        <version>1.9</version>
    </dependency>

Example code:

    org.apache.commons.text.StringEscapeUtils.unescapeJava(escapedString);

User · Answer

For Java 9+, you can use the new replaceAll method of the Matcher class:

    private static final Pattern UNICODE_PATTERN = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    public static String unescapeUnicode(String unescaped) {
        // quoteReplacement keeps a decoded '\' or '$' from being read as replacement syntax
        return UNICODE_PATTERN.matcher(unescaped).replaceAll(r ->
                Matcher.quoteReplacement(String.valueOf((char) Integer.parseInt(r.group(1), 16))));
    }

    public static void main(String[] args) {
        String originalMessage = "\\u0048\\u0065\\u006C\\u006C\\u006F World";
        String unescapedMessage = unescapeUnicode(originalMessage);
        System.out.println(unescapedMessage);
    }

I believe the main advantage of this approach over unescapeJava from StringEscapeUtils (besides not needing an extra library) is that you can convert only the Unicode escapes, if you wish, since the latter converts all escaped Java characters (like \n or \t). If you prefer to convert all escaped characters, the library is really the best option.

User · Answer

Here is my solution:

    String decodedName = JwtJson.substring(startOfName, endOfName);
    StringBuilder builtName = new StringBuilder();

    int i = 0;
    while (i < decodedName.length()) {
        if (decodedName.substring(i).startsWith("\\u")) {
            i = i + 2;
            builtName.append(Character.toChars(Integer.parseInt(decodedName.substring(i, i + 4), 16)));
            i = i + 4;
        } else {
            builtName.append(decodedName.charAt(i));
            i = i + 1;
        }
    }

User · Answer

Just wanted to contribute my version, using regex:

    // Accept both upper- and lowercase hex digits so that \u006C matches too
    private static final String UNICODE_REGEX = "\\\\u([0-9a-fA-F]{4})";
    private static final Pattern UNICODE_PATTERN = Pattern.compile(UNICODE_REGEX);

    String message = "\\u0048\\u0065\\u006C\\u006C\\u006F World";
    Matcher matcher = UNICODE_PATTERN.matcher(message);
    StringBuffer decodedMessage = new StringBuffer();
    while (matcher.find()) {
        matcher.appendReplacement(
                decodedMessage, String.valueOf((char) Integer.parseInt(matcher.group(1), 16)));
    }
    matcher.appendTail(decodedMessage);
    System.out.println(decodedMessage.toString());

User · Answer

An update regarding the answers that suggest using Apache Commons Lang's StringEscapeUtils.unescapeJava(): it was deprecated.

"Deprecated. As of 3.6, use commons-text StringEscapeUtils instead."

The replacement is Apache Commons Text's StringEscapeUtils.unescapeJava().

User · Answer

I found that many of the answers did not address the issue of "Supplementary Characters". Here is the correct way to support it. No third-party libraries, pure Java implementation.

http://www.oracle.com/us/technologies/java/supplementary-142654.html

    public static String fromUnicode(String unicode) {
        String str = unicode.replace("\\", "");
        String[] arr = str.split("u");
        StringBuffer text = new StringBuffer();
        for (int i = 1; i < arr.length; i++) {
            int hexVal = Integer.parseInt(arr[i], 16);
            text.append(Character.toChars(hexVal));
        }
        return text.toString();
    }

    public static String toUnicode(String text) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < text.length(); i++) {
            int codePoint = text.codePointAt(i);
            // Skip over the second char in a surrogate pair
            if (codePoint > 0xffff) {
                i++;
            }
            String hex = Integer.toHexString(codePoint);
            sb.append("\\u");
            for (int j = 0; j < 4 - hex.length(); j++) {
                sb.append("0");
            }
            sb.append(hex);
        }
        return sb.toString();
    }

    @Test
    public void toUnicode() {
        System.out.println(toUnicode("😊"));
        System.out.println(toUnicode("🥰"));
        System.out.println(toUnicode("Hello World"));
    }

Output:

    \u1f60a
    \u1f970
    \u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064

    @Test
    public void fromUnicode() {
        System.out.println(fromUnicode("\\u1f60a"));
        System.out.println(fromUnicode("\\u1f970"));
        System.out.println(fromUnicode("\\u0048\\u0065\\u006c\\u006c\\u006f\\u0020\\u0057\\u006f\\u0072\\u006c\\u0064"));
    }

Output:

    😊
    🥰
    Hello World
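Note that the five-digit form used in this answer is specific to its own encoder. Escapes written in the Java style carry a supplementary character as two four-digit escapes, one per surrogate. The sketch below illustrates that variant under that assumption; the class and method names are hypothetical, and fromEscapes assumes every backslash-u marker is followed by four valid hex digits.

```java
final class SurrogatePairSketch {
    /** Writes each UTF-16 code unit as a four-digit escape; supplementary characters yield two. */
    static String toEscapes(String text) {
        StringBuilder sb = new StringBuilder();
        for (char unit : text.toCharArray()) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    /** Decodes four-digit escapes unit by unit; adjacent surrogates recombine on their own. */
    static String fromEscapes(String escaped) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < escaped.length()) {
            if (escaped.startsWith("\\u", i) && i + 6 <= escaped.length()) {
                sb.append((char) Integer.parseInt(escaped.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                sb.append(escaped.charAt(i));
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F60A is built from its code point to avoid source-encoding issues
        String smiley = new String(Character.toChars(0x1F60A));
        String escaped = toEscapes(smiley);
        System.out.println(escaped);
        System.out.println(Integer.toHexString(fromEscapes(escaped).codePointAt(0))); // prints 1f60a
    }
}
```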

User · Answer

You can use StringEscapeUtils from Apache Commons Lang, i.e.:

    String Title = StringEscapeUtils.unescapeJava("\\u0048\\u0065\\u006C\\u006C\\u006F");

User · Answer

One easy way I know is using JSONObject:

    try {
        JSONObject json = new JSONObject();
        json.put("string", myString);
        String converted = json.getString("string");
    } catch (JSONException e) {
        e.printStackTrace();
    }

User · Answer

@NominSim There may be other characters after an escape, so I should detect it by length:

    private String forceUtf8Coding(String str) {
        str = str.replace("\\", "");
        String[] arr = str.split("u");
        StringBuilder text = new StringBuilder();
        for (int i = 1; i < arr.length; i++) {
            String a = arr[i];
            String b = "";
            if (arr[i].length() > 4) {
                // First four characters are the hex digits; the rest is plain text
                a = arr[i].substring(0, 4);
                b = arr[i].substring(4);
            }
            int hexVal = Integer.parseInt(a, 16);
            text.append((char) hexVal).append(b);
        }
        return text.toString();
    }

User · Answer

Solution for Kotlin:

    val sourceContent = File("test.txt").readText(Charset.forName("windows-1251"))
    val result = String(sourceContent.toByteArray())

Kotlin uses UTF-8 everywhere as the default encoding. The toByteArray() method has a default argument, Charsets.UTF_8.

User · Answer

Shorter version:

    public static String unescapeJava(String escaped) {
        if (escaped.indexOf("\\u") == -1)
            return escaped;

        String processed = "";

        int position = escaped.indexOf("\\u");
        while (position != -1) {
            if (position != 0)
                processed += escaped.substring(0, position);
            String token = escaped.substring(position + 2, position + 6);
            escaped = escaped.substring(position + 6);
            processed += (char) Integer.parseInt(token, 16);
            position = escaped.indexOf("\\u");
        }
        processed += escaped;

        return processed;
    }
