How to unescape a Java string literal in Java

Question

I'm processing some Java source code using Java. I'm extracting the string literals and feeding them to a function taking a String. The problem is that I need to pass the unescaped version of the String to the function (i.e. this means converting \n to a newline, and \\ to a single \, etc.).

Is there a function inside the Java API that does this? If not, can I obtain such functionality from some library? Obviously the Java compiler has to do this conversion.

In case anyone wants to know, I'm trying to un-obfuscate string literals in decompiled obfuscated Java files.

User · Accepted Answer

The Problem

The org.apache.commons.lang.StringEscapeUtils.unescapeJava() given here as another answer is really very little help at all.

- It forgets about \0 for null.
- It doesn't handle octal at all.
- It can't handle the sorts of escapes admitted by java.util.regex.Pattern.compile() and everything that uses it, including \a, \e, and especially \cX.
- It has no support for logical Unicode code points by number, only for UTF-16.
- This looks like UCS-2 code, not UTF-16 code: they use the depreciated charAt interface instead of the codePoint interface, thus promulgating the delusion that a Java char is guaranteed to hold a Unicode character. It's not. They only get away with this because no UTF-16 surrogate will wind up looking for anything they're looking for.

The Solution

I wrote a string unescaper which solves the OP's question without all the irritations of the Apache code.

```java
/*
 *
 * unescape_perl_string()
 *
 *      Tom Christiansen <tchrist@perl.com>
 *      Sun Nov 28 12:55:24 MST 2010
 *
 * It's completely ridiculous that there's no standard
 * unescape_java_string function.  Since I have to do the
 * damn thing myself, I might as well make it halfway useful
 * by supporting things Java was too stupid to consider in
 * strings:
 *
 *   => "?" items  are additions to Java string escapes
 *                 but normal in Java regexes
 *
 *   => "!" items  are also additions to Java regex escapes
 *
 * Standard singletons: ?\a ?\e \f \n \r \t
 *
 *      NB: \b is unsupported as backspace so it can pass-through
 *          to the regex translator untouched; I refuse to make anyone
 *          doublebackslash it as doublebackslashing is a Java idiocy
 *          I desperately wish would die out.  There are plenty of
 *          other ways to write it:
 *
 *              \cH, \12, \012, \x08 \x{8}, \u0008, \U00000008
 *
 * Octal escapes: \0 \0N \0NN \N \NN \NNN
 *    Can range up to !\777 not \377
 *
 *      TODO: add !\o{NNNNN}
 *          last Unicode is 4177777
 *          maxint is 37777777777
 *
 * Control chars: ?\cX
 *      Means: ord(X) ^ ord('@')
 *
 * Old hex escapes: \xXX
 *      unbraced must be 2 xdigits
 *
 * Perl hex escapes: !\x{XXX} braced may be 1-8 xdigits
 *       NB: proper Unicode never needs more than 6, as highest
 *           valid codepoint is 0x10FFFF, not maxint 0xFFFFFFFF
 *
 * Lame Java escape: \[IDIOT JAVA PREPROCESSOR]uXXXX must be
 *                   exactly 4 xdigits;
 *
 *       I can't write XXXX in this comment where it belongs
 *       because the damned Java Preprocessor can't mind its
 *       own business.  Idiots!
 *
 * Lame Python escape: !\UXXXXXXXX must be exactly 8 xdigits
 *
 * TODO: Perl translation escapes: \Q \U \L \E \[IDIOT JAVA PREPROCESSOR]u \l
 *       These are not so important to cover if you're passing the
 *       result to Pattern.compile(), since it handles them for you
 *       further downstream.  Hm, what about \[IDIOT JAVA PREPROCESSOR]u?
 *
 */

public final static
String unescape_perl_string(String oldstr) {

    /*
     * In contrast to fixing Java's broken regex charclasses,
     * this one need be no bigger, as unescaping shrinks the string
     * here, where in the other one, it grows it.
     */

    StringBuffer newstr = new StringBuffer(oldstr.length());

    boolean saw_backslash = false;

    for (int i = 0; i < oldstr.length(); i++) {
        int cp = oldstr.codePointAt(i);
        if (oldstr.codePointAt(i) > Character.MAX_VALUE) {
            i++; /****WE HATES UTF-16! WE HATES IT FOREVERSES!!!****/
        }

        if (!saw_backslash) {
            if (cp == '\\') {
                saw_backslash = true;
            } else {
                newstr.append(Character.toChars(cp));
            }
            continue; /* switch */
        }

        if (cp == '\\') {
            saw_backslash = false;
            newstr.append('\\');
            newstr.append('\\');
            continue; /* switch */
        }

        switch (cp) {

            case 'r':  newstr.append('\r');
                       break; /* switch */

            case 'n':  newstr.append('\n');
                       break; /* switch */

            case 'f':  newstr.append('\f');
                       break; /* switch */

            /* PASS a \b THROUGH!! */
            case 'b':  newstr.append("\\b");
                       break; /* switch */

            case 't':  newstr.append('\t');
                       break; /* switch */

            case 'a':  newstr.append('\007');
                       break; /* switch */

            case 'e':  newstr.append('\033');
                       break; /* switch */

            /*
             * A "control" character is what you get when you xor its
             * codepoint with '@'==64.  This only makes sense for ASCII,
             * and may not yield a "control" character after all.
             *
             * Strange but true: "\c{" is ";", "\c}" is "=", etc.
             */
            case 'c':   {
                if (++i == oldstr.length()) { die("trailing \\c"); }
                cp = oldstr.codePointAt(i);
                /*
                 * don't need to grok surrogates, as next line blows them up
                 */
                if (cp > 0x7f) { die("expected ASCII after \\c"); }
                newstr.append(Character.toChars(cp ^ 64));
                break; /* switch */
            }

            case '8':
            case '9': die("illegal octal digit");
                      /* NOTREACHED */

    /*
     * may be 0 to 2 octal digits following this one
     * so back up one for fallthrough to next case;
     * unread this digit and fall through to next case.
     */
            case '1':
            case '2':
            case '3':
            case '4':
            case '5':
            case '6':
            case '7': --i;
                      /* FALLTHROUGH */

            /*
             * Can have 0, 1, or 2 octal digits following a 0
             * this permits larger values than octal 377, up to
             * octal 777.
             */
            case '0': {
                if (i+1 == oldstr.length()) {
                    /* found \0 at end of string */
                    newstr.append(Character.toChars(0));
                    break; /* switch */
                }
                i++;
                int digits = 0;
                int j;
                for (j = 0; j <= 2; j++) {
                    if (i+j == oldstr.length()) {
                        break; /* for */
                    }
                    /* safe because will unread surrogate */
                    int ch = oldstr.charAt(i+j);
                    if (ch < '0' || ch > '7') {
                        break; /* for */
                    }
                    digits++;
                }
                if (digits == 0) {
                    --i;
                    newstr.append('\0');
                    break; /* switch */
                }
                int value = 0;
                try {
                    value = Integer.parseInt(
                                oldstr.substring(i, i+digits), 8);
                } catch (NumberFormatException nfe) {
                    die("invalid octal value for \\0 escape");
                }
                newstr.append(Character.toChars(value));
                i += digits-1;
                break; /* switch */
            } /* end case '0' */

            case 'x':  {
                if (i+2 > oldstr.length()) {
                    die("string too short for \\x escape");
                }
                i++;
                boolean saw_brace = false;
                if (oldstr.charAt(i) == '{') {
                        /* ^^^^^^ ok to ignore surrogates here */
                    i++;
                    saw_brace = true;
                }
                int j;
                for (j = 0; j < 8; j++) {

                    if (!saw_brace && j == 2) {
                        break;  /* for */
                    }

                    /*
                     * ASCII test also catches surrogates
                     */
                    int ch = oldstr.charAt(i+j);
                    if (ch > 127) {
                        die("illegal non-ASCII hex digit in \\x escape");
                    }

                    if (saw_brace && ch == '}') { break; /* for */ }

                    if (! ( (ch >= '0' && ch <= '9')
                                ||
                            (ch >= 'a' && ch <= 'f')
                                ||
                            (ch >= 'A' && ch <= 'F')
                          )
                       )
                    {
                        die(String.format(
                            "illegal hex digit #%d '%c' in \\x", ch, ch));
                    }

                }
                if (j == 0) { die("empty braces in \\x{} escape"); }
                int value = 0;
                try {
                    value = Integer.parseInt(oldstr.substring(i, i+j), 16);
                } catch (NumberFormatException nfe) {
                    die("invalid hex value for \\x escape");
                }
                newstr.append(Character.toChars(value));
                if (saw_brace) { j++; }
                i += j-1;
                break; /* switch */
            }

            case 'u': {
                if (i+4 > oldstr.length()) {
                    die("string too short for \\u escape");
                }
                i++;
                int j;
                for (j = 0; j < 4; j++) {
                    /* this also handles the surrogate issue */
                    if (oldstr.charAt(i+j) > 127) {
                        die("illegal non-ASCII hex digit in \\u escape");
                    }
                }
                int value = 0;
                try {
                    value = Integer.parseInt( oldstr.substring(i, i+j), 16);
                } catch (NumberFormatException nfe) {
                    die("invalid hex value for \\u escape");
                }
                newstr.append(Character.toChars(value));
                i += j-1;
                break; /* switch */
            }

            case 'U': {
                if (i+8 > oldstr.length()) {
                    die("string too short for \\U escape");
                }
                i++;
                int j;
                for (j = 0; j < 8; j++) {
                    /* this also handles the surrogate issue */
                    if (oldstr.charAt(i+j) > 127) {
                        die("illegal non-ASCII hex digit in \\U escape");
                    }
                }
                int value = 0;
                try {
                    value = Integer.parseInt(oldstr.substring(i, i+j), 16);
                } catch (NumberFormatException nfe) {
                    die("invalid hex value for \\U escape");
                }
                newstr.append(Character.toChars(value));
                i += j-1;
                break; /* switch */
            }

            default:   newstr.append('\\');
                       newstr.append(Character.toChars(cp));
           /*
            * say(String.format(
            *       "DEFAULT unrecognized escape %c passed through",
            *       cp));
            */
                       break; /* switch */

        }
        saw_backslash = false;
    }

    /* weird to leave one at the end */
    if (saw_backslash) {
        newstr.append('\\');
    }

    return newstr.toString();
}

/*
 * Return a string "U+XX.XXX.XXXX" etc, where each XX set is the
 * xdigits of the logical Unicode code point. No bloody brain-damaged
 * UTF-16 surrogate crap, just true logical characters.
 */
public final static
String uniplus(String s) {
    if (s.length() == 0) {
        return "";
    }
    /* This is just the minimum; sb will grow as needed. */
    StringBuffer sb = new StringBuffer(2 + 3 * s.length());
    sb.append("U+");
    for (int i = 0; i < s.length(); i++) {
        sb.append(String.format("%X", s.codePointAt(i)));
        if (s.codePointAt(i) > Character.MAX_VALUE) {
            i++; /****WE HATES UTF-16! WE HATES IT FOREVERSES!!!****/
        }
        if (i+1 < s.length()) {
            sb.append(".");
        }
    }
    return sb.toString();
}

private static final
void die(String foa) {
    throw new IllegalArgumentException(foa);
}

private static final
void say(String what) {
    System.out.println(what);
}
```

If it helps others, you're welcome to it, no strings attached. If you improve it, I'd love for you to mail me your enhancements, but you certainly don't have to.
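The charAt-versus-codePoint complaint above can be illustrated with a small sketch. The class name `CodePointDemo` and the sample string are made up for illustration and are not part of the answer's code; the sketch only shows why a loop over `char` values can miscount characters outside the Basic Multilingual Plane.

```java
class CodePointDemo {
    /** Counts logical Unicode code points by stepping over surrogate pairs. */
    static int countCodePoints(String s) {
        int count = 0;
        // Character.charCount returns 2 for a supplementary code point, else 1.
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so UTF-16 stores it as two chars.
        String s = "a" + new String(Character.toChars(0x1F600));
        System.out.println("UTF-16 code units: " + s.length());
        System.out.println("code points:       " + countCodePoints(s));
    }
}
```

Advancing by `Character.charCount` is an alternative to the answer's "increment i again when the code point exceeds Character.MAX_VALUE" idiom; both skip the second half of a surrogate pair.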

User · Answer

org.apache.commons.lang3.StringEscapeUtils from commons-lang3 is marked deprecated now. You can use org.apache.commons.text.StringEscapeUtils#unescapeJava(String) instead. It requires an additional Maven dependency:

```xml
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-text</artifactId>
    <version>1.4</version>
</dependency>
```

and seems to handle some more special cases. For example, it unescapes:

- escaped backslashes, single and double quotes
- escaped octal and unicode values
- \\b, \\n, \\t, \\f, \\r

User · Answer

See StringEscapeUtils from http://commons.apache.org/lang/:

StringEscapeUtils.unescapeJava(String str)

User · Answer

Java 13 added a method which does this: String#translateEscapes.

It was a preview feature in Java 13 and 14, but was promoted to a full feature in Java 15.
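A minimal sketch of how this might be used, assuming Java 15 or later. The class name `TranslateEscapesDemo` and the helper `unescape` are hypothetical wrappers added for illustration; only `String.translateEscapes()` is the real API. Note that, per its Javadoc, the method does not translate Unicode escapes such as \u2022, since the compiler handles those at an earlier stage.

```java
class TranslateEscapesDemo {
    /** Hypothetical helper: unescapes the body of a string literal (without its surrounding quotes). */
    static String unescape(String literalBody) {
        // Handles \b \t \n \f \r \s \" \' \\ and octal escapes (Java 15+).
        return literalBody.translateEscapes();
    }

    public static void main(String[] args) {
        // In source, each "\\" is one backslash in the runtime string.
        String raw = "a\\tb\\n\\\"c\\\" \\101";
        System.out.println(raw);            // the escaped form, as extracted from source
        System.out.println(unescape(raw));  // the unescaped form
    }
}
```

If the extracted literals can contain \uXXXX escapes, those would need to be decoded separately before or after this call.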

User · Answer

I came across the same problem, but I wasn't enamoured by any of the solutions I found here. So, I wrote one that iterates over the characters of the string using a matcher to find and replace the escape sequences. This solution assumes properly formatted input. That is, it happily skips over nonsensical escapes, and it decodes Unicode escapes for line feed and carriage return (which otherwise cannot appear in a character literal or a string literal, due to the definition of such literals and the order of translation phases for Java source). Apologies, the code is a bit packed for brevity.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Decoder {

    // The encoded character of each character escape.
    // This array functions as the keys of a sorted map, from encoded characters to decoded characters.
    static final char[] ENCODED_ESCAPES = { '\"', '\'', '\\',  'b',  'f',  'n',  'r',  't' };

    // The decoded character of each character escape.
    // This array functions as the values of a sorted map, from encoded characters to decoded characters.
    static final char[] DECODED_ESCAPES = { '\"', '\'', '\\', '\b', '\f', '\n', '\r', '\t' };

    // A pattern that matches an escape.
    // What follows the escape indicator is captured by group 1=character 2=octal 3=Unicode.
    static final Pattern PATTERN = Pattern.compile("\\\\(?:(b|t|n|f|r|\\\"|\\\'|\\\\)|((?:[0-3]?[0-7])?[0-7])|u+(\\p{XDigit}{4}))");

    public static CharSequence decodeString(CharSequence encodedString) {
        Matcher matcher = PATTERN.matcher(encodedString);
        StringBuffer decodedString = new StringBuffer();
        // Find each escape of the encoded string in succession.
        while (matcher.find()) {
            char ch;
            if (matcher.start(1) >= 0) {
                // Decode a character escape.
                ch = DECODED_ESCAPES[Arrays.binarySearch(ENCODED_ESCAPES, matcher.group(1).charAt(0))];
            } else if (matcher.start(2) >= 0) {
                // Decode an octal escape.
                ch = (char)(Integer.parseInt(matcher.group(2), 8));
            } else /* if (matcher.start(3) >= 0) */ {
                // Decode a Unicode escape.
                ch = (char)(Integer.parseInt(matcher.group(3), 16));
            }
            // Replace the escape with the decoded character.
            matcher.appendReplacement(decodedString, Matcher.quoteReplacement(String.valueOf(ch)));
        }
        // Append the remainder of the encoded string to the decoded string.
        // The remainder is the longest suffix of the encoded string such that the suffix contains no escapes.
        matcher.appendTail(decodedString);
        return decodedString;
    }

    public static void main(String... args) {
        System.out.println(decodeString(args[0]));
    }
}
```

I should note that Apache Commons Lang3 doesn't seem to suffer the weaknesses indicated in the accepted solution. That is, StringEscapeUtils seems to handle octal escapes and multiple u characters of Unicode escapes. That means unless you have some burning reason to avoid Apache Commons, you should probably use it rather than my solution (or any other solution here).

User · Answer

For the record, if you use Scala, you can do:

StringContext.treatEscapes(escaped)

User · Answer

I know this question is old, but I wanted a solution that doesn't involve libraries outside those included in JRE6 (i.e. Apache Commons is not acceptable), and I came up with a simple solution using the built-in java.io.StreamTokenizer:

```java
import java.io.*;

// ...

String literal = "\"Has \\\"\\\\\\\t\\\" & isn\\\'t \\\r\\\n on 1 line.\"";
StreamTokenizer parser = new StreamTokenizer(new StringReader(literal));
String result;
try {
  parser.nextToken();
  if (parser.ttype == '"') {
    result = parser.sval;
  }
  else {
    result = "ERROR!";
  }
}
catch (IOException e) {
  result = e.toString();
}
System.out.println(result);
```

Output:

```
Has "\  " & isn't
 on 1 line.
```

User · Answer

If you are reading unicode escaped chars from a file, then you will have a tough time doing that because the string will be read literally along with an escape for the backslash:

my_file.txt

```
Blah blah...
Column delimiter=;
Word delimiter=\u0020 #This is just unicode for whitespace

.. more stuff
```

Here, when you read line 3 from the file, the string/line will have:

```
"Word delimiter=\u0020 #This is just unicode for whitespace"
```

and the char[] in the string will show:

```
{...., '=', '\\', 'u', '0', '0', '2', '0', ' ', '#', 't', 'h', ...}
```

Commons StringUnescape will not unescape this for you (I tried unescapeXml()). You'll have to do it manually as described here.

So, the sub-string "\u0020" should become 1 single char '\u0020'.

But if you are using this "\u0020" to do String.split(... columnDelimiterReadFromFile), which is really using regex internally, it will work directly because the string read from file was escaped and is perfect to use in the regex pattern!! (Confused?)
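The manual decoding mentioned above could look roughly like the following sketch. The class and method names (`UnicodeEscapeDecoder`, `decode`) are made up for illustration, and the sketch assumes every backslash-u followed by four hex digits in the line is meant as an escape; it does not check whether the backslash is itself escaped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapeDecoder {
    // A backslash, one or more 'u', then exactly four hex digits.
    private static final Pattern UNICODE_ESCAPE =
            Pattern.compile("\\\\u+([0-9A-Fa-f]{4})");

    /** Replaces each \uXXXX sequence in the line with the single char it denotes. */
    static String decode(String line) {
        Matcher m = UNICODE_ESCAPE.matcher(line);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' and '\' in the decoded char.
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // What a reader would return for line 3 of the example file.
        String line = "Word delimiter=\\u0020 #This is just unicode for whitespace";
        System.out.println("[" + decode(line) + "]");
    }
}
```

Surrogate pairs written as two consecutive escapes come out as two chars, which together still form the intended code point in a Java String.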

User · Answer

I'm a little late on this, but I thought I'd provide my solution since I needed the same functionality. I decided to use the Java Compiler API, which makes it slower but makes the results accurate. Basically, I create a class on the fly and then return the results. Here is the method:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.net.URI;
import java.util.Collections;
import javax.tools.JavaCompiler;
import javax.tools.JavaCompiler.CompilationTask;
import javax.tools.JavaFileManager;
import javax.tools.JavaFileObject;
import javax.tools.JavaFileObject.Kind;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

// StringUtils below is presumably the class that contains this method.
public static String[] unescapeJavaStrings(String... escaped) {
    //class name
    final String className = "Temp" + System.currentTimeMillis();
    //build the source
    final StringBuilder source = new StringBuilder(100 + escaped.length * 20).
            append("public class ").append(className).append(" {\n").
            append("\tpublic static String[] getStrings() {\n").
            append("\t\treturn new String[] {\n");
    for (String string : escaped) {
        source.append("\t\t\t\"");
        //we escape non-escaped quotes here to be safe
        //  (but something like \\" will fail, oh well for now)
        for (int i = 0; i < string.length(); i++) {
            char chr = string.charAt(i);
            if (chr == '"' && i > 0 && string.charAt(i - 1) != '\\') {
                source.append('\\');
            }
            source.append(chr);
        }
        source.append("\",\n");
    }
    source.append("\t\t};\n\t}\n}\n");
    //obtain compiler
    final JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
    //local stream for output
    final ByteArrayOutputStream out = new ByteArrayOutputStream();
    //local stream for error
    ByteArrayOutputStream err = new ByteArrayOutputStream();
    //source file
    JavaFileObject sourceFile = new SimpleJavaFileObject(
            URI.create("string:///" + className + Kind.SOURCE.extension), Kind.SOURCE) {
        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) throws IOException {
            return source;
        }
    };
    //target file
    final JavaFileObject targetFile = new SimpleJavaFileObject(
            URI.create("string:///" + className + Kind.CLASS.extension), Kind.CLASS) {
        @Override
        public OutputStream openOutputStream() throws IOException {
            return out;
        }
    };
    //file manager proxy, with most parts delegated to the standard one
    JavaFileManager fileManagerProxy = (JavaFileManager) Proxy.newProxyInstance(
            StringUtils.class.getClassLoader(), new Class[] { JavaFileManager.class },
            new InvocationHandler() {
                //standard file manager to delegate to
                private final JavaFileManager standard =
                    compiler.getStandardFileManager(null, null, null);
                @Override
                public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
                    if ("getJavaFileForOutput".equals(method.getName())) {
                        //return the target file when it's asking for output
                        return targetFile;
                    } else {
                        return method.invoke(standard, args);
                    }
                }
            });
    //create the task
    CompilationTask task = compiler.getTask(new OutputStreamWriter(err),
            fileManagerProxy, null, null, null, Collections.singleton(sourceFile));
    //call it
    if (!task.call()) {
        throw new RuntimeException("Compilation failed, output:\n" +
                new String(err.toByteArray()));
    }
    //get the result
    final byte[] bytes = out.toByteArray();
    //load class
    Class<?> clazz;
    try {
        //custom class loader for garbage collection
        clazz = new ClassLoader() {
            protected Class<?> findClass(String name) throws ClassNotFoundException {
                if (name.equals(className)) {
                    return defineClass(className, bytes, 0, bytes.length);
                } else {
                    return super.findClass(name);
                }
            }
        }.loadClass(className);
    } catch (ClassNotFoundException e) {
        throw new RuntimeException(e);
    }
    //reflectively call method
    try {
        return (String[]) clazz.getDeclaredMethod("getStrings").invoke(null);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}
```

It takes an array so you can unescape in batches. So the following simple test succeeds:

```java
public static void main(String[] meh) {
    if ("1\02\03\n".equals(unescapeJavaStrings("1\\02\\03\\n")[0])) {
        System.out.println("Success");
    } else {
        System.out.println("Failure");
    }
}
```

User · Answer

You can use the String unescapeJava(String) method of StringEscapeUtils from Apache Commons Lang.

Here's an example snippet:

```java
    String in = "a\\tb\\n\\\"c\\\"";

    System.out.println(in);
    // a\tb\n\"c\"

    String out = StringEscapeUtils.unescapeJava(in);

    System.out.println(out);
    // a    b
    // "c"
```

The utility class has methods to escape and unescape strings for Java, JavaScript, HTML, XML, and SQL. It also has overloads that write directly to a java.io.Writer.

Caveats

It looks like StringEscapeUtils handles Unicode escapes with one u, but not octal escapes, or Unicode escapes with extraneous us.

```java
    /* Unicode escape test #1: PASS */

    System.out.println(
        "\u0030"
    ); // 0
    System.out.println(
        StringEscapeUtils.unescapeJava("\\u0030")
    ); // 0
    System.out.println(
        "\u0030".equals(StringEscapeUtils.unescapeJava("\\u0030"))
    ); // true

    /* Octal escape test: FAIL */

    System.out.println(
        "\45"
    ); // %
    System.out.println(
        StringEscapeUtils.unescapeJava("\\45")
    ); // 45
    System.out.println(
        "\45".equals(StringEscapeUtils.unescapeJava("\\45"))
    ); // false

    /* Unicode escape test #2: FAIL */

    System.out.println(
        "\uu0030"
    ); // 0
    System.out.println(
        StringEscapeUtils.unescapeJava("\\uu0030")
    ); // throws NestableRuntimeException:
       //   Unable to parse unicode value: u003
```

A quote from the JLS:

> Octal escapes are provided for compatibility with C, but can express only Unicode values \u0000 through \u00FF, so Unicode escapes are usually preferred.

If your string can contain octal escapes, you may want to convert them to Unicode escapes first, or use another approach.

The extraneous u is also documented as follows:

> The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u - for example, \uxxxx becomes \uuxxxx - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.
>
> This transformed version is equally acceptable to a compiler for the Java programming language and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.

If your string can contain Unicode escapes with extraneous u, then you may also need to preprocess this before using StringEscapeUtils.

Alternatively, you can try to write your own Java string literal unescaper from scratch, making sure to follow the exact JLS specifications.

References

- JLS 3.3 Unicode Escapes
- JLS 3.10.6 Escape Sequences for Character and String Literals
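The preprocessing suggested above (collapsing extraneous u's and rewriting octal escapes as Unicode escapes before handing the string to an unescaper) might be sketched as follows. The class `EscapePreprocessor` and its `normalize` method are hypothetical names, not part of Apache Commons. The octal rule assumed here is the JLS one: up to three digits when the first is 0-3, otherwise up to two.

```java
class EscapePreprocessor {
    /** Rewrites \uu...uXXXX as \\uXXXX-style single-u escapes and octal escapes as \\uXXXX. */
    static String normalize(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c != '\\' || i + 1 == s.length()) {
                // Ordinary character, or a lone trailing backslash: copy as-is.
                out.append(c);
                i++;
                continue;
            }
            char next = s.charAt(i + 1);
            if (next == 'u') {
                // Collapse a run of u's into a single u; the hex digits follow untouched.
                int j = i + 1;
                while (j < s.length() && s.charAt(j) == 'u') {
                    j++;
                }
                out.append("\\u");
                i = j;
            } else if (next >= '0' && next <= '7') {
                // Octal escape: at most 3 digits if the first is 0-3, else at most 2.
                int max = (next <= '3') ? 3 : 2;
                int j = i + 1;
                int value = 0;
                while (j < s.length() && j - (i + 1) < max
                        && s.charAt(j) >= '0' && s.charAt(j) <= '7') {
                    value = value * 8 + (s.charAt(j) - '0');
                    j++;
                }
                out.append(String.format("\\u%04x", value));
                i = j;
            } else {
                // Any other escape, including an escaped backslash, is copied as a pair.
                out.append(c).append(next);
                i += 2;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("\\uu0030")); // extraneous u collapsed
        System.out.println(normalize("\\45"));     // octal rewritten as a Unicode escape
    }
}
```

Copying other escapes as pairs means an escaped backslash followed by a literal u is left alone rather than mistaken for a Unicode escape.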

User · Answer

I came across a similar problem, also wasn't satisfied with the presented solutions, and implemented this one myself.

Also available as a Gist on GitHub:

```java
/**
 * Unescapes a string that contains standard Java escape sequences.
 * <ul>
 * <li><strong>&#92;b &#92;f &#92;n &#92;r &#92;t &#92;" &#92;'</strong> :
 * BS, FF, NL, CR, TAB, double and single quote.</li>
 * <li><strong>&#92;X &#92;XX &#92;XXX</strong> : Octal character
 * specification (0 - 377, 0x00 - 0xFF).</li>
 * <li><strong>&#92;uXXXX</strong> : Hexadecimal based Unicode character.</li>
 * </ul>
 *
 * @param st
 *            A string optionally containing standard java escape sequences.
 * @return The translated string.
 */
public String unescapeJavaString(String st) {

    StringBuilder sb = new StringBuilder(st.length());

    for (int i = 0; i < st.length(); i++) {
        char ch = st.charAt(i);
        if (ch == '\\') {
            char nextChar = (i == st.length() - 1) ? '\\' : st
                    .charAt(i + 1);
            // Octal escape?
            if (nextChar >= '0' && nextChar <= '7') {
                String code = "" + nextChar;
                i++;
                if ((i < st.length() - 1) && st.charAt(i + 1) >= '0'
                        && st.charAt(i + 1) <= '7') {
                    code += st.charAt(i + 1);
                    i++;
                    if ((i < st.length() - 1) && st.charAt(i + 1) >= '0'
                            && st.charAt(i + 1) <= '7') {
                        code += st.charAt(i + 1);
                        i++;
                    }
                }
                sb.append((char) Integer.parseInt(code, 8));
                continue;
            }
            switch (nextChar) {
            case '\\':
                ch = '\\';
                break;
            case 'b':
                ch = '\b';
                break;
            case 'f':
                ch = '\f';
                break;
            case 'n':
                ch = '\n';
                break;
            case 'r':
                ch = '\r';
                break;
            case 't':
                ch = '\t';
                break;
            case '\"':
                ch = '\"';
                break;
            case '\'':
                ch = '\'';
                break;
            // Hex Unicode: u????
            case 'u':
                if (i >= st.length() - 5) {
                    ch = 'u';
                    break;
                }
                int code = Integer.parseInt(
                        "" + st.charAt(i + 2) + st.charAt(i + 3)
                                + st.charAt(i + 4) + st.charAt(i + 5), 16);
                sb.append(Character.toChars(code));
                i += 5;
                continue;
            }
            i++;
        }
        sb.append(ch);
    }
    return sb.toString();
}
```
