How to unescape HTML character entities in Java

Question

Basically, I would like to decode a given HTML document and replace all special characters, such as "&nbsp;" -> " " and "&gt;" -> ">". In .NET we can make use of HttpUtility.HtmlDecode. What's the equivalent function in Java?

User · Accepted Answer

I have used the Apache Commons StringEscapeUtils.unescapeHtml4() for this:

Unescapes a string containing entity escapes to a string containing the actual Unicode characters corresponding to the escapes. Supports HTML 4.0 entities.

User · Answer

In my case I use the replace method, testing every entity in every variable. My code looks like this:

    text = text.replace("&Ccedil;", "Ç");
    text = text.replace("&ccedil;", "ç");
    text = text.replace("&Aacute;", "Á");
    text = text.replace("&Acirc;", "Â");
    text = text.replace("&Atilde;", "Ã");
    text = text.replace("&Eacute;", "É");
    text = text.replace("&Ecirc;", "Ê");
    text = text.replace("&Iacute;", "Í");
    text = text.replace("&Ocirc;", "Ô");
    text = text.replace("&Otilde;", "Õ");
    text = text.replace("&Oacute;", "Ó");
    text = text.replace("&Uacute;", "Ú");
    text = text.replace("&aacute;", "á");
    text = text.replace("&acirc;", "â");
    text = text.replace("&atilde;", "ã");
    text = text.replace("&eacute;", "é");
    text = text.replace("&ecirc;", "ê");
    text = text.replace("&iacute;", "í");
    text = text.replace("&ocirc;", "ô");
    text = text.replace("&otilde;", "õ");
    text = text.replace("&oacute;", "ó");
    text = text.replace("&uacute;", "ú");

In my case this worked very well.
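The same approach can be kept in a lookup table so that adding an entity is a one-line change. This is only a minimal sketch: the class name `EntityTable` and the method `decode` are made up for illustration, and only a few of the entities from the list are shown (written as `\u` escapes so the source file encoding does not matter).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: table-driven version of the chained replace() calls.
final class EntityTable {
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        // Only a subset is listed here; extend as needed.
        ENTITIES.put("&Ccedil;", "\u00C7");
        ENTITIES.put("&ccedil;", "\u00E7");
        ENTITIES.put("&Aacute;", "\u00C1");
        ENTITIES.put("&aacute;", "\u00E1");
        ENTITIES.put("&atilde;", "\u00E3");
        ENTITIES.put("&eacute;", "\u00E9");
    }

    // Replaces every known entity; unknown entities are left untouched.
    static String decode(String text) {
        for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
            text = text.replace(e.getKey(), e.getValue());
        }
        return text;
    }
}
```

Like the chained calls, this scans the whole string once per entity, so it is fine for short strings and a short table but not for large documents.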

User · Answer

Consider using the HtmlManipulator Java class. You may need to add some items (not all entities are in the list).

The Apache Commons StringEscapeUtils, as suggested by Kevin Hakanson, did not work 100% for me; several entities like &#145; (left single quote) were translated into '222' somehow. I also tried org.jsoup and had the same problem.
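One possible explanation for the &#145; problem: in Unicode, code points 128 to 159 are control characters, whereas pages that contain &#145; usually mean the Windows-1252 character at byte 0x91, the left single quotation mark (U+2018). A decoder that maps the number straight to a code point is therefore faithful to the number but not to what the author intended. The sketch below shows one way to remap that range after parsing the number; `Cp1252Fix` and `remap` are hypothetical names, not part of any library mentioned here.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of any library mentioned above.
final class Cp1252Fix {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Takes the numeric value of a character reference and returns the code
    // point to emit. Values 0x80..0x9F are reinterpreted as Windows-1252 bytes.
    static int remap(int value) {
        if (value < 0x80 || value > 0x9F) {
            return value;
        }
        String decoded = new String(new byte[] {(byte) value}, CP1252);
        int mapped = decoded.codePointAt(0);
        // Bytes with no Windows-1252 assignment are left unchanged.
        return mapped == 0xFFFD ? value : mapped;
    }
}
```

With this, `remap(145)` yields U+2018, which can then be appended with `Character.toChars`.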

User · Answer

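A very simple but inefficient solution without any external library is:

    import java.io.StringReader;
    import javax.swing.text.html.HTMLDocument;
    import javax.swing.text.html.HTMLEditorKit;

    public static String unescapeHtml3(String str) {
        try {
            // Let the Swing HTML parser decode the entities for us.
            HTMLDocument doc = new HTMLDocument();
            new HTMLEditorKit().read(new StringReader("<html><body>" + str), doc, 0);
            return doc.getText(1, doc.getLength());
        } catch (Exception ex) {
            // On any parsing problem, fall back to the original string.
            return str;
        }
    }

This should be used only if you have a small number of strings to decode.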

User · Answer

Spring Framework HtmlUtils

If you're using the Spring framework already, use the following method:

    import static org.springframework.web.util.HtmlUtils.htmlUnescape;

    ...

    String result = htmlUnescape(source);

User · Answer

I tried Apache Commons StringEscapeUtils.unescapeHtml3() in my project, but wasn't satisfied with its performance. It turns out it does a lot of unnecessary operations. For one, it allocates a StringWriter for every call, even if there's nothing to unescape in the string.

I've rewritten that code differently, and now it works much faster. Whoever finds this on Google is welcome to use it.

The following code unescapes all HTML 3 symbols and numeric escapes (equivalent to Apache unescapeHtml3). You can just add more entries to the map if you need HTML 4.

    package com.example;

    import java.io.StringWriter;
    import java.util.HashMap;

    public class StringUtils {

        public static final String unescapeHtml3(final String input) {
            StringWriter writer = null;
            int len = input.length();
            int i = 1;
            int st = 0;
            while (true) {
                // look for '&'
                while (i < len && input.charAt(i-1) != '&')
                    i++;
                if (i >= len)
                    break;

                // found '&', look for ';'
                int j = i;
                while (j < len && j < i + MAX_ESCAPE + 1 && input.charAt(j) != ';')
                    j++;
                if (j == len || j < i + MIN_ESCAPE || j == i + MAX_ESCAPE + 1) {
                    i++;
                    continue;
                }

                // found escape
                if (input.charAt(i) == '#') {
                    // numeric escape
                    int k = i + 1;
                    int radix = 10;

                    final char firstChar = input.charAt(k);
                    if (firstChar == 'x' || firstChar == 'X') {
                        k++;
                        radix = 16;
                    }

                    try {
                        int entityValue = Integer.parseInt(input.substring(k, j), radix);

                        if (writer == null)
                            writer = new StringWriter(input.length());
                        writer.append(input.substring(st, i - 1));

                        if (entityValue > 0xFFFF) {
                            final char[] chrs = Character.toChars(entityValue);
                            writer.write(chrs[0]);
                            writer.write(chrs[1]);
                        } else {
                            writer.write(entityValue);
                        }

                    } catch (NumberFormatException ex) {
                        i++;
                        continue;
                    }
                }
                else {
                    // named escape
                    CharSequence value = lookupMap.get(input.substring(i, j));
                    if (value == null) {
                        i++;
                        continue;
                    }

                    if (writer == null)
                        writer = new StringWriter(input.length());
                    writer.append(input.substring(st, i - 1));

                    writer.append(value);
                }

                // skip escape
                st = j + 1;
                i = st;
            }

            if (writer != null) {
                writer.append(input.substring(st, len));
                return writer.toString();
            }
            return input;
        }

        private static final String[][] ESCAPES = {
            {"\"",     "quot"}, // " - double-quote
            {"&",      "amp"}, // & - ampersand
            {"<",      "lt"}, // < - less-than
            {">",      "gt"}, // > - greater-than

            // Mapping to escape ISO-8859-1 characters to their named HTML 3.x equivalents.
            {"\u00A0", "nbsp"}, // non-breaking space
            {"\u00A1", "iexcl"}, // inverted exclamation mark
            {"\u00A2", "cent"}, // cent sign
            {"\u00A3", "pound"}, // pound sign
            {"\u00A4", "curren"}, // currency sign
            {"\u00A5", "yen"}, // yen sign = yuan sign
            {"\u00A6", "brvbar"}, // broken bar = broken vertical bar
            {"\u00A7", "sect"}, // section sign
            {"\u00A8", "uml"}, // diaeresis = spacing diaeresis
            {"\u00A9", "copy"}, // © - copyright sign
            {"\u00AA", "ordf"}, // feminine ordinal indicator
            {"\u00AB", "laquo"}, // left-pointing double angle quotation mark = left pointing guillemet
            {"\u00AC", "not"}, // not sign
            {"\u00AD", "shy"}, // soft hyphen = discretionary hyphen
            {"\u00AE", "reg"}, // ® - registered trademark sign
            {"\u00AF", "macr"}, // macron = spacing macron = overline = APL overbar
            {"\u00B0", "deg"}, // degree sign
            {"\u00B1", "plusmn"}, // plus-minus sign = plus-or-minus sign
            {"\u00B2", "sup2"}, // superscript two = superscript digit two = squared
            {"\u00B3", "sup3"}, // superscript three = superscript digit three = cubed
            {"\u00B4", "acute"}, // acute accent = spacing acute
            {"\u00B5", "micro"}, // micro sign
            {"\u00B6", "para"}, // pilcrow sign = paragraph sign
            {"\u00B7", "middot"}, // middle dot = Georgian comma = Greek middle dot
            {"\u00B8", "cedil"}, // cedilla = spacing cedilla
            {"\u00B9", "sup1"}, // superscript one = superscript digit one
            {"\u00BA", "ordm"}, // masculine ordinal indicator
            {"\u00BB", "raquo"}, // right-pointing double angle quotation mark = right pointing guillemet
            {"\u00BC", "frac14"}, // vulgar fraction one quarter = fraction one quarter
            {"\u00BD", "frac12"}, // vulgar fraction one half = fraction one half
            {"\u00BE", "frac34"}, // vulgar fraction three quarters = fraction three quarters
            {"\u00BF", "iquest"}, // inverted question mark = turned question mark
            {"\u00C0", "Agrave"}, // À - uppercase A, grave accent
            {"\u00C1", "Aacute"}, // Á - uppercase A, acute accent
            {"\u00C2", "Acirc"}, // Â - uppercase A, circumflex accent
            {"\u00C3", "Atilde"}, // Ã - uppercase A, tilde
            {"\u00C4", "Auml"}, // Ä - uppercase A, umlaut
            {"\u00C5", "Aring"}, // Å - uppercase A, ring
            {"\u00C6", "AElig"}, // Æ - uppercase AE
            {"\u00C7", "Ccedil"}, // Ç - uppercase C, cedilla
            {"\u00C8", "Egrave"}, // È - uppercase E, grave accent
            {"\u00C9", "Eacute"}, // É - uppercase E, acute accent
            {"\u00CA", "Ecirc"}, // Ê - uppercase E, circumflex accent
            {"\u00CB", "Euml"}, // Ë - uppercase E, umlaut
            {"\u00CC", "Igrave"}, // Ì - uppercase I, grave accent
            {"\u00CD", "Iacute"}, // Í - uppercase I, acute accent
            {"\u00CE", "Icirc"}, // Î - uppercase I, circumflex accent
            {"\u00CF", "Iuml"}, // Ï - uppercase I, umlaut
            {"\u00D0", "ETH"}, // Ð - uppercase Eth, Icelandic
            {"\u00D1", "Ntilde"}, // Ñ - uppercase N, tilde
            {"\u00D2", "Ograve"}, // Ò - uppercase O, grave accent
            {"\u00D3", "Oacute"}, // Ó - uppercase O, acute accent
            {"\u00D4", "Ocirc"}, // Ô - uppercase O, circumflex accent
            {"\u00D5", "Otilde"}, // Õ - uppercase O, tilde
            {"\u00D6", "Ouml"}, // Ö - uppercase O, umlaut
            {"\u00D7", "times"}, // multiplication sign
            {"\u00D8", "Oslash"}, // Ø - uppercase O, slash
            {"\u00D9", "Ugrave"}, // Ù - uppercase U, grave accent
            {"\u00DA", "Uacute"}, // Ú - uppercase U, acute accent
            {"\u00DB", "Ucirc"}, // Û - uppercase U, circumflex accent
            {"\u00DC", "Uuml"}, // Ü - uppercase U, umlaut
            {"\u00DD", "Yacute"}, // Ý - uppercase Y, acute accent
            {"\u00DE", "THORN"}, // Þ - uppercase THORN, Icelandic
            {"\u00DF", "szlig"}, // ß - lowercase sharps, German
            {"\u00E0", "agrave"}, // à - lowercase a, grave accent
            {"\u00E1", "aacute"}, // á - lowercase a, acute accent
            {"\u00E2", "acirc"}, // â - lowercase a, circumflex accent
            {"\u00E3", "atilde"}, // ã - lowercase a, tilde
            {"\u00E4", "auml"}, // ä - lowercase a, umlaut
            {"\u00E5", "aring"}, // å - lowercase a, ring
            {"\u00E6", "aelig"}, // æ - lowercase ae
            {"\u00E7", "ccedil"}, // ç - lowercase c, cedilla
            {"\u00E8", "egrave"}, // è - lowercase e, grave accent
            {"\u00E9", "eacute"}, // é - lowercase e, acute accent
            {"\u00EA", "ecirc"}, // ê - lowercase e, circumflex accent
            {"\u00EB", "euml"}, // ë - lowercase e, umlaut
            {"\u00EC", "igrave"}, // ì - lowercase i, grave accent
            {"\u00ED", "iacute"}, // í - lowercase i, acute accent
            {"\u00EE", "icirc"}, // î - lowercase i, circumflex accent
            {"\u00EF", "iuml"}, // ï - lowercase i, umlaut
            {"\u00F0", "eth"}, // ð - lowercase eth, Icelandic
            {"\u00F1", "ntilde"}, // ñ - lowercase n, tilde
            {"\u00F2", "ograve"}, // ò - lowercase o, grave accent
            {"\u00F3", "oacute"}, // ó - lowercase o, acute accent
            {"\u00F4", "ocirc"}, // ô - lowercase o, circumflex accent
            {"\u00F5", "otilde"}, // õ - lowercase o, tilde
            {"\u00F6", "ouml"}, // ö - lowercase o, umlaut
            {"\u00F7", "divide"}, // division sign
            {"\u00F8", "oslash"}, // ø - lowercase o, slash
            {"\u00F9", "ugrave"}, // ù - lowercase u, grave accent
            {"\u00FA", "uacute"}, // ú - lowercase u, acute accent
            {"\u00FB", "ucirc"}, // û - lowercase u, circumflex accent
            {"\u00FC", "uuml"}, // ü - lowercase u, umlaut
            {"\u00FD", "yacute"}, // ý - lowercase y, acute accent
            {"\u00FE", "thorn"}, // þ - lowercase thorn, Icelandic
            {"\u00FF", "yuml"}, // ÿ - lowercase y, umlaut
        };

        private static final int MIN_ESCAPE = 2;
        private static final int MAX_ESCAPE = 6;

        private static final HashMap<String, CharSequence> lookupMap;
        static {
            lookupMap = new HashMap<String, CharSequence>();
            for (final CharSequence[] seq : ESCAPES)
                lookupMap.put(seq[1].toString(), seq[0]);
        }

    }
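A note on the numeric branch: a Java `char` holds only 16 bits, so a numeric escape above 0xFFFF has to be written as two chars (a surrogate pair), which is what `Character.toChars` produces. The sketch below isolates that step; `CodePointDemo` and `fromCodePoint` are hypothetical names used only for illustration.

```java
// Hypothetical helper that isolates the code-point-to-string step.
final class CodePointDemo {
    // Character.toChars returns one char for BMP code points and a
    // surrogate pair (two chars) for supplementary code points.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }
}
```

For example, `fromCodePoint(0x2013)` is a one-char string, while `fromCodePoint(0x1F600)` has length 2.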

User · Answer

The most reliable way is with

    String cleanedString = StringEscapeUtils.unescapeHtml4(originalString);

from org.apache.commons.lang3.StringEscapeUtils.

And to strip leading and trailing whitespace:

    cleanedString = cleanedString.trim();

This ensures that whitespace introduced by copying and pasting into web forms does not get persisted in the DB.

User · Answer

In case you want to mimic what the PHP function htmlspecialchars_decode does, use the PHP function get_html_translation_table() to dump the table and then use Java code like:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Insertion order matters: "&amp;" must be decoded last, otherwise
    // "&amp;lt;" would be decoded twice and end up as "<".
    static Map<String, String> html_specialchars_table = new LinkedHashMap<String, String>();
    static {
        html_specialchars_table.put("&lt;", "<");
        html_specialchars_table.put("&gt;", ">");
        html_specialchars_table.put("&amp;", "&");
    }

    static String htmlspecialchars_decode_ENT_NOQUOTES(String s) {
        for (Map.Entry<String, String> entry : html_specialchars_table.entrySet()) {
            // Plain replace: the keys are literal strings, not regular expressions.
            s = s.replace(entry.getKey(), entry.getValue());
        }
        return s;
    }
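Replacing entities one after another is sensitive to the order in which the table is walked: if "&amp;" is handled before "&lt;", the input "&amp;lt;" ends up as "<" instead of "&lt;". An alternative that avoids the ordering question is to decode in a single pass, so that already-decoded output is never scanned again. This is a minimal sketch covering only the three entities above; `SpecialCharsDecoder` and `decode` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: single-pass decoding of &lt; &gt; and &amp;.
final class SpecialCharsDecoder {
    private static final Map<String, String> TABLE = new HashMap<>();
    static {
        TABLE.put("&lt;", "<");
        TABLE.put("&gt;", ">");
        TABLE.put("&amp;", "&");
    }

    private static final Pattern ENTITY = Pattern.compile("&(?:lt|gt|amp);");

    static String decode(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuilder out = new StringBuilder(s.length());
        int last = 0;
        while (m.find()) {
            // Copy the text before the entity, then the decoded character.
            out.append(s, last, m.start()).append(TABLE.get(m.group()));
            last = m.end();
        }
        // Copy whatever follows the last entity.
        return out.append(s, last, s.length()).toString();
    }
}
```

Because the matcher only moves forward over the input, `decode("&amp;lt;")` returns `&lt;` regardless of how the table is ordered.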

User · Answer

The following library can also be used for HTML escaping in Java: unbescape.

HTML can be unescaped this way:

    final String unescapedText = HtmlEscape.unescapeHtml(escapedText);

User · Answer

This did the job for me:

    import org.apache.commons.lang.StringEscapeUtils;
    ...
    String decodedXML = StringEscapeUtils.unescapeHtml(encodedXML);

or

    import org.apache.commons.lang3.StringEscapeUtils;
    ...
    String decodedXML = StringEscapeUtils.unescapeHtml4(encodedXML);

I guess it's always better to use lang3, since it is the newer of the two. Hope this helps.

User · Answer

The libraries mentioned in other answers would be fine solutions, but if you already happen to be digging through real-world HTML in your project, the Jsoup project has a lot more to offer than just managing "ampersand pound FFFF semicolon" things.

    // textValue: <p>This is a&nbsp;sample. \"Granny\" Smith &#8211;.<\/p>\r\n
    // becomes this: This is a sample. "Granny" Smith –.
    // with one line of code:
    // Jsoup.parse(textValue).getText(); // for older versions of Jsoup
    Jsoup.parse(textValue).text();

    // Another possibility may be the static unescapeEntities method:
    boolean strictMode = true;
    String unescapedString = org.jsoup.parser.Parser.unescapeEntities(textValue, strictMode);

And you also get the convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. It's open source and MIT licensed.
