[java] How to unescape HTML character entities in Java?

Basically, I would like to decode a given HTML document and replace all special characters, such as "&nbsp;" -> " " and "&gt;" -> ">".

In .NET we can make use of HttpUtility.HtmlDecode.

What's the equivalent function in Java?


This did the job for me:

import org.apache.commons.lang.StringEscapeUtils;
...
String decodedXML= StringEscapeUtils.unescapeHtml(encodedXML);

or

import org.apache.commons.lang3.StringEscapeUtils;
...
String decodedXML= StringEscapeUtils.unescapeHtml4(encodedXML);

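I guess it's always better to use lang3, since it is the maintained successor to Commons Lang 2.x. Note that since Commons Lang 3.6, StringEscapeUtils is deprecated in favor of org.apache.commons.text.StringEscapeUtils from Commons Text. Hope this helps :)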


In my case I use the replace method and handle each entity one by one. My code looks like this:

text = text.replace("&Ccedil;", "Ç");
text = text.replace("&ccedil;", "ç");
text = text.replace("&Aacute;", "Á");
text = text.replace("&Acirc;", "Â");
text = text.replace("&Atilde;", "Ã");
text = text.replace("&Eacute;", "É");
text = text.replace("&Ecirc;", "Ê");
text = text.replace("&Iacute;", "Í");
text = text.replace("&Ocirc;", "Ô");
text = text.replace("&Otilde;", "Õ");
text = text.replace("&Oacute;", "Ó");
text = text.replace("&Uacute;", "Ú");
text = text.replace("&aacute;", "á");
text = text.replace("&acirc;", "â");
text = text.replace("&atilde;", "ã");
text = text.replace("&eacute;", "é");
text = text.replace("&ecirc;", "ê");
text = text.replace("&iacute;", "í");
text = text.replace("&ocirc;", "ô");
text = text.replace("&otilde;", "õ");
text = text.replace("&oacute;", "ó");
text = text.replace("&uacute;", "ú");

In my case this worked very well.
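
A chain of replace calls like this can also be written as a lookup table with a loop, which makes it easier to add entities later. The following is only a sketch of that idea: the class name EntityTable, the method name decode, and the subset of entities are assumptions made for illustration, not part of the original answer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class; the name and the entity subset are illustrative.
final class EntityTable {

    // Entity name (without '&' and ';') mapped to the character it stands for.
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("Ccedil", "\u00C7"); // uppercase C, cedilla
        ENTITIES.put("ccedil", "\u00E7"); // lowercase c, cedilla
        ENTITIES.put("Aacute", "\u00C1"); // uppercase A, acute accent
        ENTITIES.put("aacute", "\u00E1"); // lowercase a, acute accent
        ENTITIES.put("Eacute", "\u00C9"); // uppercase E, acute accent
        ENTITIES.put("eacute", "\u00E9"); // lowercase e, acute accent
        ENTITIES.put("atilde", "\u00E3"); // lowercase a, tilde
        ENTITIES.put("otilde", "\u00F5"); // lowercase o, tilde
    }

    // Applies one literal (non-regex) replacement per table entry.
    static String decode(String text) {
        for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
            text = text.replace("&" + e.getKey() + ";", e.getValue());
        }
        return text;
    }
}
```

Calling EntityTable.decode(text) then behaves like the chain above for the listed entities; anything not in the table is left untouched.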


A very simple but inefficient solution without any external library is:

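import java.io.StringReader;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public static String unescapeHtml3( String str ) {
    try {
        // Let Swing's HTML parser read the string as a document body and resolve the entities.
        HTMLDocument doc = new HTMLDocument();
        new HTMLEditorKit().read( new StringReader( "<html><body>" + str ), doc, 0 );
        // Return the plain text of the parsed document.
        return doc.getText( 1, doc.getLength() );
    } catch( Exception ex ) {
        // Fall back to the unmodified input if parsing fails.
        return str;
    }
}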

This should be used only if you have a small number of strings to decode.


The following library can also be used for HTML escaping in Java: unbescape.

HTML can be unescaped this way:

final String unescapedText = HtmlEscape.unescapeHtml(escapedText); 

Consider using the HtmlManipulator Java class. You may need to add some items (not all entities are in the list).

The Apache Commons StringEscapeUtils, as suggested by Kevin Hakanson, did not work 100% for me: several entities such as &#145; (left single quote) were somehow translated into '222'. I also tried org.jsoup and had the same problem.
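
One possible explanation, offered as an assumption rather than a diagnosis of the result above: numeric references in the range 128 to 159 are C1 control characters in Unicode, but many pages use them with their Windows-1252 meaning (145 being the left single quote), and browsers remap them accordingly. A decoder that takes the number literally produces a control character instead of the quote. The sketch below shows that remapping for a single numeric value; the class NumericReference and the method toText are hypothetical names and belong to neither Commons Lang nor jsoup.

```java
// Hypothetical helper; not part of Commons Lang or jsoup.
final class NumericReference {

    // Windows-1252 characters for the values 0x80..0x9F; '\0' marks the
    // five positions that Windows-1252 leaves undefined.
    private static final char[] CP1252 = {
        '\u20AC', '\0',     '\u201A', '\u0192', '\u201E', '\u2026', '\u2020', '\u2021',
        '\u02C6', '\u2030', '\u0160', '\u2039', '\u0152', '\0',     '\u017D', '\0',
        '\0',     '\u2018', '\u2019', '\u201C', '\u201D', '\u2022', '\u2013', '\u2014',
        '\u02DC', '\u2122', '\u0161', '\u203A', '\u0153', '\0',     '\u017E', '\u0178'
    };

    // Returns the text for a numeric reference value, e.g. 145 for "&#145;".
    // Values outside the valid code point range make Character.toChars throw
    // IllegalArgumentException.
    static String toText(int value) {
        if (value >= 0x80 && value <= 0x9F) {
            char mapped = CP1252[value - 0x80];
            if (mapped != '\0') {
                return String.valueOf(mapped);
            }
        }
        return new String(Character.toChars(value));
    }
}
```

With this mapping, NumericReference.toText(145) yields U+2018 (left single quotation mark), while values outside the 128 to 159 range are converted unchanged.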


In case you want to mimic what the PHP function htmlspecialchars_decode does, use the PHP function get_html_translation_table() to dump the table and then use Java code like this:

import java.util.LinkedHashMap;
import java.util.Map;

// Insertion-ordered so that "&amp;" is decoded last; decoding it first would
// turn "&amp;lt;" into "<" instead of "&lt;".
static Map<String,String> html_specialchars_table = new LinkedHashMap<String,String>();
static {
        html_specialchars_table.put("&lt;","<");
        html_specialchars_table.put("&gt;",">");
        html_specialchars_table.put("&amp;","&");
}
static String htmlspecialchars_decode_ENT_NOQUOTES(String s){
        for (Map.Entry<String,String> entry : html_specialchars_table.entrySet()) {
                // replace() matches the key literally; replaceAll() would treat it as a regex
                s = s.replace(entry.getKey(), entry.getValue());
        }
        return s;
}
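
One caveat with replacing the entries one after another: if "&amp;" happens to be handled before the others, input such as "&amp;lt;" is decoded twice and ends up as "<" rather than "&lt;". A single pass over the input avoids depending on the order. The following is a minimal sketch of that alternative using java.util.regex; the class name SpecialCharsDecoder is hypothetical, and the table contains only the three entries shown above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; the table mirrors the three entries above.
final class SpecialCharsDecoder {

    private static final Map<String, String> TABLE = new HashMap<>();
    static {
        TABLE.put("&lt;", "<");
        TABLE.put("&gt;", ">");
        TABLE.put("&amp;", "&");
    }

    // Matches exactly the keys of TABLE.
    private static final Pattern ENTITY = Pattern.compile("&(?:lt|gt|amp);");

    // Scans the input once, so text produced by one replacement is never
    // scanned again and decoded a second time.
    static String decode(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, Matcher.quoteReplacement(TABLE.get(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Here SpecialCharsDecoder.decode("&amp;lt;") returns "&lt;", because the "&" produced by the first match is not examined again.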

I tried Apache Commons StringEscapeUtils.unescapeHtml3() in my project, but wasn't satisfied with its performance. It turns out that it does a lot of unnecessary work. For one, it allocates a StringWriter for every call, even if there's nothing to unescape in the string. I rewrote that code, and it now works much faster. Whoever finds this on Google is welcome to use it.

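The following code unescapes all HTML 3 symbols and numeric escapes (equivalent to Apache unescapeHtml3). You can add more entries to the map if you need HTML 4; in that case also raise MAX_ESCAPE, because some HTML 4 entity names (for example thetasym) are longer than six characters.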

package com.example;

import java.io.StringWriter;
import java.util.HashMap;

public class StringUtils {

    public static final String unescapeHtml3(final String input) {
        StringWriter writer = null;
        int len = input.length();
        int i = 1;
        int st = 0;
        while (true) {
            // look for '&'
            while (i < len && input.charAt(i-1) != '&')
                i++;
            if (i >= len)
                break;

            // found '&', look for ';'
            int j = i;
            while (j < len && j < i + MAX_ESCAPE + 1 && input.charAt(j) != ';')
                j++;
            if (j == len || j < i + MIN_ESCAPE || j == i + MAX_ESCAPE + 1) {
                i++;
                continue;
            }

            // found escape 
            if (input.charAt(i) == '#') {
                // numeric escape
                int k = i + 1;
                int radix = 10;

                final char firstChar = input.charAt(k);
                if (firstChar == 'x' || firstChar == 'X') {
                    k++;
                    radix = 16;
                }

                try {
                    int entityValue = Integer.parseInt(input.substring(k, j), radix);

                    if (writer == null) 
                        writer = new StringWriter(input.length());
                    writer.append(input.substring(st, i - 1));

                    if (entityValue > 0xFFFF) {
                        final char[] chrs = Character.toChars(entityValue);
                        writer.write(chrs[0]);
                        writer.write(chrs[1]);
                    } else {
                        writer.write(entityValue);
                    }

                } catch (NumberFormatException ex) { 
                    i++;
                    continue;
                }
            }
            else {
                // named escape
                CharSequence value = lookupMap.get(input.substring(i, j));
                if (value == null) {
                    i++;
                    continue;
                }

                if (writer == null) 
                    writer = new StringWriter(input.length());
                writer.append(input.substring(st, i - 1));

                writer.append(value);
            }

            // skip escape
            st = j + 1;
            i = st;
        }

        if (writer != null) {
            writer.append(input.substring(st, len));
            return writer.toString();
        }
        return input;
    }

    private static final String[][] ESCAPES = {
        {"\"",     "quot"}, // " - double-quote
        {"&",      "amp"}, // & - ampersand
        {"<",      "lt"}, // < - less-than
        {">",      "gt"}, // > - greater-than

        // Mapping to escape ISO-8859-1 characters to their named HTML 3.x equivalents.
        {"\u00A0", "nbsp"}, // non-breaking space
        {"\u00A1", "iexcl"}, // inverted exclamation mark
        {"\u00A2", "cent"}, // cent sign
        {"\u00A3", "pound"}, // pound sign
        {"\u00A4", "curren"}, // currency sign
        {"\u00A5", "yen"}, // yen sign = yuan sign
        {"\u00A6", "brvbar"}, // broken bar = broken vertical bar
        {"\u00A7", "sect"}, // section sign
        {"\u00A8", "uml"}, // diaeresis = spacing diaeresis
        {"\u00A9", "copy"}, // © - copyright sign
        {"\u00AA", "ordf"}, // feminine ordinal indicator
        {"\u00AB", "laquo"}, // left-pointing double angle quotation mark = left pointing guillemet
        {"\u00AC", "not"}, // not sign
        {"\u00AD", "shy"}, // soft hyphen = discretionary hyphen
        {"\u00AE", "reg"}, // ® - registered trademark sign
        {"\u00AF", "macr"}, // macron = spacing macron = overline = APL overbar
        {"\u00B0", "deg"}, // degree sign
        {"\u00B1", "plusmn"}, // plus-minus sign = plus-or-minus sign
        {"\u00B2", "sup2"}, // superscript two = superscript digit two = squared
        {"\u00B3", "sup3"}, // superscript three = superscript digit three = cubed
        {"\u00B4", "acute"}, // acute accent = spacing acute
        {"\u00B5", "micro"}, // micro sign
        {"\u00B6", "para"}, // pilcrow sign = paragraph sign
        {"\u00B7", "middot"}, // middle dot = Georgian comma = Greek middle dot
        {"\u00B8", "cedil"}, // cedilla = spacing cedilla
        {"\u00B9", "sup1"}, // superscript one = superscript digit one
        {"\u00BA", "ordm"}, // masculine ordinal indicator
        {"\u00BB", "raquo"}, // right-pointing double angle quotation mark = right pointing guillemet
        {"\u00BC", "frac14"}, // vulgar fraction one quarter = fraction one quarter
        {"\u00BD", "frac12"}, // vulgar fraction one half = fraction one half
        {"\u00BE", "frac34"}, // vulgar fraction three quarters = fraction three quarters
        {"\u00BF", "iquest"}, // inverted question mark = turned question mark
        {"\u00C0", "Agrave"}, // À - uppercase A, grave accent
        {"\u00C1", "Aacute"}, // Á - uppercase A, acute accent
        {"\u00C2", "Acirc"}, // Â - uppercase A, circumflex accent
        {"\u00C3", "Atilde"}, // Ã - uppercase A, tilde
        {"\u00C4", "Auml"}, // Ä - uppercase A, umlaut
        {"\u00C5", "Aring"}, // Å - uppercase A, ring
        {"\u00C6", "AElig"}, // Æ - uppercase AE
        {"\u00C7", "Ccedil"}, // Ç - uppercase C, cedilla
        {"\u00C8", "Egrave"}, // È - uppercase E, grave accent
        {"\u00C9", "Eacute"}, // É - uppercase E, acute accent
        {"\u00CA", "Ecirc"}, // Ê - uppercase E, circumflex accent
        {"\u00CB", "Euml"}, // Ë - uppercase E, umlaut
        {"\u00CC", "Igrave"}, // Ì - uppercase I, grave accent
        {"\u00CD", "Iacute"}, // Í - uppercase I, acute accent
        {"\u00CE", "Icirc"}, // Î - uppercase I, circumflex accent
        {"\u00CF", "Iuml"}, // Ï - uppercase I, umlaut
        {"\u00D0", "ETH"}, // Ð - uppercase Eth, Icelandic
        {"\u00D1", "Ntilde"}, // Ñ - uppercase N, tilde
        {"\u00D2", "Ograve"}, // Ò - uppercase O, grave accent
        {"\u00D3", "Oacute"}, // Ó - uppercase O, acute accent
        {"\u00D4", "Ocirc"}, // Ô - uppercase O, circumflex accent
        {"\u00D5", "Otilde"}, // Õ - uppercase O, tilde
        {"\u00D6", "Ouml"}, // Ö - uppercase O, umlaut
        {"\u00D7", "times"}, // multiplication sign
        {"\u00D8", "Oslash"}, // Ø - uppercase O, slash
        {"\u00D9", "Ugrave"}, // Ù - uppercase U, grave accent
        {"\u00DA", "Uacute"}, // Ú - uppercase U, acute accent
        {"\u00DB", "Ucirc"}, // Û - uppercase U, circumflex accent
        {"\u00DC", "Uuml"}, // Ü - uppercase U, umlaut
        {"\u00DD", "Yacute"}, // Ý - uppercase Y, acute accent
        {"\u00DE", "THORN"}, // Þ - uppercase THORN, Icelandic
        {"\u00DF", "szlig"}, // ß - lowercase sharps, German
        {"\u00E0", "agrave"}, // à - lowercase a, grave accent
        {"\u00E1", "aacute"}, // á - lowercase a, acute accent
        {"\u00E2", "acirc"}, // â - lowercase a, circumflex accent
        {"\u00E3", "atilde"}, // ã - lowercase a, tilde
        {"\u00E4", "auml"}, // ä - lowercase a, umlaut
        {"\u00E5", "aring"}, // å - lowercase a, ring
        {"\u00E6", "aelig"}, // æ - lowercase ae
        {"\u00E7", "ccedil"}, // ç - lowercase c, cedilla
        {"\u00E8", "egrave"}, // è - lowercase e, grave accent
        {"\u00E9", "eacute"}, // é - lowercase e, acute accent
        {"\u00EA", "ecirc"}, // ê - lowercase e, circumflex accent
        {"\u00EB", "euml"}, // ë - lowercase e, umlaut
        {"\u00EC", "igrave"}, // ì - lowercase i, grave accent
        {"\u00ED", "iacute"}, // í - lowercase i, acute accent
        {"\u00EE", "icirc"}, // î - lowercase i, circumflex accent
        {"\u00EF", "iuml"}, // ï - lowercase i, umlaut
        {"\u00F0", "eth"}, // ð - lowercase eth, Icelandic
        {"\u00F1", "ntilde"}, // ñ - lowercase n, tilde
        {"\u00F2", "ograve"}, // ò - lowercase o, grave accent
        {"\u00F3", "oacute"}, // ó - lowercase o, acute accent
        {"\u00F4", "ocirc"}, // ô - lowercase o, circumflex accent
        {"\u00F5", "otilde"}, // õ - lowercase o, tilde
        {"\u00F6", "ouml"}, // ö - lowercase o, umlaut
        {"\u00F7", "divide"}, // division sign
        {"\u00F8", "oslash"}, // ø - lowercase o, slash
        {"\u00F9", "ugrave"}, // ù - lowercase u, grave accent
        {"\u00FA", "uacute"}, // ú - lowercase u, acute accent
        {"\u00FB", "ucirc"}, // û - lowercase u, circumflex accent
        {"\u00FC", "uuml"}, // ü - lowercase u, umlaut
        {"\u00FD", "yacute"}, // ý - lowercase y, acute accent
        {"\u00FE", "thorn"}, // þ - lowercase thorn, Icelandic
        {"\u00FF", "yuml"}, // ÿ - lowercase y, umlaut
    };

    private static final int MIN_ESCAPE = 2;
    private static final int MAX_ESCAPE = 6;

    private static final HashMap<String, CharSequence> lookupMap;
    static {
        lookupMap = new HashMap<String, CharSequence>();
        for (final CharSequence[] seq : ESCAPES) 
            lookupMap.put(seq[1].toString(), seq[0]);
    }

}

The libraries mentioned in other answers would be fine solutions, but if you already happen to be digging through real-world HTML in your project, the Jsoup project has a lot more to offer than just managing "ampersand pound FFFF semicolon" things.

// textValue: <p>This is a&nbsp;sample. \"Granny\" Smith &#8211;.<\/p>\r\n
// becomes this: This is a sample. "Granny" Smith –.
// with one line of code:
// Jsoup.parse(textValue).getText(); // for older versions of Jsoup
Jsoup.parse(textValue).text();

// Another possibility may be the static unescapeEntities method:
boolean strictMode = true;
String unescapedString = org.jsoup.parser.Parser.unescapeEntities(textValue, strictMode);

You also get a convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. It's open source under the MIT license.


The most reliable way is with

String cleanedString = StringEscapeUtils.unescapeHtml4(originalString);

from org.apache.commons.lang3.StringEscapeUtils.

And to trim leading and trailing whitespace:

cleanedString = cleanedString.trim();

This ensures that stray whitespace from copying and pasting into web forms does not get persisted in the DB.
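
One detail worth keeping in mind here: "&nbsp;" unescapes to the non-breaking space U+00A0, and String.trim() only removes characters up to U+0020, so a leading or trailing non-breaking space survives the trim. The sketch below converts those characters to ordinary spaces before trimming. The class name WhitespaceCleaner and the method name clean are made up for this example, and replacing non-breaking spaces inside the text as well is an assumption about what is wanted.

```java
// Hypothetical helper; the class and method names are illustrative.
final class WhitespaceCleaner {

    private static final char NBSP = '\u00A0';

    // Converts every non-breaking space to an ordinary space, then trims both ends.
    static String clean(String unescaped) {
        return unescaped.replace(NBSP, ' ').trim();
    }
}
```

It would be applied to the already unescaped string, for example WhitespaceCleaner.clean(cleanedString).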


Spring Framework HtmlUtils

If you're using Spring framework already, use the following method:

import static org.springframework.web.util.HtmlUtils.htmlUnescape;

...

String result = htmlUnescape(source);

