Recommended method for escaping HTML in Java

Question

Is there a recommended way to escape <, >, " and & characters when outputting HTML in plain Java code? (Other than manually doing the following, that is.)

    String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
    // & has to be replaced first, otherwise the & of "&lt;" is escaped a second time
    String escaped = source.replace("&", "&amp;").replace("<", "&lt;"); // ...

User · Answer

Nice short method:

    public static String escapeHTML(String s) {
        StringBuilder out = new StringBuilder(Math.max(16, s.length()));
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 127 || c == '"' || c == '\'' || c == '<' || c == '>' || c == '&') {
                // write the character as a decimal numeric reference, e.g. &#60;
                out.append("&#");
                out.append((int) c);
                out.append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

Based on https://stackoverflow.com/a/8838023/1199155 (the amp is missing there). The apostrophe aside, the four characters checked in the if clause (", <, > and &) are the only ones below 128 listed in http://www.w3.org/TR/html4/sgml/entities.html
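One thing the char-based loop does not cover is text outside the Basic Multilingual Plane: a supplementary character is stored as two surrogate chars, and each would be written as its own numeric reference. The following is a minimal sketch, not the answer's code, of the same rule applied per code point. The class name `CodePointEscaper` is made up for illustration.

```java
final class CodePointEscaper {
    // Hypothetical variant of the method above: same escaping rule,
    // but the loop advances by Unicode code point instead of by char.
    static String escapeHtml(String s) {
        StringBuilder out = new StringBuilder(Math.max(16, s.length()));
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp > 127 || cp == '"' || cp == '\'' || cp == '<' || cp == '>' || cp == '&') {
                // decimal numeric character reference, e.g. &#60;
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
            // 1 for BMP characters, 2 for a surrogate pair
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeHtml("a < b & \"c\""));
        // U+1F600 is written as a surrogate pair in Java source
        System.out.println(escapeHtml("\uD83D\uDE00"));
    }
}
```

With this variant a surrogate pair comes out as a single reference to the full code point rather than as two references to the individual surrogates.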

User · Answer

Most libraries offer escaping everything they can, including hundreds of symbols and thousands of non-ASCII characters, which is not what you want in a UTF-8 world.

Also, as Jeff Williams noted, there's no single "escape HTML" option; there are several contexts.

Assuming you never use unquoted attributes, and keeping in mind that different contexts exist, I've written my own version:

    private static final long BODY_ESCAPE =
            1L << '&' | 1L << '<' | 1L << '>';
    private static final long DOUBLE_QUOTED_ATTR_ESCAPE =
            1L << '"' | 1L << '&' | 1L << '<' | 1L << '>';
    private static final long SINGLE_QUOTED_ATTR_ESCAPE =
            1L << '"' | 1L << '&' | 1L << '\'' | 1L << '<' | 1L << '>';

    // 'quot' and 'apos' are 1 char longer than '#34' and '#39' which I've decided to use
    private static final String REPLACEMENTS = "&#34;&amp;&#39;&lt;&gt;";
    private static final int REPL_SLICES = /*  [0,   5,   10,  15, 19, 23);
           */ 5<<5 | 10<<10 | 15<<15 | 19<<20 | 23<<25;
    // These 5-bit numbers packed into a single int
    // are indices within REPLACEMENTS which is a 'flat' String

    private static void appendEscaped(
            StringBuilder builder,
            CharSequence content,
            long escapes // pass BODY_ESCAPE or *_QUOTED_ATTR_ESCAPE here
    ) {
        int startIdx = 0, len = content.length();
        for (int i = 0; i < len; i++) {
            char c = content.charAt(i);
            long one;
            // First test: filter out bigger characters. Java shifts longs by the
            // 6 least significant bits of the distance only,
            // e.g. << 0b110111111 is the same as << 0b111111.
            // Second test: take only dangerous characters.
            if (((c & 63) == c) && ((one = 1L << c) & escapes) != 0) {
                // position of c among all escapable characters, in code order
                int index = Long.bitCount(SINGLE_QUOTED_ATTR_ESCAPE & (one - 1));
                builder.append(content, startIdx, i /* exclusive */)
                        .append(REPLACEMENTS,
                                (REPL_SLICES >>> (5 * index)) & 31,
                                (REPL_SLICES >>> (5 * (index + 1))) & 31);
                startIdx = i + 1;
            }
        }
        builder.append(content, startIdx, len);
    }

Consider copy-pasting from the Gist, which has no line length limit.
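The bit tricks in this answer are easier to follow in isolation. Here is a small sketch of just the lookup step: a 64-bit mask marks the escapable characters, `Long.bitCount` turns a character into its rank among them, and the rank selects a slice of the flat replacement string. The class `PackedSlicesDemo` and the method `replacementFor` are names invented for this illustration. The constants follow the ones in the answer.

```java
final class PackedSlicesDemo {
    // One bit per escapable character; all five have code points below 64.
    static final long ALL = 1L << '"' | 1L << '&' | 1L << '\'' | 1L << '<' | 1L << '>';
    static final String REPLACEMENTS = "&#34;&amp;&#39;&lt;&gt;";
    // Slice boundaries 0, 5, 10, 15, 19, 23 stored as six 5-bit fields.
    static final int REPL_SLICES = 5 << 5 | 10 << 10 | 15 << 15 | 19 << 20 | 23 << 25;

    // Returns the replacement for c, or null if c needs no escaping.
    static String replacementFor(char c) {
        // Characters >= 64 must be rejected first: the shift distance of a
        // long is taken modulo 64, so 'b' (98) would otherwise alias '"' (34).
        if ((c & 63) != c || ((1L << c) & ALL) == 0) {
            return null;
        }
        // Rank of c among the escapable characters, in code point order.
        int index = Long.bitCount(ALL & ((1L << c) - 1));
        int start = (REPL_SLICES >>> (5 * index)) & 31;
        int end = (REPL_SLICES >>> (5 * (index + 1))) & 31;
        return REPLACEMENTS.substring(start, end);
    }

    public static void main(String[] args) {
        for (char c : new char[] {'"', '&', '\'', '<', '>', 'b'}) {
            System.out.println(c + " -> " + replacementFor(c));
        }
    }
}
```

The design trades readability for a branch-free lookup with no per-call allocation in the original; a `switch` over the five characters would be the plainer alternative.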

User · Answer

For some purposes, HtmlUtils:

    import org.springframework.web.util.HtmlUtils;
    [...]
    HtmlUtils.htmlEscapeDecimal("&"); // gives &#38;
    HtmlUtils.htmlEscape("&"); // gives &amp;

User · Answer

For those who use Google Guava:

    import com.google.common.html.HtmlEscapers;
    [...]
    String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
    String escaped = HtmlEscapers.htmlEscaper().escape(source);

User · Answer

While @dfa's answer of org.apache.commons.lang.StringEscapeUtils.escapeHtml is nice, and I have used it in the past, it should not be used for escaping HTML (or XML) attributes, otherwise the whitespace will be normalized (meaning all adjacent whitespace characters become a single space).

I know this because I have had bugs filed against my library (JATL) for attributes where whitespace was not preserved. Thus I have a drop-in (copy n' paste) class (of which I stole some from JDOM) that differentiates the escaping of attributes and element content.

While proper attribute escaping may not have mattered as much in the past, it is becoming increasingly important given the use of HTML5's data- attributes.
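The drop-in class itself is not reproduced here, so the following is only a sketch of the idea, assuming the goal is to keep tabs and line breaks in attribute values from being normalized by the parser. Element content escapes the markup characters only; attribute values additionally escape the double quote and write tab, line feed and carriage return as numeric references. The class and method names are invented for this example.

```java
final class AttrVsContent {
    // Element content: only the markup characters need escaping.
    static String escapeContent(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Double-quoted attribute value: also escape the quote, and write
    // tab / LF / CR as numeric references so they are not normalized.
    static String escapeAttribute(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\t': out.append("&#9;"); break;
                case '\n': out.append("&#10;"); break;
                case '\r': out.append("&#13;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String value = "line one\n\t\"line two\"";
        System.out.println("<p data-note=\"" + escapeAttribute(value) + "\">"
                + escapeContent(value) + "</p>");
    }
}
```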

User · Answer

On Android (API 16 or greater) you can:

    Html.escapeHtml(textToEscape);

or for lower API:

    TextUtils.htmlEncode(textToEscape);

User · Answer

An alternative to Apache Commons: use Spring's HtmlUtils.htmlEscape(String input) method.

User · Answer

org.apache.commons.lang3.StringEscapeUtils is now deprecated. You must now use org.apache.commons.text.StringEscapeUtils, which comes with this dependency:

    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-text</artifactId>
        <version>${commons.text.version}</version>
    </dependency>

User · Answer

There is a newer version of the Apache Commons Lang library and it uses a different package name (org.apache.commons.lang3). The StringEscapeUtils now has different static methods for escaping different types of documents (http://commons.apache.org/proper/commons-lang/javadocs/api-3.0/index.html). So to escape an HTML version 4.0 string:

    import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;

    String output = escapeHtml4("The less than sign (<) and ampersand (&) must be escaped before using them in HTML");

User · Answer

StringEscapeUtils from Apache Commons Lang:

    import static org.apache.commons.lang.StringEscapeUtils.escapeHtml;
    // ...
    String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
    String escaped = escapeHtml(source);

For version 3:

    import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;
    // ...
    String escaped = escapeHtml4(source);

User · Answer

Be careful with this. There are a number of different 'contexts' within an HTML document: inside an element, quoted attribute value, unquoted attribute value, URL attribute, JavaScript, CSS, etc. You'll need to use a different encoding method for each of these to prevent Cross-Site Scripting (XSS). Check the OWASP XSS Prevention Cheat Sheet for details on each of these contexts. You can find escaping methods for each of these contexts in the OWASP ESAPI library -- https://github.com/ESAPI/esapi-java-legacy
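To make the point about contexts concrete, here is a rough JDK-only sketch of a value passing through two of them: a query parameter is percent-encoded for the URL context first, and the finished URL is then HTML-escaped for the quoted attribute it sits in. This is an illustration under simplifying assumptions, not a substitute for the OWASP encoders. The class name, the helper names and the `/search` path are all hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class ContextDemo {
    // HTML context (element content or quoted attribute); & is replaced first.
    static String htmlEscape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&#39;");
    }

    // URL query parameter context: percent-encoding, not HTML escaping.
    static String urlParam(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java runtime
            throw new IllegalStateException(e);
        }
    }

    // Builds a link to a hypothetical /search endpoint.
    static String link(String query, String label) {
        String href = "/search?q=" + urlParam(query) + "&lang=en";
        return "<a href=\"" + htmlEscape(href) + "\">" + htmlEscape(label) + "</a>";
    }

    public static void main(String[] args) {
        System.out.println(link("a&b <c>", "Tom & Jerry"));
    }
}
```

Applying only one of the two encodings would leave the other context unprotected: HTML escaping alone lets the `&` in the query value split the parameter, and percent-encoding alone leaves the separator `&` unescaped inside the attribute.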
