Remove HTML tags from a String

Question

Is there a good way to remove HTML from a Java string? A simple regex like

    replaceAll("\\<.*?>", "")

will work, but things like &amp; won't be converted correctly, and non-HTML between the two angle brackets will be removed (i.e. whatever the .*? in the regex matches will disappear).
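To make the two problems concrete, here is a minimal JDK-only sketch. The class and method names (NaiveStripDemo, naiveStrip) are made up for illustration, and the pattern is assumed to be the one from the question.

```java
import java.util.regex.Pattern;

final class NaiveStripDemo {
    // Assumed to be the question's pattern: anything from '<' to the next '>'.
    private static final Pattern TAG_LIKE = Pattern.compile("<.*?>");

    static String naiveStrip(String html) {
        return TAG_LIKE.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Entities are left encoded: prints "Fish &amp; chips"
        System.out.println(naiveStrip("<p>Fish &amp; chips</p>"));
        // Non-HTML between angle brackets is lost: prints "if a  c then"
        System.out.println(naiveStrip("if a < b and b > c then"));
    }
}
```

The second call shows why a plain comparison such as a < b ... > c is indistinguishable from a tag for this regex.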

User · Answer

To get formatted plain HTML text you can do this:

    String BR_ESCAPED = "&lt;br/&gt;";
    Elements el = Jsoup.parse(html).select("body");
    el.select("br").append(BR_ESCAPED);
    el.select("p").append(BR_ESCAPED + BR_ESCAPED);
    el.select("h1").append(BR_ESCAPED + BR_ESCAPED);
    el.select("h2").append(BR_ESCAPED + BR_ESCAPED);
    el.select("h3").append(BR_ESCAPED + BR_ESCAPED);
    el.select("h4").append(BR_ESCAPED + BR_ESCAPED);
    el.select("h5").append(BR_ESCAPED + BR_ESCAPED);
    String nodeValue = el.text();
    nodeValue = nodeValue.replaceAll(BR_ESCAPED, "<br/>");
    // collapse runs of three or more breaks into two
    nodeValue = nodeValue.replaceAll("(\\s*<br[^>]*>){3,}", "<br/><br/>");

To get formatted plain text, replace <br/> with \n and change the last line to:

    nodeValue = nodeValue.replaceAll("(\\s*\n){3,}", "\n\n");

User · Answer

I often find that I only need to strip out comments and script elements. This has worked reliably for me for 15 years and can easily be extended to handle any element name in HTML or XML:

    // delete all comments
    response = response.replaceAll("<!--[^>]*-->", "");
    // delete all script elements
    response = response.replaceAll("<(script|SCRIPT)[^>]*>.*?</(script|SCRIPT)>", "");
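One caveat with patterns of this kind is that comments and scripts often span several lines, and by default . does not match a newline in Java. A hedged variant using precompiled patterns with DOTALL could look like this (CommentAndScriptStripper and strip are illustrative names, not part of the answer above):

```java
import java.util.regex.Pattern;

final class CommentAndScriptStripper {
    // DOTALL lets '.' cross line breaks inside a comment or script body.
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    static String strip(String html) {
        String withoutComments = COMMENT.matcher(html).replaceAll("");
        return SCRIPT.matcher(withoutComments).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<!-- note\nspans lines -->"
                + "<SCRIPT type=\"text/javascript\">\nalert(1);\n</SCRIPT><p>Hi</p>";
        System.out.println(strip(html)); // prints "<p>Hi</p>"
    }
}
```

This is still a regex heuristic, so it should not be relied on for sanitizing untrusted input.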

User · Answer

It sounds like you want to go from HTML to plain text. If that is the case, look at www.htmlparser.org. Here is an example that strips all the tags out from the HTML file found at a URL. It makes use of org.htmlparser.beans.StringBean.

    static public String getUrlContentsAsText(String url) {
        String content = "";
        StringBean stringBean = new StringBean();
        stringBean.setURL(url);
        content = stringBean.getStrings();
        return content;
    }

User · Answer

    classeString.replaceAll("\\<(/?[^\\>]+)\\>", "\\ ").replaceAll("\\s+", " ").trim()

User · Answer

If the user enters <b>hey!</b>, do you want to display <b>hey!</b> or hey!? If the first, escape less-thans and HTML-encode ampersands (and optionally quotes), and you're fine.

A modification to your code to implement the second option would be:

    replaceAll("\\<[^>]*>", "")

but you will run into issues if the user enters something malformed, like <bhey!</b>.

You can also check out JTidy, which will parse "dirty" HTML input and should give you a way to remove the tags, keeping the text.

The problem with trying to strip HTML is that browsers have very lenient parsers, more lenient than any library you can find, so even if you do your best to strip all tags (using the replace method above, a DOM library, or JTidy), you will still need to make sure to encode any remaining HTML special characters to keep your output safe.
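The "first option" (display the markup literally) comes down to encoding a handful of characters. A minimal JDK-only sketch follows; HtmlEscapeSketch and escapeHtml are hypothetical names, and a maintained library is usually the safer choice in production.

```java
final class HtmlEscapeSketch {
    // Encodes the five characters that are significant in HTML text and attributes.
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "&lt;b&gt;hey!&lt;/b&gt;", which a browser renders as the literal text <b>hey!</b>
        System.out.println(escapeHtml("<b>hey!</b>"));
    }
}
```

Because every character is handled in a single pass, an ampersand produced by the escaping itself is never encoded a second time.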

User · Answer

One more way can be to use the com.google.gdata.util.common.html.HtmlToText class, like:

    MyWriter.toConsole(HtmlToText.htmlToPlainText(htmlResponse));

This is not bullet-proof code, though, and when I run it on Wikipedia entries I am getting style info as well. However, I believe for small, simple jobs this would be effective.

User · Answer

I know it's been a while since this question was asked, but I found another solution. This is what worked for me:

    Pattern REMOVE_TAGS = Pattern.compile("<.+?>");

    Source source = new Source(htmlAsString);
    Matcher m = REMOVE_TAGS.matcher(source.getTextExtractor().toString());
    String clearedHtml = m.replaceAll("");

User · Answer

My 5 cents:

    String[] temp = yourString.split("&amp;");
    String tmp = "";
    if (temp.length > 1) {
        for (int i = 0; i < temp.length; i++) {
            tmp += temp[i] + "&";
        }
        yourString = tmp.substring(0, tmp.length() - 1);
    }

User · Answer

HTML escaping is really hard to do right. I'd definitely suggest using library code to do this, as it's a lot more subtle than you'd think. Check out Apache's StringEscapeUtils for a pretty good library for handling this in Java.

User · Answer

Here is one more variant of how to replace all (HTML tags | HTML entities | empty space in HTML content):

    content.replaceAll("(<.*?>)|(&.*?;)|([ ]{2,})", "");

where content is a String.

User · Answer

Remove HTML tags from a string. Sometimes we need to parse a string received in a response, such as an HTTP response from the server. Here I will show how to remove HTML tags from a string (C#):

    // sample text with tags
    string str = "<html><head>sdfkashf sdf</head><body>sdfasdf</body></html>";

    // regex which matches tags
    System.Text.RegularExpressions.Regex rx = new System.Text.RegularExpressions.Regex("<[^>]*>");

    // replace all matches with an empty string
    str = rx.Replace(str, "");

    // now str contains the string without HTML tags

User · Answer

Worth noting that if you're trying to accomplish this in a ServiceStack project, it's already a built-in string extension:

    using ServiceStack.Text;
    // ...
    "The <b>quick</b> brown <p> fox </p> jumps over the lazy dog".StripHtml();

User · Answer

Another way is to use javax.swing.text.html.HTMLEditorKit to extract the text.

    import java.io.*;
    import javax.swing.text.html.*;
    import javax.swing.text.html.parser.*;

    public class Html2Text extends HTMLEditorKit.ParserCallback {
        StringBuffer s;

        public Html2Text() {
        }

        public void parse(Reader in) throws IOException {
            s = new StringBuffer();
            ParserDelegator delegator = new ParserDelegator();
            // the third parameter is TRUE to ignore charset directive
            delegator.parse(in, this, Boolean.TRUE);
        }

        public void handleText(char[] text, int pos) {
            s.append(text);
        }

        public String getText() {
            return s.toString();
        }

        public static void main(String[] args) {
            try {
                // the HTML to convert
                FileReader in = new FileReader("java-new.html");
                Html2Text parser = new Html2Text();
                parser.parse(in);
                in.close();
                System.out.println(parser.getText());
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

ref: Remove HTML tags from a file to extract only the TEXT

User · Answer

One could also use Apache Tika for this purpose. By default it preserves whitespace from the stripped HTML, which may be desired in certain situations:

    InputStream htmlInputStream = ...;
    HtmlParser htmlParser = new HtmlParser();
    HtmlContentHandler htmlContentHandler = new HtmlContentHandler();
    htmlParser.parse(htmlInputStream, htmlContentHandler, new Metadata());
    System.out.println(htmlContentHandler.getBodyText().trim());

User · Answer

The accepted answer did not work for me for the test case I indicated: the result of "a < b or b > c" is "a b or b > c".

So, I used TagSoup instead. Here's a shot that worked for my test case (and a couple of others):

    import java.io.IOException;
    import java.io.StringReader;

    import org.ccil.cowan.tagsoup.Parser;
    import org.xml.sax.Attributes;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.InputSource;
    import org.xml.sax.Locator;
    import org.xml.sax.SAXException;
    import org.xml.sax.XMLReader;

    /**
     * Take HTML and give back the text part while dropping the HTML tags.
     *
     * There is some risk that using TagSoup means we'll permute non-HTML text.
     * However, it seems to work the best so far in test cases.
     *
     * @author dan
     * @see <a href="http://home.ccil.org/~cowan/XML/tagsoup/">TagSoup</a>
     */
    public class Html2Text2 implements ContentHandler {
        private StringBuffer sb;

        public Html2Text2() {
        }

        public void parse(String str) throws IOException, SAXException {
            XMLReader reader = new Parser();
            reader.setContentHandler(this);
            sb = new StringBuffer();
            reader.parse(new InputSource(new StringReader(str)));
        }

        public String getText() {
            return sb.toString();
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            // append only the slice of the buffer the parser reports
            sb.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length)
                throws SAXException {
            sb.append(ch, start, length);
        }

        // The methods below do not contribute to the text
        @Override
        public void endDocument() throws SAXException {
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
        }

        @Override
        public void endPrefixMapping(String prefix) throws SAXException {
        }

        @Override
        public void processingInstruction(String target, String data)
                throws SAXException {
        }

        @Override
        public void setDocumentLocator(Locator locator) {
        }

        @Override
        public void skippedEntity(String name) throws SAXException {
        }

        @Override
        public void startDocument() throws SAXException {
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                Attributes atts) throws SAXException {
        }

        @Override
        public void startPrefixMapping(String prefix, String uri)
                throws SAXException {
        }
    }

User · Answer

Alternatively, one can use HtmlCleaner:

    private CharSequence removeHtmlFrom(String html) {
        return new HtmlCleaner().clean(html).getText();
    }

User · Answer

If you're writing for Android you can do this:

    androidx.core.text.HtmlCompat.fromHtml(instruction, HtmlCompat.FROM_HTML_MODE_LEGACY).toString()

User · Answer

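Here is another way to do it. It walks the string once and keeps only the characters that fall outside <...>:

    public static String removeHTML(String input) {
        StringBuilder s = new StringBuilder();
        boolean inTag = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '<') {
                inTag = true;          // entering a tag
            } else if (c == '>' && inTag) {
                inTag = false;         // leaving the tag
            } else if (!inTag) {
                s.append(c);           // text outside any tag
            }
        }
        return s.toString();
    }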

User · Answer

On Android, try this:

    String result = Html.fromHtml(html).toString();

User · Answer

I know this is old, but I was just working on a project that required me to filter HTML and this worked fine:

    noHTMLString.replaceAll("\\&.*?\\;", "");

instead of this:

    html = html.replaceAll("&nbsp;", "");
    html = html.replaceAll("&amp;", "");

User · Answer

You can simply use Android's default HTML filter:

    public String htmlToStringFilter(String textToFilter) {
        return Html.fromHtml(textToFilter).toString();
    }

The above method will return the HTML-filtered string for your input.

User · Answer

Try this for JavaScript:

    const strippedString = htmlString.replace(/(<([^>]+)>)/gi, "");
    console.log(strippedString);

User · Answer

You might want to replace <br/> and </p> tags with newlines before stripping the HTML to prevent it becoming an illegible mess, as Tim suggests.

The only way I can think of removing HTML tags but leaving non-HTML between angle brackets would be to check against a list of HTML tags. Something along these lines:

    replaceAll("\\<[\\s]*tag[^>]*>", "")

Then HTML-decode special characters such as &amp;. The result should not be considered to be sanitized.
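A sketch of the tag-list idea, using only the JDK. The list of tag names here is a deliberately short, hypothetical sample, and KnownTagStripper and stripKnownTags are made-up names. The character class excludes '<' as well as '>' so that a stray less-than sign cannot swallow a later real tag.

```java
import java.util.regex.Pattern;

final class KnownTagStripper {
    // Hypothetical sample of tag names; extend to whatever set you need.
    private static final String TAG_NAMES = "a|b|i|u|p|br|div|span|em|strong";

    // Opening or closing tag whose name is in the list, with optional attributes.
    private static final Pattern KNOWN_TAG = Pattern.compile(
            "</?(?:" + TAG_NAMES + ")\\b[^<>]*>", Pattern.CASE_INSENSITIVE);

    static String stripKnownTags(String html) {
        return KNOWN_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // prints "1 < 2 and bold text <foo> stays"
        System.out.println(stripKnownTags("1 < 2 and <b>bold</b> text <foo> stays"));
    }
}
```

Text that happens to look like a listed tag (for example "<b and c>") is still removed, so this remains a heuristic and not a sanitizer.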

User · Answer

Use Html.fromHtml.

The supported HTML tags are:

    <a href="...">  <b>  <big>  <blockquote>  <br>  <cite>  <dfn>
    <div align="...">  <em>  <font size="..." color="..." face="...">
    <h1>  <h2>  <h3>  <h4>  <h5>  <h6>
    <i>  <p>  <small>  <strike>  <strong>  <sub>  <sup>  <tt>  <u>

As per Android's official documentation, any <img> tags in the HTML will display as a generic replacement image, which your program can then go through and replace with real images.

An overload of Html.fromHtml takes an Html.ImageGetter and an Html.TagHandler as arguments, as well as the text to parse.

Example:

    String Str_Html = " <p>This is about me text that the user can put into their profile</p> ";

Then:

    Your_TextView_Obj.setText(Html.fromHtml(Str_Html).toString());

Output: This is about me text that the user can put into their profile

User · Answer

This should work.

Use this:

    text.replaceAll("<.*?>", " ");

This will replace all the HTML tags with a space. And this:

    text.replaceAll("&.*?;", "");

This will remove everything that starts with "&" and ends with ";", like &nbsp;, &amp;, &gt; etc.
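Deleting entities outright also drops the characters they stand for (an &amp; in the source becomes nothing, not "&"). If the goal is readable plain text, decoding may be preferable. Below is a minimal JDK-only sketch that handles a handful of named entities plus numeric ones. The names EntityDecoderSketch and decode are hypothetical, and the entity table is intentionally tiny; a library such as Apache's StringEscapeUtils covers the full set.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntityDecoderSketch {
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("apos", "'");
        // Design choice: a plain space for readability (the real character is U+00A0).
        NAMED.put("nbsp", " ");
    }

    // Matches &name; or decimal &#65; or hexadecimal &#x41;
    private static final Pattern ENTITY = Pattern.compile(
            "&(#x[0-9a-fA-F]{1,6}|#[0-9]{1,7}|[A-Za-z][A-Za-z0-9]*);");

    static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x")) {
                replacement = fromCodePoint(Integer.parseInt(body.substring(2), 16), m.group());
            } else if (body.startsWith("#")) {
                replacement = fromCodePoint(Integer.parseInt(body.substring(1)), m.group());
            } else {
                // Unknown names are left exactly as they were.
                replacement = NAMED.getOrDefault(body, m.group());
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String fromCodePoint(int codePoint, String fallback) {
        return Character.isValidCodePoint(codePoint)
                ? new String(Character.toChars(codePoint))
                : fallback;
    }

    public static void main(String[] args) {
        // prints "Fish & chips <3 AB &bogus;"
        System.out.println(decode("Fish &amp; chips&nbsp;&lt;3 &#65;&#x42; &bogus;"));
    }
}
```

The input is decoded in a single pass, so "&amp;lt;" becomes the text "&lt;" and is not decoded a second time into "<".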

User · Answer

The accepted answer of doing simply Jsoup.parse(html).text() has 2 potential issues (with JSoup 1.7.3):

  - It removes line breaks from the text.
  - It converts the text &lt;script&gt; into <script>.

If you use this to protect against XSS, this is a bit annoying. Here is my best shot at an improved solution, using both JSoup and Apache StringEscapeUtils:

    // breaks multi-level escaping, preventing &amp;lt;script&amp;gt; from being rendered as <script>
    String replace = input.replace("&amp;", "");
    // decode any encoded html, preventing &lt;script&gt; from being rendered as <script>
    String html = StringEscapeUtils.unescapeHtml(replace);
    // remove all html tags, but maintain line breaks
    String clean = Jsoup.clean(html, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
    // decode html again to convert character entities back into text
    return StringEscapeUtils.unescapeHtml(clean);

Note that the last step is there because I need to use the output as plain text. If you need only HTML output then you should be able to remove it.

And here is a bunch of test cases (input to output):

    {"regular string", "regular string"},
    {"<a href=\"link\">A link</a>", "A link"},
    {"<script src=\"http://evil.url.com\"/>", ""},
    {"&lt;script&gt;", ""},
    {"&amp;lt;script&amp;gt;", "lt;scriptgt;"}, // best effort
    {"> < \n [...] and & preserved", "> < \n [...] and & preserved"}

If you find a way to make it better, please let me know.

User · Answer

Also very simple using Jericho, and you can retain some of the formatting (line breaks and links, for example):

    Source htmlSource = new Source(htmlText);
    Segment htmlSeg = new Segment(htmlSource, 0, htmlSource.length());
    Renderer htmlRend = new Renderer(htmlSeg);
    System.out.println(htmlRend.toString());

User · Answer

I think that the simplest way to filter the HTML tags is:

    private static final Pattern REMOVE_TAGS = Pattern.compile("<.+?>");

    public static String removeTags(String string) {
        if (string == null || string.length() == 0) {
            return string;
        }

        Matcher m = REMOVE_TAGS.matcher(string);
        return m.replaceAll("");
    }

User · Answer

You can simply make a method with multiple replaceAll() calls, like:

    String RemoveTag(String html) {
        html = html.replaceAll("\\<.*?>", "");
        html = html.replaceAll("&nbsp;", "");
        html = html.replaceAll("&amp;", "");
        // ----------
        // ----------
        return html;
    }

Use this link for the most common replacements you need: http://tunes.org/wiki/html_20special_20characters_20and_20symbols.html

It is simple but effective. I use this method first to remove the junk, but not the very first line, i.e. replaceAll("\\<.*?>", ""). Later I use specific keywords to search for indexes and then use the .substring(start, end) method to strip away unnecessary stuff. This is more robust, and you can pinpoint exactly what you need in the entire HTML page.

User · Answer

Sometimes the HTML string comes from XML, with escaped sequences such as &lt;. When using Jsoup we need to parse it and then clean it:

    Document doc = Jsoup.parse(htmlstrl);
    Whitelist wl = Whitelist.none();
    String plain = Jsoup.clean(doc.text(), wl);

Using only Jsoup.parse(htmlstrl).text() can't remove the tags in that case.

User · Answer

Use an HTML parser instead of regex. This is dead simple with Jsoup.

    public static String html2text(String html) {
        return Jsoup.parse(html).text();
    }

Jsoup also supports removing HTML tags against a customizable whitelist, which is very useful if you want to allow only e.g. <b>, <i> and <u>.

See also:

  - RegEx match open tags except XHTML self-contained tags
  - What are the pros and cons of the leading Java HTML parsers?
  - XSS prevention in JSP/Servlet web application

User · Answer

One way to retain new-line info with JSoup is to precede all new-line tags with some dummy string, execute JSoup, and replace the dummy string with "\n".

    String html = "<p>Line one</p><p>Line two</p>Line three<br/>etc.";
    String NEW_LINE_MARK = "NEWLINESTART1234567890NEWLINEEND";
    for (String tag : new String[]{"</p>", "<br/>", "</h1>", "</h2>", "</h3>", "</h4>", "</h5>", "</h6>", "</li>"}) {
        html = html.replace(tag, NEW_LINE_MARK + tag);
    }

    String text = Jsoup.parse(html).text();

    text = text.replace(NEW_LINE_MARK + " ", "\n\n");
    text = text.replace(NEW_LINE_MARK, "\n\n");
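The same "turn block-ending tags into line breaks first" idea can be sketched without a parser, swapping Jsoup for the regex stripping used elsewhere in this thread. This is only a JDK-only illustration under that substitution; LineBreakPreservingStrip and toText are hypothetical names, and a single newline is used per break here.

```java
import java.util.regex.Pattern;

final class LineBreakPreservingStrip {
    // <br>, <br/> and the closing tags of common block elements become line breaks.
    private static final Pattern BREAK_TAGS =
            Pattern.compile("(?i)<br\\s*/?>|</(?:p|h[1-6]|li)\\s*>");
    // Anything else that looks like a tag is dropped.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    static String toText(String html) {
        String withBreaks = BREAK_TAGS.matcher(html).replaceAll("\n");
        return ANY_TAG.matcher(withBreaks).replaceAll("").trim();
    }

    public static void main(String[] args) {
        String html = "<p>Line one</p><p>Line two</p>Line three<br/>etc.";
        // prints four lines: "Line one", "Line two", "Line three", "etc."
        System.out.println(toText(html));
    }
}
```

Unlike the Jsoup version, this does not decode entities or cope with malformed markup, so it is only suitable for simple, trusted input.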

User · Answer

Here's a slightly more fleshed out update that tries to handle some formatting for breaks and lists. I used Amaya's output as a guide.

    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;
    import java.util.Stack;
    import java.util.logging.Logger;

    import javax.swing.text.MutableAttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;

    public class HTML2Text extends HTMLEditorKit.ParserCallback {
        private static final Logger log = Logger
                .getLogger(Logger.GLOBAL_LOGGER_NAME);

        private StringBuffer stringBuffer;

        // one entry per open list or list item; its depth sets the indentation
        private Stack<IndexType> indentStack;

        public static class IndexType {
            public String type;
            public int counter; // used for ordered lists

            public IndexType(String type) {
                this.type = type;
                counter = 0;
            }
        }

        public HTML2Text() {
            stringBuffer = new StringBuffer();
            indentStack = new Stack<IndexType>();
        }

        public static String convert(String html) {
            HTML2Text parser = new HTML2Text();
            Reader in = new StringReader(html);
            try {
                // the HTML to convert
                parser.parse(in);
            } catch (Exception e) {
                log.severe(e.getMessage());
            } finally {
                try {
                    in.close();
                } catch (IOException ioe) {
                    // this should never happen
                }
            }
            return parser.getText();
        }

        public void parse(Reader in) throws IOException {
            ParserDelegator delegator = new ParserDelegator();
            // the third parameter is TRUE to ignore charset directive
            delegator.parse(in, this, Boolean.TRUE);
        }

        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            log.info("StartTag:" + t.toString());
            if (t.toString().equals("p")) {
                if (stringBuffer.length() > 0
                        && !stringBuffer.substring(stringBuffer.length() - 1)
                                .equals("\n")) {
                    newLine();
                }
                newLine();
            } else if (t.toString().equals("ol")) {
                indentStack.push(new IndexType("ol"));
                newLine();
            } else if (t.toString().equals("ul")) {
                indentStack.push(new IndexType("ul"));
                newLine();
            } else if (t.toString().equals("li")) {
                IndexType parent = indentStack.peek();
                if (parent.type.equals("ol")) {
                    String numberString = "" + (++parent.counter) + ".";
                    stringBuffer.append(numberString);
                    // pad the number out to a four-character column
                    for (int i = 0; i < (4 - numberString.length()); i++) {
                        stringBuffer.append(" ");
                    }
                } else {
                    stringBuffer.append("*   ");
                }
                indentStack.push(new IndexType("li"));
            } else if (t.toString().equals("dl")) {
                newLine();
            } else if (t.toString().equals("dt")) {
                newLine();
            } else if (t.toString().equals("dd")) {
                indentStack.push(new IndexType("dd"));
                newLine();
            }
        }

        private void newLine() {
            stringBuffer.append("\n");
            for (int i = 0; i < indentStack.size(); i++) {
                stringBuffer.append("    ");
            }
        }

        public void handleEndTag(HTML.Tag t, int pos) {
            log.info("EndTag:" + t.toString());
            if (t.toString().equals("p")) {
                newLine();
            } else if (t.toString().equals("ol")) {
                indentStack.pop();
                newLine();
            } else if (t.toString().equals("ul")) {
                indentStack.pop();
                newLine();
            } else if (t.toString().equals("li")) {
                indentStack.pop();
                newLine();
            } else if (t.toString().equals("dd")) {
                indentStack.pop();
            }
        }

        public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            log.info("SimpleTag:" + t.toString());
            if (t.toString().equals("br")) {
                newLine();
            }
        }

        public void handleText(char[] text, int pos) {
            log.info("Text:" + new String(text));
            stringBuffer.append(text);
        }

        public String getText() {
            return stringBuffer.toString();
        }

        public static void main(String[] args) {
            String html = "<html><body><p>paragraph at start</p>hello<br />What is happening?"
                    + "<p>this is a<br />multiline paragraph</p>"
                    + "<ol> <li>This</li> <li>is</li> <li>an</li> <li>ordered</li> <li>list"
                    + " <p>with</p>"
                    + " <ul> <li>another</li> <li>list"
                    + " <dl> <dt>This</dt> <dt>is</dt> <dd>sdasd</dd> <dd>sdasda</dd>"
                    + " <dd>asda <p>aasdas</p> </dd> <dd>sdada</dd> <dt>fsdfsdfsd</dt> </dl>"
                    + " <dl> <dt>vbcvcvbcvb</dt> <dt>cvbcvbc</dt> <dd>vbcbcvbcvb</dd>"
                    + " <dt>cvbcv</dt> <dt></dt> </dl>"
                    + " <dl> <dt></dt> </dl></li>"
                    + " <li>cool</li> </ul>"
                    + " <p>stuff</p> </li> <li>cool</li></ol><p></p></body></html>";
            System.out.println(convert(html));
        }
    }

User · Answer

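Sometimes the HTML string comes from XML, with entities such as &lt;. When using Jsoup we need to parse it and then clean it:

    Document doc = Jsoup.parse(htmlstrl);
    Whitelist wl = Whitelist.none();
    String plain = Jsoup.clean(doc.text(), wl);

Using only Jsoup.parse(htmlstrl).text() can't remove the tags in that case, because the escaped markup only turns into tags after the first parse. (In recent Jsoup releases the Whitelist class is named Safelist.)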

User · Answer

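Here is another way to do it, scanning the string one character at a time and skipping everything between < and >:

    public static String removeHTML(String input) {
        StringBuilder s = new StringBuilder();
        boolean inTag = false;

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '<') {
                inTag = true;           // start skipping
            } else if (c == '>' && inTag) {
                inTag = false;          // tag closed, resume copying
            } else if (!inTag) {
                s.append(c);
            }
        }
        return s.toString();
    }

Note that every < is treated as the start of a tag, so plain text such as a < b loses whatever follows the < up to the next >.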

User · Answer

You can simply make a method with multiple replaceAll() calls, like:

    String RemoveTag(String html) {
        html = html.replaceAll("\\<.*?>", "");
        html = html.replaceAll("&nbsp;", "");
        html = html.replaceAll("&amp;", "");
        // ... further replacements as needed ...
        return html;
    }

Use this link for the most common replacements you need: http://tunes.org/wiki/html_20special_20characters_20and_20symbols.html

It is simple but effective. I use this method first to remove the junk, but not the very first line, i.e. replaceAll("\\<.*?>", ""). Later I use specific keywords to search for indexes and then use the .substring(start, end) method to strip away unnecessary stuff. This is more robust, and you can pinpoint exactly what you need in the entire HTML page.
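The replaceAll() chain above deletes entities rather than translating them. A minimal sketch of decoding them instead, using only the standard library, might look like the following. The class name EntityDecoder and its small entity table are illustrative assumptions, not a complete list; a library such as Apache's StringEscapeUtils covers the full set.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class EntityDecoder {
    // Illustrative subset only; HTML defines many more named entities.
    // &nbsp; is mapped to a plain space here to keep the output readable.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "nbsp", " ");

    // Matches the &name;, &#123; and &#x1F; forms.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                replacement = fromCodePoint(body.substring(2), 16, m.group());
            } else if (body.startsWith("#")) {
                replacement = fromCodePoint(body.substring(1), 10, m.group());
            } else {
                // Unknown names are left untouched rather than deleted.
                replacement = NAMED.getOrDefault(body, m.group());
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String fromCodePoint(String digits, int radix, String fallback) {
        try {
            int cp = Integer.parseInt(digits, radix);
            return Character.isValidCodePoint(cp)
                    ? new String(Character.toChars(cp))
                    : fallback;
        } catch (NumberFormatException e) {
            return fallback; // too large to parse; keep the original text
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("Tom &amp; Jerry&nbsp;&lt;3 &#65;&#x42;"));
    }
}
```

Because the text is decoded in a single pass, an input like &amp;lt; becomes &lt; and is not decoded a second time, which is easy to get wrong when chaining separate replaceAll() calls.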

User · Answer

You might want to replace <br/> and </p> tags with newlines before stripping the HTML, to prevent it becoming an illegible mess, as Tim suggests.

The only way I can think of to remove HTML tags but leave non-HTML between angle brackets would be to check against a list of HTML tags. Something along these lines, once per tag name:

    replaceAll("\\<[\\s]*tag[^>]*>", "")

Then HTML-decode special characters such as &amp;. The result should not be considered to be sanitized.
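The list-of-tags idea can be sketched with a single alternation built from known element names. The names KnownTagStripper and stripKnownTags and the short tag list are illustrative assumptions. This sketch also does not allow whitespace between < and the tag name, so that text such as a < b is left intact. It does not handle a > inside an attribute value, and the output is still not sanitized.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class KnownTagStripper {
    // Illustrative subset; extend with whichever element names you expect.
    private static final List<String> TAGS = List.of(
            "html", "body", "div", "span", "p", "br", "b", "i", "u", "a",
            "ul", "ol", "li", "h1", "h2", "h3", "h4", "h5", "h6");

    // "<", optional "/", a known name, optional attributes, optional "/", ">".
    private static final Pattern KNOWN_TAG = Pattern.compile(
            "</?(?:" + String.join("|", TAGS) + ")(?:\\s[^>]*)?/?>",
            Pattern.CASE_INSENSITIVE);

    static String stripKnownTags(String html) {
        return KNOWN_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // prints: a < b or b > cHello world <foo>
        System.out.println(
                stripKnownTags("<p>a < b or b > c</p>Hello <b>world</b> <foo>"));
    }
}
```

Anything in angle brackets that does not start with a listed name, such as <foo> above, passes through unchanged.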

User · Answer

I know it's been a while since this question was asked, but I found another solution. This is what worked for me:

    Pattern REMOVE_TAGS = Pattern.compile("<.+?>");

    Source source = new Source(htmlAsString);
    Matcher m = REMOVE_TAGS.matcher(source.getTextExtractor().toString());
    String clearedHtml = m.replaceAll("");

User · Answer

The accepted answer did not work for me for the test case I indicated: the result of "a < b or b > c" is "a b or b > c".

So, I used TagSoup instead. Here's a shot that worked for my test case (and a couple of others):

    import java.io.IOException;
    import java.io.StringReader;

    import org.ccil.cowan.tagsoup.Parser;
    import org.xml.sax.Attributes;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.InputSource;
    import org.xml.sax.Locator;
    import org.xml.sax.SAXException;
    import org.xml.sax.XMLReader;

    /**
     * Take HTML and give back the text part while dropping the HTML tags.
     *
     * There is some risk that using TagSoup means we'll permute non-HTML text.
     * However, it seems to work the best so far in test cases.
     *
     * @author dan
     * @see <a href="http://home.ccil.org/~cowan/XML/tagsoup/">TagSoup</a>
     */
    public class Html2Text2 implements ContentHandler {
        private StringBuffer sb;

        public Html2Text2() {
        }

        public void parse(String str) throws IOException, SAXException {
            XMLReader reader = new Parser();
            reader.setContentHandler(this);
            sb = new StringBuffer();
            reader.parse(new InputSource(new StringReader(str)));
        }

        public String getText() {
            return sb.toString();
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            for (int idx = 0; idx < length; idx++) {
                sb.append(ch[idx + start]);
            }
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length)
                throws SAXException {
            // append only the reported range, not the whole buffer
            sb.append(ch, start, length);
        }

        // The methods below do not contribute to the text
        @Override
        public void endDocument() throws SAXException {
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
        }

        @Override
        public void endPrefixMapping(String prefix) throws SAXException {
        }

        @Override
        public void processingInstruction(String target, String data)
                throws SAXException {
        }

        @Override
        public void setDocumentLocator(Locator locator) {
        }

        @Override
        public void skippedEntity(String name) throws SAXException {
        }

        @Override
        public void startDocument() throws SAXException {
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                Attributes atts) throws SAXException {
        }

        @Override
        public void startPrefixMapping(String prefix, String uri)
                throws SAXException {
        }
    }

User · Answer

One could also use Apache Tika for this purpose. By default it preserves whitespace from the stripped HTML, which may be desired in certain situations:

    InputStream htmlInputStream = ..
    HtmlParser htmlParser = new HtmlParser();
    HtmlContentHandler htmlContentHandler = new HtmlContentHandler();
    htmlParser.parse(htmlInputStream, htmlContentHandler, new Metadata());
    System.out.println(htmlContentHandler.getBodyText().trim());

User · Answer

This should work. Use this:

    text = text.replaceAll("<.*?>", " ");

This will replace all the HTML tags with a space. And this:

    text = text.replaceAll("&.*?;", "");

This will remove all the entities that start with "&" and end with ";", like &nbsp;, &amp;, &gt; etc.

User · Answer

Another way is to use javax.swing.text.html.HTMLEditorKit to extract the text.

    import java.io.*;
    import javax.swing.text.html.*;
    import javax.swing.text.html.parser.*;

    public class Html2Text extends HTMLEditorKit.ParserCallback {
        StringBuffer s;

        public Html2Text() {
        }

        public void parse(Reader in) throws IOException {
            s = new StringBuffer();
            ParserDelegator delegator = new ParserDelegator();
            // the third parameter is TRUE to ignore charset directive
            delegator.parse(in, this, Boolean.TRUE);
        }

        public void handleText(char[] text, int pos) {
            s.append(text);
        }

        public String getText() {
            return s.toString();
        }

        public static void main(String[] args) {
            try {
                // the HTML to convert
                FileReader in = new FileReader("java-new.html");
                Html2Text parser = new Html2Text();
                parser.parse(in);
                in.close();
                System.out.println(parser.getText());
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

ref: Remove HTML tags from a file to extract only the TEXT
