Regular expression to remove HTML tags from a string

Question

Possible Duplicate: Regular expression to remove HTML tags

Is there an expression which will get the value between two HTML tags?

Given this:

    <td class="played">0</td>

I am looking for an expression which will return 0, stripping the <td> tags.

User · Answer

You could do it with jsoup (http://jsoup.org/):

    Whitelist whitelist = Whitelist.none();
    String cleanStr = Jsoup.clean(yourText, whitelist);

User · Answer

You should not attempt to parse HTML with regex. HTML is not a regular language, so any regex you come up with will likely fail on some esoteric edge case. Please refer to the seminal answer to this question for specifics. While mostly formatted as a joke, it makes a very good point.

The following examples are Java, but the regex will be similar -- if not identical -- for other languages.

    String target = someString.replaceAll("<[^>]*>", "");

This assumes that your non-HTML text does not contain any < or > and that your input string is correctly structured.

If you know the tags are of a specific kind -- for example, you know the text contains only <td> tags -- you could do something like this:

    String target = someString.replaceAll("(?i)</?td[^>]*>", "");

Edit: Omega brought up a good point in a comment on another post: this squishes the results together if there are multiple tags.

For example, if the input string were <td>Something</td><td>Another Thing</td>, then the above would result in SomethingAnother Thing.

In a situation where multiple tags are expected, we could do something like:

    String target = someString.replaceAll("(?i)</?td[^>]*>", " ").replaceAll("\\s+", " ").trim();

This replaces each tag with a single space, then collapses runs of whitespace, and then trims any whitespace on the ends.
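The replacements described above can be put together into a small runnable sketch. The class name `TagStripper` and its method names are made up for illustration, and it uses the generic `<[^>]*>` pattern rather than the `td`-specific one. It carries the same assumptions as the answer: well-formed input and no stray < or > in the text.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not from any library.
final class TagStripper {
    // Matches anything that looks like a tag: '<', a run of non-'>' characters, then '>'.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Deletes every tag-like span outright, so adjacent cell values run together.
    static String stripTags(String html) {
        return ANY_TAG.matcher(html).replaceAll("");
    }

    // Replaces each tag with a space, collapses whitespace runs, and trims the ends.
    static String stripTagsKeepingGaps(String html) {
        String spaced = ANY_TAG.matcher(html).replaceAll(" ");
        return WHITESPACE.matcher(spaced).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<td class=\"played\">0</td>"));
        System.out.println(stripTags("<td>Something</td><td>Another Thing</td>"));
        System.out.println(stripTagsKeepingGaps("<td>Something</td><td>Another Thing</td>"));
    }
}
```

Precompiling the patterns avoids recompiling the regex on every call, which `String.replaceAll` would otherwise do. Behavior is the same either way.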

User · Answer

A trivial approach would be to replace <[^>]*> with nothing. But depending on how ill-structured your input is, that may well fail.
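To make the "may well fail" caveat concrete, here is a minimal sketch with two inputs that trip up the trivial replacement. The class name `NaiveStripPitfalls` and the sample strings are invented for illustration. They are not from the question.

```java
// Hypothetical demo class; the inputs are made-up examples of awkward markup.
final class NaiveStripPitfalls {
    // The trivial approach: delete every "<...>" span.
    static String naiveStrip(String input) {
        return input.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Well-formed cell: works as intended.
        System.out.println(naiveStrip("<td class=\"played\">0</td>"));
        // A '>' inside an attribute value ends the match early and leaks markup.
        System.out.println(naiveStrip("<td title=\"a > b\">0</td>"));
        // Plain text using '<' and '>' as comparison signs loses real content.
        System.out.println(naiveStrip("1 < 2 and 3 > 2"));
    }
}
```

The second call leaves the fragment ` b">0` behind, and the third swallows everything between the two comparison signs. Failures like these are the reason to use an HTML parser when the input is not under your control.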
