Using C# regular expressions to remove HTML tags

Question

How do I use a C# regular expression to replace or remove all HTML tags, including the angle brackets? Can someone please help me with the code?

User · Answer

Add .+? to the pattern

    <[^>]*>

and try this regex (based on it):

    <[^>].+?>

User · Answer

    Regex regex = new Regex(@"</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>", RegexOptions.Singleline);

User · Answer

I would like to echo Jason's response, though sometimes you need to naively parse some HTML and pull out the text content. I needed to do this with some HTML which had been created by a rich text editor, always fun and games.

In this case you may need to remove the content of some tags as well as just the tags themselves. In my case, <xml> and <style> tags were thrown into the mix. Someone may find my (very slightly) less naive implementation a useful starting point:

    /// <summary>
    /// Removes all HTML tags from a string and leaves only plain text.
    /// Removes the content of &lt;xml&gt;&lt;/xml&gt; and &lt;style&gt;&lt;/style&gt; tags, as the aim is to get text content, not markup or metadata.
    /// </summary>
    /// <param name="input"></param>
    /// <returns></returns>
    public static string HtmlStrip(this string input)
    {
        input = Regex.Replace(input, "<style>(.|\n)*?</style>", string.Empty);
        input = Regex.Replace(input, @"<xml>(.|\n)*?</xml>", string.Empty); // remove all <xml></xml> tags and anything in between
        return Regex.Replace(input, @"<(.|\n)*?>", string.Empty); // remove any tags but not their content: "<p>bob<span> johnson</span></p>" becomes "bob johnson"
    }
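The patterns above only match the exact opening tags <style> and <xml>, so a tag written in upper case or carrying attributes would slip through the first two replacements. A minimal sketch of a more tolerant variant follows; the class name TolerantHtmlStrip and the method name Strip are made up for illustration and are not part of the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (names are illustrative, not from the original answer).
public static class TolerantHtmlStrip
{
    // Matches a whole <style>...</style> or <xml>...</xml> element, including
    // any attributes on the opening tag. \1 requires the closing tag name to
    // repeat the opening one; IgnoreCase makes both comparisons case-insensitive.
    private static readonly Regex ElementsWithContent = new Regex(
        @"<(style|xml)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any remaining tag; Singleline lets "." cross line breaks,
    // which plays the same role as (.|\n) in the answer above.
    private static readonly Regex AnyTag = new Regex(
        @"<.*?>", RegexOptions.Singleline);

    public static string Strip(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return string.Empty;
        }

        // First drop the elements whose content is not wanted, then the tags.
        string withoutBlocks = ElementsWithContent.Replace(input, string.Empty);
        return AnyTag.Replace(withoutBlocks, string.Empty);
    }
}
```

With this sketch, TolerantHtmlStrip.Strip("<STYLE type=\"text/css\">p { color: red; }</STYLE><p>hi</p>") returns "hi". It is still a naive approach and shares the general limitations of regex-based stripping discussed in the other answers.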

User · Answer

As often stated before, you should not use regular expressions to process XML or HTML documents. They do not perform very well with HTML and XML documents, because there is no way to express nested structures in a general way.

You could use the following:

    String result = Regex.Replace(htmlDocument, @"<[^>]*>", String.Empty);

This will work for most cases, but there will be cases (for example, CDATA containing angle brackets) where this will not work as expected.
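To make the limitation concrete, here is a small sketch that wraps the same pattern in a method so it can be tried on a well-behaved fragment and on a CDATA section. The class name SimpleTagStripDemo and the method name StripTags are assumptions chosen for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the pattern from the answer above.
public static class SimpleTagStripDemo
{
    public static string StripTags(string html)
    {
        // "<", then any run of characters that are not ">", then ">".
        return Regex.Replace(html, @"<[^>]*>", string.Empty);
    }
}
```

StripTags("<p>Hello <b>world</b></p>") returns "Hello world", as intended. StripTags("<![CDATA[a > b]]>") returns " b]]>", because the match ends at the first ">" it meets, which here sits inside the CDATA content rather than at the end of the markup.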

User · Answer

The correct answer is: don't do that; use the HTML Agility Pack.

Edited to add:

To shamelessly steal from the comment below by jesse, and to avoid being accused of inadequately answering the question after all this time, here's a simple, reliable snippet using the HTML Agility Pack that works with even the most imperfectly formed, capricious bits of HTML:

    // Requires the HtmlAgilityPack package, plus System.Linq, System.Text and System.Web.
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(Properties.Resources.HtmlContents);
    // Select every text node under <body>.
    var text = doc.DocumentNode.SelectNodes("//body//text()").Select(node => node.InnerText);
    StringBuilder output = new StringBuilder();
    foreach (string line in text)
    {
        output.AppendLine(line);
    }
    // Turn entities such as &amp; back into plain characters.
    string textOnly = HttpUtility.HtmlDecode(output.ToString());

There are very few defensible cases for using a regular expression for parsing HTML, as HTML can't be parsed correctly without a context-awareness that's very painful to provide even in a nontraditional regex engine. You can get part way there with a regex, but you'll need to do manual verifications.

Html Agility Pack can provide you a robust solution that will reduce the need to manually fix up the aberrations that can result from naively treating HTML as a context-free grammar.

A regular expression may get you mostly what you want most of the time, but it will fail on very common cases. If you can find a better or faster parser than HTML Agility Pack, go for it, but please don't subject the world to more broken HTML hackery.

User · Answer

Try the regular expression method at this URL: http://www.dotnetperls.com/remove-html-tags

    /// <summary>
    /// Remove HTML from string with Regex.
    /// </summary>
    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    /// <summary>
    /// Compiled regular expression for performance.
    /// </summary>
    static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

    /// <summary>
    /// Remove HTML from string with compiled Regex.
    /// </summary>
    public static string StripTagsRegexCompiled(string source)
    {
        return _htmlRegex.Replace(source, string.Empty);
    }

User · Answer

The question is too broad to be answered definitively. Are you talking about removing all tags from a real-world HTML document, like a web page? If so, you would have to:

- remove the <!DOCTYPE declaration or <?xml prolog if they exist
- remove all SGML comments
- remove the entire HEAD element
- remove all SCRIPT and STYLE elements
- do Grabthar-knows-what with FORM and TABLE elements
- remove the remaining tags
- remove the <![CDATA[ and ]]> sequences from CDATA sections but leave their contents alone

That's just off the top of my head--I'm sure there's more. Once you've done all that, you'll end up with words, sentences and paragraphs run together in some places, and big chunks of useless whitespace in others.

But, assuming you're working with just a fragment and you can get away with simply removing all tags, here's the regex I would use:

    @"(?></?\w+)(?>(?:[^>'""]+|'[^']*'|""[^""]*"")*)>"

Matching single- and double-quoted strings in their own alternatives is sufficient to deal with the problem of angle brackets in attribute values. I don't see any need to explicitly match the attribute names and other stuff inside the tag, like the regex in Ryan's answer does; the first alternative handles all of that.

In case you're wondering about those (?>...) constructs, they're atomic groups. They make the regex a little more efficient, but more importantly, they prevent runaway backtracking, which is something you should always watch out for when you mix alternation and nested quantifiers as I've done. I don't really think that would be a problem here, but I know if I don't mention it, someone else will. ;-)

This regex isn't perfect, of course, but it's probably as good as you'll ever need.
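As a rough illustration of the quoted-attribute handling described above, the sketch below applies that regex to a fragment whose attribute value contains an angle bracket. The class name QuoteAwareTagStrip and the method name StripTags are assumptions made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the regex from the answer above.
public static class QuoteAwareTagStrip
{
    // (?></?\w+)  atomic group: "<", optional "/", then the tag name
    // (?>(?:...)*) atomic group: any mix of unquoted text, '...' and "..."
    // >           the closing angle bracket of the tag
    private static readonly Regex Tag = new Regex(
        @"(?></?\w+)(?>(?:[^>'""]+|'[^']*'|""[^""]*"")*)>");

    public static string StripTags(string fragment)
    {
        return Tag.Replace(fragment, string.Empty);
    }
}
```

Here StripTags("<a title=\"a > b\">link</a>") returns "link", since the double-quoted alternative consumes the whole attribute value before the final ">" is matched. Because the pattern requires a word character right after "<" or "</", comments and the DOCTYPE declaration are left in place, which is consistent with the answer's assumption that the input is only a fragment.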

User · Answer

Use this:

    @"(?></?\w+)(?>(?:[^>'""]+|'[^']*'|""[^""]*"")*)>"

User · Answer

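Use this method to remove tags:

    // Requires: using System.Linq; using System.Text.RegularExpressions;
    public string From_To(string text, string from, string to)
    {
        if (text == null)
            return null;

        // Lazily match everything between the two delimiters.
        string pattern = @"" + from + ".*?" + to;
        Regex rx = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
        MatchCollection matches = rx.Matches(text);

        // Remove every non-empty match from the text.
        return matches.Count <= 0 ? text : matches.Cast<Match>().Where(match => !string.IsNullOrEmpty(match.Value)).Aggregate(text, (current, match) => current.Replace(match.Value, ""));
    }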
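The method above concatenates the delimiters straight into the pattern, so a delimiter such as "[" or "(" is interpreted as regex syntax rather than as literal text. A minimal sketch of one way around that, assuming the delimiters are always meant literally, is to pass them through Regex.Escape and let a single Regex.Replace call do the removal. The names DelimitedSpanRemover and RemoveBetween are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical variant; assumes "from" and "to" are literal delimiters.
public static class DelimitedSpanRemover
{
    public static string RemoveBetween(string text, string from, string to)
    {
        if (text == null)
        {
            return null;
        }

        // Escape the delimiters so regex metacharacters in them match literally,
        // then lazily match everything up to the nearest closing delimiter.
        string pattern = Regex.Escape(from) + ".*?" + Regex.Escape(to);

        // Singleline lets a span cross line breaks.
        return Regex.Replace(text, pattern, string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

For example, RemoveBetween("a<b>c</b>d", "<", ">") returns "acd", and RemoveBetween("x [note] y", "[", "]") returns "x  y" with both surrounding spaces kept.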

User · Answer

JasonTrue is correct that stripping HTML tags should not be done via regular expressions.

It's quite simple to strip HTML tags using HtmlAgilityPack:

    public string StripTags(string input)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(input ?? "");
        return doc.DocumentNode.InnerText;
    }
