How can I strip HTML tags from a string in ASP.NET?

Question

Using ASP.NET, how can I strip the HTML tags from a given string reliably (i.e. not using regex)? I am looking for something like PHP's strip_tags.

Example: <ul><li>Hello</li></ul>

Output: "Hello"

I am trying not to reinvent the wheel, but I have not found anything that meets my needs so far.

User · Accepted Answer

If it is just stripping all HTML tags from a string, this works reliably with regex as well. Replace:

    <[^>]*(>|$)

with the empty string, globally. Don't forget to normalize the string afterwards, replacing:

    [\s\r\n]+

with a single space, and trimming the result. Optionally replace any HTML character entities back to the actual characters.

Note: There is a limitation: HTML and XML allow > in attribute values. This solution will return broken markup when encountering such values.

The solution is technically safe, as in: the result will never contain anything that could be used to do cross-site scripting or to break a page layout. It is just not very clean.

As with all things HTML and regex: use a proper parser if you must get it right under all circumstances.
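The three steps in this answer (remove tags, collapse whitespace, decode entities) can be sketched roughly as follows. This is only an illustration under assumptions: the class and method names (HtmlStripSketch, StripTags) are made up for the example, and WebUtility.HtmlDecode is one possible choice for the optional entity step.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative and not from the answer.
public static class HtmlStripSketch
{
    // "<", then anything that is not ">", then ">" or end of input (unclosed tag).
    private static readonly Regex Tag = new Regex(@"<[^>]*(>|$)");

    // One or more whitespace characters, including \r and \n.
    private static readonly Regex Whitespace = new Regex(@"[\s\r\n]+");

    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        string noTags = Tag.Replace(html, string.Empty);              // 1. drop tags
        string normalized = Whitespace.Replace(noTags, " ").Trim();   // 2. collapse whitespace
        return WebUtility.HtmlDecode(normalized);                     // 3. decode entities (optional step)
    }
}
```

With this sketch, HtmlStripSketch.StripTags("<ul><li>Hello</li></ul>") returns "Hello". The limitation mentioned in the note also shows up: for <a title="a > b">link</a> the match stops at the first >, so the result is b">link rather than link.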

User · Answer

    protected string StripHtml(string Txt)
    {
        return Regex.Replace(Txt, "<(.|\\n)*?>", string.Empty);
    }

    Protected Function StripHtml(Txt as String) as String
        Return Regex.Replace(Txt, "<(.|\n)*?>", String.Empty)
    End Function

User · Answer

I've posted this on the asp.net forums, and it still seems to be one of the easiest solutions out there. I won't guarantee it's the fastest or most efficient, but it's pretty reliable.

In .NET you can use the HTML Web Control objects themselves. All you really need to do is insert your string into a temporary HTML object such as a DIV, then use the built-in InnerText to grab all text that is not contained within tags. See below for a simple C# example:

    System.Web.UI.HtmlControls.HtmlGenericControl htmlDiv = new System.Web.UI.HtmlControls.HtmlGenericControl("div");
    htmlDiv.InnerHtml = htmlString;
    String plainText = htmlDiv.InnerText;

User · Answer

    using System.Text.RegularExpressions;

    string str = Regex.Replace(HttpUtility.HtmlDecode(HTMLString), "<.*?>", string.Empty);

User · Answer

You can also do this with AngleSharp, which is an alternative to HtmlAgilityPack (not that HAP is bad). It is easier to use than HAP to get the text out of an HTML source.

    var parser = new HtmlParser();
    var htmlDocument = parser.ParseDocument(source);
    var text = htmlDocument.Body.Text();

You can take a look at the key features section, where they make a case for being "better" than HAP. I think for the most part it is probably overkill for the current question, but still, it is an interesting alternative.

User · Answer

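Simply use string.StripHTML(); (note that System.String has no built-in StripHTML method, so this presumably relies on a custom extension method).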

User · Answer

For those who are complaining about Michael Tiptop's solution not working, here is the .NET 4 way of doing it:

    public static string StripTags(this string markup)
    {
        try
        {
            StringReader sr = new StringReader(markup);
            XPathDocument doc;
            using (XmlReader xr = XmlReader.Create(sr,
                               new XmlReaderSettings()
                               {
                                   ConformanceLevel = ConformanceLevel.Fragment
                                   // for multiple roots
                               }))
            {
                doc = new XPathDocument(xr);
            }

            return doc.CreateNavigator().Value; // .Value is similar to .InnerText of
                                                // XmlDocument or JavaScript's innerText
        }
        catch
        {
            // Input that is not well-formed XML ends up here.
            return string.Empty;
        }
    }

User · Answer

For those of you who can't use the HtmlAgilityPack, .NET's XML reader is an option. This can fail on well-formatted HTML though, so always add a catch with regex as a backup. Note this is NOT fast, but it does provide a nice opportunity for old-school step-through debugging.

    public static string RemoveHTMLTags(string content)
    {
        var cleaned = string.Empty;
        try
        {
            StringBuilder textOnly = new StringBuilder();
            // Wrap the fragment in a single root element so it parses as one XML document.
            using (var reader = XmlReader.Create(new System.IO.StringReader("<xml>" + content + "</xml>")))
            {
                while (reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.Text)
                        textOnly.Append(reader.ReadContentAsString());
                }
            }
            cleaned = textOnly.ToString();
        }
        catch
        {
            // A tag is probably not closed; fall back to regex string cleaning.
            string textOnly = string.Empty;
            Regex tagRemove = new Regex(@"<[^>]*(>|$)");
            Regex compressSpaces = new Regex(@"[\s\r\n]+");
            textOnly = tagRemove.Replace(content, string.Empty);
            textOnly = compressSpaces.Replace(textOnly, " ");
            cleaned = textOnly;
        }

        return cleaned;
    }

User · Answer

For the second parameter, i.e. keeping some tags, you may need some code like this using HtmlAgilityPack:

    public string StripTags(HtmlNode documentNode, IList keepTags)
    {
        var result = new StringBuilder();
        foreach (var childNode in documentNode.ChildNodes)
        {
            if (childNode.Name.ToLower() == "#text")
            {
                result.Append(childNode.InnerText);
            }
            else
            {
                if (!keepTags.Contains(childNode.Name.ToLower()))
                {
                    result.Append(StripTags(childNode, keepTags));
                }
                else
                {
                    result.Append(childNode.OuterHtml.Replace(childNode.InnerHtml, StripTags(childNode, keepTags)));
                }
            }
        }
        return result.ToString();
    }

More explanation on this page: http://nalgorithm.com/2015/11/20/strip-html-tags-of-an-html-in-c-strip_html-php-equivalent

User · Answer

    Regex.Replace(htmlText, "<.*?>", string.Empty);

User · Answer

    string result = Regex.Replace(anytext, @"<(.|\n)*?>", string.Empty);

User · Answer

Go download HtmlAgilityPack, now! This allows you to load and parse HTML. Then you can navigate the DOM and extract the inner values of all attributes. Seriously, it will take you about 10 lines of code at the maximum. It is one of the greatest free .NET libraries out there.

Here is a sample:

    string htmlContents = new System.IO.StreamReader(resultsStream, Encoding.UTF8, true).ReadToEnd();

    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(htmlContents);
    if (doc == null) return null;

    string output = "";
    foreach (var node in doc.DocumentNode.ChildNodes)
    {
        output += node.InnerText;
    }

User · Answer

I've looked at the regex-based solutions suggested here, and they don't fill me with any confidence except in the most trivial cases. An angle bracket in an attribute is all it would take to break them, let alone malformed HTML from the wild. And what about entities like &amp;? If you want to convert HTML into plain text, you need to decode entities too.

So I propose the method below. Using HtmlAgilityPack, this extension method efficiently strips all HTML tags from an HTML fragment. It also decodes HTML entities like &amp; and returns just the inner text items, with a new line between each text item.

    public static string RemoveHtmlTags(this string html)
    {
        if (String.IsNullOrEmpty(html))
            return html;

        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        if (doc.DocumentNode == null || doc.DocumentNode.ChildNodes == null)
        {
            return WebUtility.HtmlDecode(html);
        }

        var sb = new StringBuilder();
        var i = 0;

        foreach (var node in doc.DocumentNode.ChildNodes)
        {
            var text = node.InnerText.SafeTrim();

            if (!String.IsNullOrEmpty(text))
            {
                sb.Append(text);

                // Separate text items with a new line, except after the last node.
                if (i < doc.DocumentNode.ChildNodes.Count - 1)
                {
                    sb.Append(Environment.NewLine);
                }
            }

            i++;
        }

        var result = sb.ToString();
        return WebUtility.HtmlDecode(result);
    }

    public static string SafeTrim(this string str)
    {
        if (str == null)
            return null;

        return str.Trim();
    }

If you are really serious, you'd want to ignore the contents of certain HTML tags too (<script>, <style>, <svg>, <head>, <object> come to mind!) because they probably don't contain readable content in the sense we are after. What you do there will depend on your circumstances and how far you want to go, but using HtmlAgilityPack it would be pretty trivial to whitelist or blacklist selected tags.

If you are rendering the content back to an HTML page, make sure you understand XSS vulnerabilities and how to prevent them, i.e. always encode any user-entered text that gets rendered back onto an HTML page (> becomes &gt; etc).
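As a small illustration of the encoding advice above, here is a minimal sketch that uses WebUtility.HtmlEncode from System.Net before text is written back into a page. The class and method names (OutputEncodingSketch, ToSafeHtml) are made up for the example; in a real ASP.NET page the framework's own encoding helpers may be the more natural choice.

```csharp
using System;
using System.Net;

// Hypothetical helper; the names are illustrative only.
public static class OutputEncodingSketch
{
    // Encodes characters such as <, > and & so the text is displayed
    // literally instead of being interpreted as markup.
    public static string ToSafeHtml(string userText)
    {
        return WebUtility.HtmlEncode(userText ?? string.Empty);
    }
}
```

For example, OutputEncodingSketch.ToSafeHtml("a > b") returns "a &gt; b", and a <script> tag in user input comes out as &lt;script&gt; rather than as an executable element.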

User · Answer

I have written a pretty fast method in C# which beats the hell out of the regex. It is hosted in an article on CodeProject. Its advantages are, besides better performance, the ability to replace named and numbered HTML entities (those like &amp; and &#203;), comment block replacement, and more. Please read the related article on CodeProject. Thank you.
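The CodeProject code itself is not reproduced in this thread. As a rough idea of what a non-regex approach can look like, here is a minimal single-pass character scan. It is an assumption-laden sketch, not the author's method: the names (CharScanStripSketch, StripTags) are invented, and it handles neither entities nor comment blocks nor > inside attribute values.

```csharp
using System;
using System.Text;

// Hypothetical illustration of a regex-free tag stripper; not the CodeProject code.
public static class CharScanStripSketch
{
    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        var sb = new StringBuilder(html.Length);
        bool insideTag = false;

        foreach (char c in html)
        {
            if (c == '<')
            {
                insideTag = true;        // start skipping at the opening bracket
            }
            else if (c == '>' && insideTag)
            {
                insideTag = false;       // closing bracket ends the skipped region
            }
            else if (!insideTag)
            {
                sb.Append(c);            // keep everything outside of tags
            }
        }

        return sb.ToString();
    }
}
```

CharScanStripSketch.StripTags("<ul><li>Hello</li></ul>") returns "Hello". A stray < in plain text (as in "1 < 2") makes the scan drop everything up to the next >, which is one reason a real parser is still the safer choice.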
