Remove HTML tags from string including &nbsp in C#

Question

How can I remove all the HTML tags, including &nbsp, using regex in C#? My string looks like:

    "<div>hello</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div>"

User · Accepted Answer

If you can't use an HTML parser oriented solution to filter out the tags, here's a simple regex for it:

    string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

You should ideally make another pass through a regex filter that takes care of multiple spaces:

    string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");
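As a minimal sketch, the two passes can be wrapped in a single method so they are easy to try on the string from the question. The class name `HtmlStripDemo` and method name `StripTags` are hypothetical; only the two `Regex.Replace` calls come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlStripDemo
{
    // Hypothetical wrapper: the name is an assumption, the patterns are the answer's.
    public static string StripTags(string inputHTML)
    {
        // Pass 1: remove anything that looks like a tag, plus literal "&nbsp;" entities.
        string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

        // Pass 2: collapse runs of two or more whitespace characters into one space.
        return Regex.Replace(noHTML, @"\s{2,}", " ");
    }
}
```

For example, `HtmlStripDemo.StripTags("<div>hello</div><div><br></div><div>&nbsp; &nbsp;</div>")` returns `hello`: the spaces left between the removed entities trail the text and are dropped by `Trim()`.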

User · Answer

I've been using this function for a while. It removes pretty much any messy HTML you can throw at it and leaves the text intact.

    using System;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Web; // HttpUtility

    private static readonly Regex _tags_ = new Regex(@"<[^>]+?>", RegexOptions.Multiline | RegexOptions.Compiled);

    // Add characters that should not be removed to this regex
    private static readonly Regex _notOkCharacter_ = new Regex(@"[^\w;&#@.:/\\?=|%!() -]", RegexOptions.Compiled);

    public static String UnHtml(String html)
    {
        html = HttpUtility.UrlDecode(html);
        html = HttpUtility.HtmlDecode(html);

        // Drop comments, scripts and styles together with their contents
        html = RemoveTag(html, "<!--", "-->");
        html = RemoveTag(html, "<script", "</script>");
        html = RemoveTag(html, "<style", "</style>");

        // Replace matches of these regexes with a space
        html = _tags_.Replace(html, " ");
        html = _notOkCharacter_.Replace(html, " ");
        html = SingleSpacedTrim(html);

        return html;
    }

    // Removes every startTag...endTag span, repeating until none is left
    private static String RemoveTag(String html, String startTag, String endTag)
    {
        Boolean bAgain;
        do
        {
            bAgain = false;
            Int32 startTagPos = html.IndexOf(startTag, 0, StringComparison.CurrentCultureIgnoreCase);
            if (startTagPos < 0)
                continue;
            Int32 endTagPos = html.IndexOf(endTag, startTagPos + 1, StringComparison.CurrentCultureIgnoreCase);
            if (endTagPos <= startTagPos)
                continue;
            html = html.Remove(startTagPos, endTagPos - startTagPos + endTag.Length);
            bAgain = true;
        } while (bAgain);
        return html;
    }

    // Collapses each run of CR, LF, tab or space into a single space, then trims
    private static String SingleSpacedTrim(String inString)
    {
        StringBuilder sb = new StringBuilder();
        Boolean inBlanks = false;
        foreach (Char c in inString)
        {
            switch (c)
            {
                case '\r':
                case '\n':
                case '\t':
                case ' ':
                    if (!inBlanks)
                    {
                        inBlanks = true;
                        sb.Append(' ');
                    }
                    continue;
                default:
                    inBlanks = false;
                    sb.Append(c);
                    break;
            }
        }
        return sb.ToString().Trim();
    }

User · Answer

    (<([^>]+)>|&nbsp;)

You can test it here: https://regex101.com/r/kB0rQ4/1

User · Answer

I have used @RaviThapliyal's and Don Rolling's code with a small modification. Their version replaces &nbsp with an empty string, but &nbsp should be replaced with a space, so I added an additional step. It worked for me like a charm.

    public static string FormatString(string value)
    {
        var step1 = Regex.Replace(value, @"<[^>]+>", "").Trim();
        var step2 = Regex.Replace(step1, @"&nbsp;", " ");
        var step3 = Regex.Replace(step2, @"\s{2,}", " ");
        return step3;
    }

I wrote &nbsp without the semicolon in the text above because Stack Overflow was rendering it.

User · Answer

Well-formed HTML is, in its basic form, just XML. You could parse your text into an XmlDocument object and read InnerText on the root element to extract the text. This strips all HTML tags in any form and also decodes special characters like &lt;, all in one go. Note that &nbsp; is not a predefined XML entity, so it has to be declared or replaced before parsing.
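A rough sketch of the XmlDocument idea follows. It rests on assumptions the answer does not spell out: the input must be well-formed XML (an unclosed `<br>`, as in the question's string, makes `LoadXml` throw an `XmlException`), the fragment is wrapped in a synthetic `<root>` element because it may have several top-level elements, and `&nbsp;` is swapped for the numeric reference `&#160;` first. The names `XmlTextExtractor` and `ExtractText` are hypothetical.

```csharp
using System;
using System.Xml;

public static class XmlTextExtractor
{
    // Hypothetical helper; works only for well-formed XML fragments.
    public static string ExtractText(string fragment)
    {
        // &nbsp; is not a predefined XML entity, so use its numeric form instead.
        string prepared = fragment.Replace("&nbsp;", "&#160;");

        // Wrap in a single root element, since a fragment may have several siblings.
        var doc = new XmlDocument();
        doc.LoadXml("<root>" + prepared + "</root>");

        // InnerText concatenates the text of all descendant nodes, entities decoded.
        return doc.DocumentElement.InnerText;
    }
}
```

With this sketch, `ExtractText("<div>hello</div><div><br/></div>")` yields `hello`. Non-breaking spaces come back as the character U+00A0 rather than being removed, so a further whitespace pass may still be wanted.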

User · Answer

I took @Ravi Thapliyal's code and made a method. It is simple and might not clean everything, but so far it is doing what I need it to do.

    public static string ScrubHtml(string value)
    {
        var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim();
        var step2 = Regex.Replace(step1, @"\s{2,}", " ");
        return step2;
    }

User · Answer

    var noHtml = Regex.Replace(inputHTML, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();
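To illustrate what this pattern adds, here is a small sketch. The `(>|$)` alternative lets a tag that is cut off at the end of the input still match, and the extra alternatives remove a few more named entities. The class name `EntityAwareStripper` and method name `Strip` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EntityAwareStripper
{
    // Hypothetical wrapper around the pattern from the answer.
    public static string Strip(string inputHTML)
    {
        // "<[^>]*(>|$)" matches a complete tag, or an unterminated one running to the end.
        return Regex.Replace(
            inputHTML,
            @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;",
            string.Empty).Trim();
    }
}
```

For instance, `Strip("hello<br")` returns `hello`, whereas a pattern that requires the closing `>` would leave the truncated tag in place.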

User · Answer

This pattern will match any tag or &nbsp;:

    (<.+?>|&nbsp;)

Used in code:

    string regex = @"(<.+?>|&nbsp;)";
    var x = Regex.Replace(originalString, regex, "").Trim();

Then x = hello.

User · Answer

Sanitizing an HTML document involves a lot of tricky things. This package may be of help: https://github.com/mganss/HtmlSanitizer
