How do you convert HTML to plain text?

Question

I have snippets of HTML stored in a table. Not entire pages, no tags or the like, just basic formatting.

I would like to be able to display that HTML as text only, with no formatting, on a given page (actually just the first 30 - 50 characters, but that's the easy bit).

How do I place the "text" within that HTML into a string as straight text?

So this piece of code:

    <b>Hello World.</b><br/><p><i>Is there anyone out there?</i><p>

Becomes:

    Hello World. Is there anyone out there?

User · Answer

I did not write this, but I am using it:

    using HtmlAgilityPack;
    using System;
    using System.IO;
    using System.Text.RegularExpressions;

    namespace foo
    {
        // small but important modification to class
        // https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
        public static class HtmlToText
        {
            public static string Convert(string path)
            {
                HtmlDocument doc = new HtmlDocument();
                doc.Load(path);
                return ConvertDoc(doc);
            }

            public static string ConvertHtml(string html)
            {
                HtmlDocument doc = new HtmlDocument();
                doc.LoadHtml(html);
                return ConvertDoc(doc);
            }

            public static string ConvertDoc(HtmlDocument doc)
            {
                using (StringWriter sw = new StringWriter())
                {
                    ConvertTo(doc.DocumentNode, sw);
                    sw.Flush();
                    return sw.ToString();
                }
            }

            internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
            {
                foreach (HtmlNode subnode in node.ChildNodes)
                {
                    ConvertTo(subnode, outText, textInfo);
                }
            }

            public static void ConvertTo(HtmlNode node, TextWriter outText)
            {
                ConvertTo(node, outText, new PreceedingDomTextInfo(false));
            }

            internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
            {
                string html;
                switch (node.NodeType)
                {
                    case HtmlNodeType.Comment:
                        // don't output comments
                        break;
                    case HtmlNodeType.Document:
                        ConvertContentTo(node, outText, textInfo);
                        break;
                    case HtmlNodeType.Text:
                        // script and style must not be output
                        string parentName = node.ParentNode.Name;
                        if ((parentName == "script") || (parentName == "style"))
                        {
                            break;
                        }
                        // get text
                        html = ((HtmlTextNode)node).Text;
                        // is it in fact a special closing node output as text?
                        if (HtmlNode.IsOverlappedClosingElement(html))
                        {
                            break;
                        }
                        // check the text is meaningful and not a bunch of whitespaces
                        if (html.Length == 0)
                        {
                            break;
                        }
                        if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace)
                        {
                            html = html.TrimStart();
                            if (html.Length == 0) { break; }
                            textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
                        }
                        outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
                        if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1]))
                        {
                            outText.Write(' ');
                        }
                        break;
                    case HtmlNodeType.Element:
                        string endElementString = null;
                        bool isInline;
                        bool skip = false;
                        int listIndex = 0;
                        switch (node.Name)
                        {
                            case "nav":
                                skip = true;
                                isInline = false;
                                break;
                            case "body":
                            case "section":
                            case "article":
                            case "aside":
                            case "h1":
                            case "h2":
                            case "header":
                            case "footer":
                            case "address":
                            case "main":
                            case "div":
                            case "p": // stylistic - adjust as you tend to use
                                if (textInfo.IsFirstTextOfDocWritten)
                                {
                                    outText.Write("\r\n");
                                }
                                endElementString = "\r\n";
                                isInline = false;
                                break;
                            case "br":
                                outText.Write("\r\n");
                                skip = true;
                                textInfo.WritePrecedingWhiteSpace = false;
                                isInline = true;
                                break;
                            case "a":
                                if (node.Attributes.Contains("href"))
                                {
                                    string href = node.Attributes["href"].Value.Trim();
                                    if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase) == -1)
                                    {
                                        endElementString = "<" + href + ">";
                                    }
                                }
                                isInline = true;
                                break;
                            case "li":
                                if (textInfo.ListIndex > 0)
                                {
                                    outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
                                }
                                else
                                {
                                    // using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
                                    outText.Write("\r\n*\t");
                                }
                                isInline = false;
                                break;
                            case "ol":
                                listIndex = 1;
                                goto case "ul";
                            case "ul": // not handling nested lists any differently at this stage - that is getting close to rendering problems
                                endElementString = "\r\n";
                                isInline = false;
                                break;
                            case "img": // inline-block in reality
                                if (node.Attributes.Contains("alt"))
                                {
                                    outText.Write('[' + node.Attributes["alt"].Value);
                                    endElementString = "]";
                                }
                                if (node.Attributes.Contains("src"))
                                {
                                    outText.Write('<' + node.Attributes["src"].Value + '>');
                                }
                                isInline = true;
                                break;
                            default:
                                isInline = true;
                                break;
                        }
                        if (!skip && node.HasChildNodes)
                        {
                            ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten) { ListIndex = listIndex });
                        }
                        if (endElementString != null)
                        {
                            outText.Write(endElementString);
                        }
                        break;
                }
            }
        }

        internal class PreceedingDomTextInfo
        {
            public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten)
            {
                IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
            }
            public bool WritePrecedingWhiteSpace { get; set; }
            public bool LastCharWasSpace { get; set; }
            public readonly BoolWrapper IsFirstTextOfDocWritten;
            public int ListIndex { get; set; }
        }

        internal class BoolWrapper
        {
            public BoolWrapper() { }
            public bool Value { get; set; }
            public static implicit operator bool(BoolWrapper boolWrapper)
            {
                return boolWrapper.Value;
            }
            public static implicit operator BoolWrapper(bool boolWrapper)
            {
                return new BoolWrapper { Value = boolWrapper };
            }
        }
    }

User · Answer

The simplest way I found:

    HtmlFilter.ConvertToPlainText(html);

The HtmlFilter class is located in Microsoft.TeamFoundation.WorkItemTracking.Controls.dll.

The DLL can be found in a folder like this: %ProgramFiles%\Common Files\microsoft shared\Team Foundation Server\14.0\

In VS 2015, the DLL also requires a reference to Microsoft.TeamFoundation.WorkItemTracking.Common.dll, located in the same folder.

User · Answer

This has the limitation that it does not collapse long runs of inline whitespace, but it is portable and respects layout much as a web browser does.

    // requires: using System.Net; using System.Text.RegularExpressions;
    static string HtmlToPlainText(string html)
    {
        string buf;
        string block = "address|article|aside|blockquote|canvas|dd|div|dl|dt|" +
            "fieldset|figcaption|figure|footer|form|h\\d|header|hr|li|main|nav|" +
            "noscript|ol|output|p|pre|section|table|tfoot|ul|video";

        // Collapse each run of adjacent block-level tags into a single newline.
        string patNestedBlock = $"(\\s*?</?({block})[^>]*?>)+\\s*";
        buf = Regex.Replace(html, patNestedBlock, "\n", RegexOptions.IgnoreCase);

        // Replace br tag with newline.
        buf = Regex.Replace(buf, @"<(br)[^>]*>", "\n", RegexOptions.IgnoreCase);

        // (Optional) remove styles and scripts.
        buf = Regex.Replace(buf, @"<(script|style)[^>]*?>.*?</\1>", "", RegexOptions.Singleline);

        // Remove all tags.
        buf = Regex.Replace(buf, @"<[^>]*(>|$)", "", RegexOptions.Multiline);

        // Replace HTML entities.
        buf = WebUtility.HtmlDecode(buf);
        return buf;
    }

User · Answer

I think it has a simple answer:

    public string RemoveHTMLTags(string HTMLCode)
    {
        string str = System.Text.RegularExpressions.Regex.Replace(HTMLCode, "<[^>]*>", "");
        return str;
    }
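The question also asks for only the first 30 to 50 characters of the text. A minimal sketch of how tag stripping and truncation might be combined is shown here; the class name `PlainTextPreview` and the method `Preview` are hypothetical, and the choice to replace each tag with a space (so that words separated only by tags do not run together) is an assumption rather than something the answer above prescribes.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class PlainTextPreview
{
    public static string Preview(string html, int maxLength)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // Replace each tag with a space so adjacent words do not merge.
        string text = Regex.Replace(html, "<[^>]*>", " ");

        // Decode entities such as &amp; once the tags are gone.
        text = WebUtility.HtmlDecode(text);

        // Collapse whitespace runs into single spaces and trim the ends.
        text = Regex.Replace(text, @"\s+", " ").Trim();

        // Keep at most maxLength characters.
        return text.Length <= maxLength ? text : text.Substring(0, maxLength);
    }
}
```

Calling `PlainTextPreview.Preview(snippet, 50)` on the snippet from the question should give the whole sentence pair, since it is shorter than 50 characters. One trade-off of inserting spaces: markup inside a word, such as he<span>ll</span>o, comes out as "he ll o".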

User · Answer

The MIT-licensed HtmlAgilityPack has, in one of its samples, a method that converts from HTML to plain text:

    var plainText = HtmlUtilities.ConvertToPlainText(html);

Feed it an HTML string like

    <b>hello, <i>world!</i></b>

and you'll get a plain text result like:

    hello, world!

User · Answer

If you are talking about tag stripping, it is relatively straightforward if you don't have to worry about things like <script> tags. If all you need to do is display the text without the tags, you can accomplish that with a regular expression:

    <[^>]*>

If you do have to worry about <script> tags and the like, then you'll need something a bit more powerful than regular expressions, because you need to track state: something more like a Context Free Grammar (CFG). Although you might be able to accomplish it with "Left To Right" or non-greedy matching.

If you can use regular expressions, there are many web pages out there with good info:

    http://weblogs.asp.net/rosherove/archive/2003/05/13/6963.aspx
    http://www.google.com/search?hl=en&q=html+tag+stripping+&btnG=Search

If you need the more complex behaviour of a CFG, I would suggest using a third-party tool; unfortunately I don't know of a good one to recommend.
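To make the <script> concern concrete, here is a small sketch that contrasts the plain tag-stripping expression with a two-pass variant that first removes whole script and style blocks using non-greedy matching. The class and method names are hypothetical, and this is still a regex approximation under the assumption that the input is reasonably well-formed; it is not a substitute for a real HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class ScriptAwareStripper
{
    // Matches <script>...</script> and <style>...</style> together with their contents.
    // Singleline lets '.' span newlines; '.*?' is non-greedy so each block ends at its own closing tag.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    private static readonly Regex AnyTag = new Regex("<[^>]*>");

    // Plain tag stripping: script source text is left behind in the output.
    public static string NaiveStrip(string html)
    {
        return AnyTag.Replace(html, string.Empty);
    }

    // Two passes: drop script/style blocks first, then strip the remaining tags.
    public static string Strip(string html)
    {
        if (html == null) return null;
        string withoutBlocks = ScriptOrStyle.Replace(html, string.Empty);
        return AnyTag.Replace(withoutBlocks, string.Empty);
    }
}
```

For the input `<p>Hi</p><script>if (a < b) { alert('x'); }</script><p>there</p>`, `NaiveStrip` leaks part of the script into the result, while `Strip` returns only the paragraph text.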

User · Answer

There is no method named "ConvertToPlainText" in HtmlAgilityPack itself, but you can convert an HTML string to a clean string with:

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(htmlString);
    var textString = doc.DocumentNode.InnerText;
    // Remove any leftover tags and non-breaking-space entities; keep the result.
    textString = Regex.Replace(textString, @"<(.|\n)*?>", string.Empty).Replace("&nbsp;", "");

That works for me, but I still cannot find a method named "ConvertToPlainText" in HtmlAgilityPack.

User · Answer

Depends on what you mean by "html." The most complex case would be complete web pages. That's also the easiest to handle, since you can use a text-mode web browser. See the Wikipedia article listing web browsers, including text-mode browsers. Lynx is probably the best known, but one of the others may be better for your needs.

User · Answer

For anyone looking for an exact solution to the OP's question (a textual abbreviation of a given HTML document, without newlines and HTML tags), please find the solution below.

As with every proposed solution, the code below makes some assumptions:

- script or style tags should not contain script and style tags as a part of the script
- only major inline elements will be inlined without a space, i.e. he<span>ll</span>o should output hello. List of inline tags: https://www.w3schools.com/htmL/html_blocks.asp

Considering the above, the following string extension with compiled regular expressions will output the expected plain text, decoding HTML-escaped characters and returning null on null input.

    // requires: using System.Text.RegularExpressions; using System.Web;
    public static class StringExtensions
    {
        public static string ConvertToPlain(this string html)
        {
            if (html == null)
            {
                return html;
            }

            html = scriptRegex.Replace(html, string.Empty);
            html = inlineTagRegex.Replace(html, string.Empty);
            html = tagRegex.Replace(html, " ");
            html = HttpUtility.HtmlDecode(html);
            html = multiWhitespaceRegex.Replace(html, " ");

            return html.Trim();
        }

        // \b keeps short names such as "b", "i" and "a" from also matching <br>, <img> or <article>.
        private static readonly Regex inlineTagRegex = new Regex("<\\/?(a|span|sub|sup|b|i|strong|small|big|em|label|q)\\b[^>]*>", RegexOptions.Compiled | RegexOptions.Singleline);
        private static readonly Regex scriptRegex = new Regex("<(script|style)[^>]*?>.*?</\\1>", RegexOptions.Compiled | RegexOptions.Singleline);
        private static readonly Regex tagRegex = new Regex("<[^>]+>", RegexOptions.Compiled | RegexOptions.Singleline);
        private static readonly Regex multiWhitespaceRegex = new Regex("\\s+", RegexOptions.Compiled | RegexOptions.Singleline);
    }

User · Answer

I faced a similar problem and found what was, for me, the best solution. The code below works perfectly for me.

    // requires: using System.Text.RegularExpressions; using System.Windows.Forms;
    private string ConvertHtml_Totext(string source)
    {
        try
        {
            string result;
            const RegexOptions ic = RegexOptions.IgnoreCase;

            // Remove HTML development formatting.
            // Replace line breaks with a space, because browsers insert a space.
            result = source.Replace("\r", " ");
            result = result.Replace("\n", " ");
            // Remove step-formatting
            result = result.Replace("\t", string.Empty);
            // Remove repeating spaces because browsers ignore them
            result = Regex.Replace(result, @"( )+", " ");

            // Remove the header (prepare first by clearing attributes)
            result = Regex.Replace(result, @"<( )*head([^>])*>", "<head>", ic);
            result = Regex.Replace(result, @"(<( )*(/)( )*head( )*>)", "</head>", ic);
            result = Regex.Replace(result, "(<head>).*(</head>)", string.Empty, ic);

            // Remove all scripts (prepare first by clearing attributes)
            result = Regex.Replace(result, @"<( )*script([^>])*>", "<script>", ic);
            result = Regex.Replace(result, @"(<( )*(/)( )*script( )*>)", "</script>", ic);
            //result = Regex.Replace(result, @"(<script>)([^(<script>\.</script>)])*(</script>)", string.Empty, ic);
            result = Regex.Replace(result, @"(<script>).*(</script>)", string.Empty, ic);

            // Remove all styles (prepare first by clearing attributes)
            result = Regex.Replace(result, @"<( )*style([^>])*>", "<style>", ic);
            result = Regex.Replace(result, @"(<( )*(/)( )*style( )*>)", "</style>", ic);
            result = Regex.Replace(result, "(<style>).*(</style>)", string.Empty, ic);

            // Insert tabs in place of <td> tags
            result = Regex.Replace(result, @"<( )*td([^>])*>", "\t", ic);

            // Insert line breaks in place of <BR> and <LI> tags
            result = Regex.Replace(result, @"<( )*br( )*>", "\r", ic);
            result = Regex.Replace(result, @"<( )*li( )*>", "\r", ic);

            // Insert paragraphs (double line breaks) in place of <P>, <DIV> and <TR> tags
            result = Regex.Replace(result, @"<( )*div([^>])*>", "\r\r", ic);
            result = Regex.Replace(result, @"<( )*tr([^>])*>", "\r\r", ic);
            result = Regex.Replace(result, @"<( )*p([^>])*>", "\r\r", ic);

            // Remove remaining tags like <a>, links, images,
            // comments etc - anything that's enclosed inside < >
            result = Regex.Replace(result, @"<[^>]*>", string.Empty, ic);

            // Replace special characters:
            result = Regex.Replace(result, @"&nbsp;", " ", ic);
            result = Regex.Replace(result, @"&bull;", " * ", ic);
            result = Regex.Replace(result, @"&lsaquo;", "<", ic);
            result = Regex.Replace(result, @"&rsaquo;", ">", ic);
            result = Regex.Replace(result, @"&trade;", "(tm)", ic);
            result = Regex.Replace(result, @"&frasl;", "/", ic);
            result = Regex.Replace(result, @"&lt;", "<", ic);
            result = Regex.Replace(result, @"&gt;", ">", ic);
            result = Regex.Replace(result, @"&copy;", "(c)", ic);
            result = Regex.Replace(result, @"&reg;", "(r)", ic);
            // Remove all others. More can be added, see
            // http://hotwired.lycos.com/webmonkey/reference/special_characters/
            result = Regex.Replace(result, @"&(.{2,6});", string.Empty, ic);

            // for testing
            //Regex.Replace(result, this.txtRegex.Text, string.Empty, ic);

            // Make line breaking consistent
            result = result.Replace("\n", "\r");

            // Remove extra line breaks and tabs:
            // replace over 2 breaks with 2 and over 4 tabs with 4.
            // Prepare first to remove any whitespace in between the escaped
            // characters and remove redundant tabs in between line breaks.
            result = Regex.Replace(result, "(\r)( )+(\r)", "\r\r", ic);
            result = Regex.Replace(result, "(\t)( )+(\t)", "\t\t", ic);
            result = Regex.Replace(result, "(\t)( )+(\r)", "\t\r", ic);
            result = Regex.Replace(result, "(\r)( )+(\t)", "\r\t", ic);
            // Remove redundant tabs
            result = Regex.Replace(result, "(\r)(\t)+(\r)", "\r\r", ic);
            // Replace multiple tabs following a line break with just one tab
            result = Regex.Replace(result, "(\r)(\t)+", "\r\t", ic);
            // Initial replacement target string for line breaks
            string breaks = "\r\r\r";
            // Initial replacement target string for tabs
            string tabs = "\t\t\t\t\t";
            for (int index = 0; index < result.Length; index++)
            {
                result = result.Replace(breaks, "\r\r");
                result = result.Replace(tabs, "\t\t\t\t");
                breaks = breaks + "\r";
                tabs = tabs + "\t";
            }

            // That's it.
            return result;
        }
        catch
        {
            MessageBox.Show("Error");
            return source;
        }
    }

Escape characters such as \n and \r had to be removed first because they cause the regexes to stop working as expected.

Moreover, to make the result string display correctly in a textbox, you might need to split it up and set the textbox's Lines property instead of assigning to its Text property:

    this.txtResult.Lines = ConvertHtml_Totext(this.txtSource.Text).Split("\r".ToCharArray());

Source: https://www.codeproject.com/Articles/11902/Convert-HTML-to-Plain-Text-2

User · Answer

If you have data that has HTML tags and you want to display it so that a person can SEE the tags, use HttpServerUtility.HtmlEncode.

If you have data that has HTML tags in it and you want the user to see the tags rendered, then display the text as is. If the text represents an entire web page, use an IFRAME for it.

If you have data that has HTML tags and you want to strip out the tags and just display the unformatted text, use a regular expression.
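As a rough illustration of the first case (showing the tags literally), the sketch below uses `System.Net.WebUtility.HtmlEncode`, swapped in here because it needs no `System.Web` reference; it is an assumption that this is an acceptable stand-in for the `HttpServerUtility` method named above. The class name `EncodeDemo` is hypothetical.

```csharp
using System;
using System.Net;

// Hypothetical helper for illustration; not part of any library.
public static class EncodeDemo
{
    // Escapes markup characters so a browser displays the tags instead of rendering them.
    public static string ShowTags(string html)
    {
        return WebUtility.HtmlEncode(html);
    }

    // Reverses the escaping, giving back the original markup.
    public static string RestoreTags(string encoded)
    {
        return WebUtility.HtmlDecode(encoded);
    }
}
```

With this, `EncodeDemo.ShowTags("<b>Hello</b>")` yields `&lt;b&gt;Hello&lt;/b&gt;`, which a browser shows as the literal text <b>Hello</b>.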

User · Answer

    public static string StripTags2(string html)
    {
        return html.Replace("<", "&lt;").Replace(">", "&gt;");
    }

With this you escape every "<" and ">" in a string. Is this what you want?

User · Answer

I could not use HtmlAgilityPack, so I wrote a second-best solution for myself:

    private static string HtmlToPlainText(string html)
    {
        // matches one or more (white space or line breaks) between '>' and '<'
        const string tagWhiteSpace = @"(>|$)(\W|\n|\r)+<";
        // matches any character between '<' and '>', even when the end tag is missing
        const string stripFormatting = @"<[^>]*(>|$)";
        // matches: <br>, <br/>, <br />, <BR>, <BR/>, <BR />
        const string lineBreak = @"<(br|BR)\s{0,1}\/{0,1}>";
        var lineBreakRegex = new Regex(lineBreak, RegexOptions.Multiline);
        var stripFormattingRegex = new Regex(stripFormatting, RegexOptions.Multiline);
        var tagWhiteSpaceRegex = new Regex(tagWhiteSpace, RegexOptions.Multiline);

        var text = html;
        // Decode HTML-specific characters
        text = System.Net.WebUtility.HtmlDecode(text);
        // Remove tag whitespace/line breaks
        text = tagWhiteSpaceRegex.Replace(text, "><");
        // Replace <br /> with line breaks
        text = lineBreakRegex.Replace(text, Environment.NewLine);
        // Strip formatting
        text = stripFormattingRegex.Replace(text, string.Empty);

        return text;
    }

User · Answer

If you are talking about tag stripping  it is relatively straight forward if you don t have to worry about things like  lt script gt  tags   If all you need to do is display the text without the tags you can accomplish that with a regular expression    lt    gt    gt    If you do have to worry about  lt script gt  tags and the like then you ll need something a bit more powerful then regular expressions because you need to track state  omething more like a Context Free Grammar  CFG   Althought you might be able to accomplish it with  Left To Right  or non-greedy matching   If you can use regular expressions there are many web pages out there with good info    http   weblogs asp net rosherove archive 2003 05 13 6963 aspx http   www google com search hl en amp q html tag stripping  amp btnG Search   If you need the more complex behaviour of a CFG I would suggest using a third party tool  unfortunately I don t know of a good one to recommend

User · Answer

Depends on what you mean by "html". The most complex case would be complete web pages. That's also the easiest to handle, since you can use a text-mode web browser. See the Wikipedia article listing web browsers, including text-mode browsers. Lynx is probably the best known, but one of the others may be better for your needs.

User · Answer

Here is my solution:

    public string StripHTML(string html)
    {
        if (string.IsNullOrWhiteSpace(html)) return "";

        // could be stored in static variable
        var regex = new Regex("<[^>]+>|\\s{2}", RegexOptions.IgnoreCase);
        return System.Web.HttpUtility.HtmlDecode(regex.Replace(html, ""));
    }

Example:

    StripHTML("<p class='test' style='color:red;'>Here is my solution</p>");
    // output -> Here is my solution

User · Answer

The MIT-licensed HtmlAgilityPack has, in one of its samples, a method that converts from HTML to plain text:

    var plainText = HtmlUtilities.ConvertToPlainText(string html);

Feed it an HTML string like

    <b>hello, <i>world!</i></b>

and you'll get a plain text result like:

    hello world!

User · Answer

To add to vfilby's answer, you can just perform a RegEx replace within your code; no new classes are necessary. In case other newbies like myself stumble upon this question:

    using System.Text.RegularExpressions;

Then...

    private string StripHtml(string source)
    {
        string output;

        // get rid of HTML tags
        output = Regex.Replace(source, "<[^>]*>", string.Empty);

        // get rid of multiple blank lines
        output = Regex.Replace(output, @"^\s*$\n", string.Empty, RegexOptions.Multiline);

        return output;
    }

User · Answer

HttpUtility.HtmlEncode() is meant to handle encoding HTML tags as strings. It takes care of all the heavy lifting for you. From the MSDN documentation:

"If characters such as blanks and punctuation are passed in an HTTP stream, they might be misinterpreted at the receiving end. HTML encoding converts characters that are not allowed in HTML into character-entity equivalents; HTML decoding reverses the encoding. For example, when embedded in a block of text, the characters < and > are encoded as &lt; and &gt; for HTTP transmission."

The HttpUtility.HtmlEncode() method, detailed here:

    public static void HtmlEncode(
        string s,
        TextWriter output
    )

Usage:

    String TestString = "This is a <Test String>.";
    StringWriter writer = new StringWriter();
    Server.HtmlEncode(TestString, writer);
    String EncodedString = writer.ToString();
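It may help to see what encoding and decoding actually do to a string, since neither removes tags. The sketch below uses `System.Net.WebUtility` rather than the `Server.HtmlEncode` call shown in this answer, on the assumption that no ASP.NET request context is available. The wrapper class `EncodeDemo` is hypothetical.

```csharp
using System;
using System.Net;

public static class EncodeDemo
{
    // Escapes markup so a browser displays the tags literally instead of rendering them.
    public static string Encode(string text)
    {
        return WebUtility.HtmlEncode(text);
    }

    // Reverses the escaping. It turns entities back into characters; it does not strip tags.
    public static string Decode(string text)
    {
        return WebUtility.HtmlDecode(text);
    }
}
```

`EncodeDemo.Encode("This is a <Test String>.")` returns `This is a &lt;Test String&gt;.`, and `EncodeDemo.Decode` on that result returns the original string. So encoding is the right tool when the reader should see the markup, and decoding is a useful last step after tags have been removed by some other means.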

User · Answer

    public static string StripTags2(string html)
    {
        return html.Replace("<", "&lt;").Replace(">", "&gt;");
    }

By this you escape all "<" and ">" in a string. Is this what you want?

User · Answer

If you have data that has HTML tags and you want to display it so that a person can SEE the tags, use HttpServerUtility::HtmlEncode.

If you have data that has HTML tags in it and you want the user to see the tags rendered, then display the text as is. If the text represents an entire web page, use an IFRAME for it.

If you have data that has HTML tags and you want to strip out the tags and just display the unformatted text, use a regular expression.

User · Answer

Three-step process for converting HTML into plain text.

First, install the NuGet package for HtmlAgilityPack.

Second, create this class:

    public class HtmlToText
    {
        public HtmlToText()
        {
        }

        public string Convert(string path)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.Load(path);

            StringWriter sw = new StringWriter();
            ConvertTo(doc.DocumentNode, sw);
            sw.Flush();
            return sw.ToString();
        }

        public string ConvertHtml(string html)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(html);

            StringWriter sw = new StringWriter();
            ConvertTo(doc.DocumentNode, sw);
            sw.Flush();
            return sw.ToString();
        }

        private void ConvertContentTo(HtmlNode node, TextWriter outText)
        {
            foreach (HtmlNode subnode in node.ChildNodes)
            {
                ConvertTo(subnode, outText);
            }
        }

        public void ConvertTo(HtmlNode node, TextWriter outText)
        {
            string html;
            switch (node.NodeType)
            {
                case HtmlNodeType.Comment:
                    // don't output comments
                    break;

                case HtmlNodeType.Document:
                    ConvertContentTo(node, outText);
                    break;

                case HtmlNodeType.Text:
                    // script and style must not be output
                    string parentName = node.ParentNode.Name;
                    if ((parentName == "script") || (parentName == "style"))
                        break;

                    // get text
                    html = ((HtmlTextNode)node).Text;

                    // is it in fact a special closing node output as text?
                    if (HtmlNode.IsOverlappedClosingElement(html))
                        break;

                    // check the text is meaningful and not a bunch of whitespaces
                    if (html.Trim().Length > 0)
                    {
                        outText.Write(HtmlEntity.DeEntitize(html));
                    }
                    break;

                case HtmlNodeType.Element:
                    switch (node.Name)
                    {
                        case "p":
                            // treat paragraphs as crlf
                            outText.Write("\r\n");
                            break;
                    }

                    if (node.HasChildNodes)
                    {
                        ConvertContentTo(node, outText);
                    }
                    break;
            }
        }
    }

The class above was written with reference to Judah Himango's answer.

Third, create an object of the class above and use the ConvertHtml(HTMLContent) method to convert HTML into plain text, rather than ConvertToPlainText(string html):

    HtmlToText htt = new HtmlToText();
    var plainText = htt.ConvertHtml(HTMLContent);

User · Answer

I faced a similar problem and found this to be the best solution. The code below works perfectly for me.

    using System.Text.RegularExpressions;

    private string ConvertHtml_Totext(string source)
    {
        try
        {
            string result;

            // Remove HTML development formatting.
            // Replace line breaks with a space because browsers insert a space.
            result = source.Replace("\r", " ");
            result = result.Replace("\n", " ");
            // Remove step-formatting
            result = result.Replace("\t", string.Empty);
            // Remove repeating spaces because browsers ignore them
            result = Regex.Replace(result, @"( )+", " ");

            // Remove the header (prepare first by clearing attributes)
            result = Regex.Replace(result, @"<( )*head([^>])*>", "<head>", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, @"(<( )*(/)( )*head( )*>)", "</head>", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, "(<head>).*(</head>)", string.Empty, RegexOptions.IgnoreCase);

            // Remove all scripts (prepare first by clearing attributes)
            result = Regex.Replace(result, @"<( )*script([^>])*>", "<script>", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, @"(<( )*(/)( )*script( )*>)", "</script>", RegexOptions.IgnoreCase);
            //result = Regex.Replace(result,
            //         @"(<script>)([^(<script>\.</script>)])*(</script>)",
            //         string.Empty, RegexOptions.IgnoreCase);
            result = Regex.Replace(result, @"(<script>).*(</script>)", string.Empty, RegexOptions.IgnoreCase);

            // Remove all styles (prepare first by clearing attributes)
            result = Regex.Replace(result, @"<( )*style([^>])*>", "<style>", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, @"(<( )*(/)( )*style( )*>)", "</style>", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, "(<style>).*(</style>)", string.Empty, RegexOptions.IgnoreCase);

            // Insert tabs in place of <td> tags
            result = Regex.Replace(result, @"<( )*td([^>])*>", "\t", RegexOptions.IgnoreCase);

            // Insert line breaks in place of <BR> and <LI> tags
            result = Regex.Replace(result, @"<( )*br( )*>", "\r", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, @"<( )*li( )*>", "\r", RegexOptions.IgnoreCase);

            // Insert paragraphs (double line breaks) in place of <P>, <DIV> and <TR> tags
            result = Regex.Replace(result, @"<( )*div([^>])*>", "\r\r", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, @"<( )*tr([^>])*>", "\r\r", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, @"<( )*p([^>])*>", "\r\r", RegexOptions.IgnoreCase);

            // Remove remaining tags like <a>, links, images, comments etc.:
            // anything that's enclosed inside < >
            result = Regex.Replace(result, @"<[^>]*>", string.Empty, RegexOptions.IgnoreCase);

            // Replace special characters
            result = Regex.Replace(result, @"&nbsp;", " ", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, @"&bull;", " * ", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, @"&lsaquo;", "<", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, @"&rsaquo;", ">", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, @"&trade;", "(tm)", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, @"&frasl;", "/", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, @"&lt;", "<", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, @"&gt;", ">", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, @"&copy;", "(c)", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, @"&reg;", "(r)", RegexOptions.IgnoreCase);
            // Remove all others. More can be added, see
            // http://hotwired.lycos.com/webmonkey/reference/special_characters/
            result = Regex.Replace(result, @"&(.{2,6});", string.Empty, RegexOptions.IgnoreCase);

            // for testing
            //Regex.Replace(result, this.txtRegex.Text, string.Empty, RegexOptions.IgnoreCase);

            // Make line breaking consistent
            result = result.Replace("\n", "\r");

            // Remove extra line breaks and tabs:
            // replace over 2 breaks with 2 and over 4 tabs with 4.
            // Prepare first to remove any whitespace in between the escaped
            // characters and remove redundant tabs in between line breaks.
            result = Regex.Replace(result, "(\r)( )+(\r)", "\r\r", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, "(\t)( )+(\t)", "\t\t", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, "(\t)( )+(\r)", "\t\r", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, "(\r)( )+(\t)", "\r\t", RegexOptions.IgnoreCase);
            // Remove redundant tabs
            result = Regex.Replace(result, "(\r)(\t)+(\r)", "\r\r", RegexOptions.IgnoreCase);
            // Replace multiple tabs following a line break with just one tab
            result = Regex.Replace(result, "(\r)(\t)+", "\r\t", RegexOptions.IgnoreCase);
            // Initial replacement target string for line breaks
            string breaks = "\r\r\r";
            // Initial replacement target string for tabs
            string tabs = "\t\t\t\t\t";
            for (int index = 0; index < result.Length; index++)
            {
                result = result.Replace(breaks, "\r\r");
                result = result.Replace(tabs, "\t\t\t\t");
                breaks = breaks + "\r";
                tabs = tabs + "\t";
            }

            // That's it.
            return result;
        }
        catch
        {
            MessageBox.Show("Error");
            return source;
        }
    }

Escape characters such as \n and \r had to be removed first because they cause regexes to cease working as expected. Moreover, to make the result string display correctly in the textbox, one might need to split it up and set the textbox's Lines property instead of assigning to the Text property:

    this.txtResult.Lines = StripHTML(this.txtSource.Text).Split("\r".ToCharArray());

Source: https://www.codeproject.com/Articles/11902/Convert-HTML-to-Plain-Text-2

User · Answer

For anyone looking for an exact solution to the OP's question, a textual abbreviation of a given HTML document without newlines and HTML tags, please find the solution below.

Like every proposed solution, the code below makes some assumptions:

script or style tags should not contain script and style tags as a part of the script
only major inline elements will be inlined without space, i.e. he<span>ll</span>o should output hello. List of inline tags: https://www.w3schools.com/htmL/html_blocks.asp

Considering the above, the following string extension with compiled regular expressions will output the expected plain text with regard to HTML-escaped characters, and null on null input.

    public static class StringExtensions
    {
        public static string ConvertToPlain(this string html)
        {
            if (html == null)
            {
                return html;
            }

            html = scriptRegex.Replace(html, string.Empty);
            html = inlineTagRegex.Replace(html, string.Empty);
            html = tagRegex.Replace(html, " ");
            html = HttpUtility.HtmlDecode(html);
            html = multiWhitespaceRegex.Replace(html, " ");

            return html.Trim();
        }

        private static readonly Regex inlineTagRegex = new Regex("<\\/?(a|span|sub|sup|b|i|strong|small|big|em|label|q)[^>]*>", RegexOptions.Compiled | RegexOptions.Singleline);
        private static readonly Regex scriptRegex = new Regex("<(script|style)[^>]*?>.*?</\\1>", RegexOptions.Compiled | RegexOptions.Singleline);
        private static readonly Regex tagRegex = new Regex("<[^>]+>", RegexOptions.Compiled | RegexOptions.Singleline);
        private static readonly Regex multiWhitespaceRegex = new Regex("\\s+", RegexOptions.Compiled | RegexOptions.Singleline);
    }

User · Answer

The simplest way I found:

    HtmlFilter.ConvertToPlainText(html);

The HtmlFilter class is located in Microsoft.TeamFoundation.WorkItemTracking.Controls.dll.

The dll can be found in a folder like this: %ProgramFiles%\Common Files\microsoft shared\Team Foundation Server\14.0\

In VS 2015, the dll also requires a reference to Microsoft.TeamFoundation.WorkItemTracking.Common.dll, located in the same folder.

User · Answer

I think the easiest way is to make a 'string' extension method (based on what user Richard has suggested):

    using System;
    using System.Text.RegularExpressions;

    public static class StringHelpers
    {
        public static string StripHTML(this string HTMLText)
        {
            var reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
            return reg.Replace(HTMLText, "");
        }
    }

Then just use this extension method on any 'string' variable in your program:

    var yourHtmlString = "<div class=\"someclass\"><h2>yourHtmlText</h2></div>";
    var yourTextString = yourHtmlString.StripHTML();

I use this extension method to convert HTML-formatted comments to plain text so they are displayed correctly on a Crystal Report, and it works perfectly.

User · Answer

I had the same question, except that my HTML had a simple, pre-known layout, like:

    <DIV><P>abc</P><P>def</P></DIV>

So I ended up using this simple code:

    string.Join(Environment.NewLine, XDocument.Parse(html).Root.Elements().Select(el => el.Value))

Which outputs:

    abc
    def
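One thing worth keeping in mind with this approach is that `XDocument.Parse` expects well-formed XML with a single root element, so it only suits snippets whose layout is known in advance. A rough sketch of a guarded version follows. The `XmlTextDemo` class and `TryExtract` method are hypothetical names, and returning null on failure is just one possible choice.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class XmlTextDemo
{
    // Returns the text of each child of the root element, one per line,
    // or null when the input is not well-formed XML.
    public static string TryExtract(string html)
    {
        try
        {
            return string.Join(Environment.NewLine,
                XDocument.Parse(html).Root.Elements().Select(el => el.Value));
        }
        catch (XmlException)
        {
            // Typical causes: an unclosed <br>, several top-level elements,
            // or entities such as &nbsp; that XML does not define.
            return null;
        }
    }
}
```

With `<DIV><P>abc</P><P>def</P></DIV>` this returns `abc` and `def` on separate lines, while a fragment such as `<b>Hello</b><br><p>x</p>` returns null because it has several top-level elements and an unclosed `<br>`.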

User · Answer

It has the limitation of not collapsing long inline whitespace, but it is definitely portable and respects layout like a web browser.

    static string HtmlToPlainText(string html) {
      string buf;
      string block = "address|article|aside|blockquote|canvas|dd|div|dl|dt|" +
        "fieldset|figcaption|figure|footer|form|h\\d|header|hr|li|main|nav|" +
        "noscript|ol|output|p|pre|section|table|tfoot|ul|video";

      string patNestedBlock = $"(\\s*?</?({block})[^>]*?>)+\\s*";
      buf = Regex.Replace(html, patNestedBlock, "\n", RegexOptions.IgnoreCase);

      // Replace br tag to newline.
      buf = Regex.Replace(buf, @"<(br)[^>]*>", "\n", RegexOptions.IgnoreCase);

      // (Optional) remove styles and scripts.
      buf = Regex.Replace(buf, @"<(script|style)[^>]*?>.*?</\1>", "", RegexOptions.Singleline);

      // Remove all tags.
      buf = Regex.Replace(buf, @"<[^>]*(>|$)", "", RegexOptions.Multiline);

      // Replace HTML entities.
      buf = WebUtility.HtmlDecode(buf);
      return buf;
    }

User · Answer

There is no method named 'ConvertToPlainText' in the HtmlAgilityPack, but you can convert an HTML string to a clean string with:

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(htmlString);
    var textString = doc.DocumentNode.InnerText;
    // keep the result of the replacements instead of discarding it
    textString = Regex.Replace(textString, @"<(.|\n)*?>", string.Empty).Replace("&nbsp;", "");

That works for me, but I could not find a method named 'ConvertToPlainText' in 'HtmlAgilityPack'.

User · Answer

I think it has a simple answer:

    public string RemoveHTMLTags(string HTMLCode)
    {
        string str = System.Text.RegularExpressions.Regex.Replace(HTMLCode, "<[^>]*>", "");
        return str;
    }
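Since the question only needs the first 30 to 50 characters of plain text, the same tag-stripping pattern can be wrapped in a small preview helper. This is a sketch under assumptions rather than a tested production routine: `HtmlPreview`, `ToPreview` and the `maxLength` parameter are names invented here, and replacing every tag with a space is a simplification.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlPreview
{
    private static readonly Regex Tag = new Regex("<[^>]*>");
    private static readonly Regex Whitespace = new Regex(@"\s+");

    // Hypothetical helper: plain-text preview of at most maxLength characters.
    public static string ToPreview(string html, int maxLength)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        string text = Tag.Replace(html, " ");         // tags become spaces so adjacent words do not fuse
        text = WebUtility.HtmlDecode(text);           // turn entities such as &amp; back into characters
        text = Whitespace.Replace(text, " ").Trim();  // collapse runs of whitespace into single spaces

        return text.Length <= maxLength ? text : text.Substring(0, maxLength);
    }
}
```

For the snippet in the question, `HtmlPreview.ToPreview("<b>Hello World.</b><br/><p><i>Is there anyone out there?</i></p>", 50)` returns `Hello World. Is there anyone out there?`. The trade-off of replacing every tag with a space is that inline markup inside a word, such as `he<span>ll</span>o`, comes out as `he ll o`; script and style bodies are also left in, as with any plain tag-stripping regex.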
