How can I Convert HTML to Text in C#?

Question

I'm looking for C# code to convert an HTML document to plain text.

I'm not looking for simple tag stripping, but something that will output plain text with a reasonable preservation of the original layout.

The output should look like this: Html2Txt at W3C

I've looked at the HTML Agility Pack, but I don't think that's what I need. Does anyone have any other suggestions?

EDIT: I just downloaded the HTML Agility Pack from CodePlex and ran the Html2Txt project. What a disappointment (at least the module that does HTML to text conversion)! All it did was strip the tags, flatten the tables, etc. The output didn't look anything like what Html2Txt @ W3C produced. Too bad that source doesn't seem to be available. I was looking to see if there is a more "canned" solution available.

EDIT 2: Thank you everybody for your suggestions. FlySwat tipped me in the direction I wanted to go. I can use the System.Diagnostics.Process class to run lynx.exe with the "-dump" switch to send the text to standard output, and capture the stdout with ProcessStartInfo.UseShellExecute = false and ProcessStartInfo.RedirectStandardOutput = true. I'll wrap all this in a C# class. This code will be called only occasionally, so I'm not too concerned about spawning a new process vs. doing it in code. Plus, Lynx is FAST!!
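The approach described in EDIT 2 could be sketched roughly as follows. This is only a minimal sketch, not the asker's actual class: the names `LynxDump`, `BuildArguments` and `HtmlFileToText` are hypothetical, and it assumes Lynx is installed at the path you pass in and accepts the HTML file path as an argument after `-dump`.

```csharp
using System;
using System.Diagnostics;

// Hypothetical wrapper: names are illustrative, not taken from the question.
public static class LynxDump
{
    // Builds the command line: the -dump switch followed by the quoted file path.
    public static string BuildArguments(string htmlFilePath)
    {
        return "-dump \"" + htmlFilePath + "\"";
    }

    // Runs lynx and returns whatever it writes to standard output.
    public static string HtmlFileToText(string lynxPath, string htmlFilePath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = lynxPath,                 // e.g. the full path to lynx.exe (assumption)
            Arguments = BuildArguments(htmlFilePath),
            UseShellExecute = false,             // required in order to redirect stdout
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            // Read all output before waiting, so a full pipe cannot block the child.
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return output;
        }
    }
}
```

Reading standard output to the end before calling `WaitForExit` avoids a deadlock when the child process fills the output pipe.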

User · Answer

I've heard from a reliable source that, if you're doing HTML parsing in .Net, you should look at the HTML agility pack again: http://www.codeplex.com/htmlagilitypack

Some sample on SO: HTML Agility pack - parsing tables

User · Answer

I had some decoding issues with HtmlAgility and I didn't want to invest time investigating it. Instead I used that utility from the Microsoft Team Foundation API:

    var text = HtmlFilter.ConvertToPlainText(htmlContent);

User · Answer

If you are using .NET Framework 4.5 you can use System.Net.WebUtility.HtmlDecode(), which takes an HTML-encoded string and returns a decoded string.

Documented on MSDN at: http://msdn.microsoft.com/en-us/library/system.net.webutility.htmldecode(v=vs.110).aspx

You can use this in a Windows Store app as well.
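Note that HtmlDecode only turns character entities back into characters; it does not remove tags or reproduce any layout. A small sketch to illustrate (the class name `DecodeDemo` and the sample string are made up for this example):

```csharp
using System;
using System.Net;

public static class DecodeDemo
{
    // Thin wrapper so the behaviour is easy to call and inspect.
    public static string Decode(string encoded)
    {
        return WebUtility.HtmlDecode(encoded);
    }

    public static void Main()
    {
        // Entities are decoded, but the resulting <b> tags are left in place.
        Console.WriteLine(Decode("Fish &amp; chips &lt;b&gt;today&lt;/b&gt;"));
        // prints: Fish & chips <b>today</b>
    }
}
```

So on its own it is a useful last step after tags have been dealt with some other way, rather than a complete HTML-to-text conversion.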

User · Answer

Just a note about the HtmlAgilityPack for posterity. The project contains an example of converting HTML to text, which, as noted by the OP, does not handle whitespace at all like anyone writing HTML would envisage. There are full-text rendering solutions out there, noted by others to this question, which this is not (it cannot even handle tables in its current form), but it is lightweight and fast, which is all I wanted for creating a simple text version of HTML emails.

    using System; // needed for StringComparison
    using System.IO;
    using System.Text.RegularExpressions;
    using HtmlAgilityPack;

    // small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
    public static class HtmlToText
    {
        public static string Convert(string path)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.Load(path);
            return ConvertDoc(doc);
        }

        public static string ConvertHtml(string html)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(html);
            return ConvertDoc(doc);
        }

        public static string ConvertDoc(HtmlDocument doc)
        {
            using (StringWriter sw = new StringWriter())
            {
                ConvertTo(doc.DocumentNode, sw);
                sw.Flush();
                return sw.ToString();
            }
        }

        internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
        {
            foreach (HtmlNode subnode in node.ChildNodes)
            {
                ConvertTo(subnode, outText, textInfo);
            }
        }

        public static void ConvertTo(HtmlNode node, TextWriter outText)
        {
            ConvertTo(node, outText, new PreceedingDomTextInfo(false));
        }

        internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
        {
            string html;
            switch (node.NodeType)
            {
                case HtmlNodeType.Comment:
                    // don't output comments
                    break;
                case HtmlNodeType.Document:
                    ConvertContentTo(node, outText, textInfo);
                    break;
                case HtmlNodeType.Text:
                    // script and style must not be output
                    string parentName = node.ParentNode.Name;
                    if ((parentName == "script") || (parentName == "style"))
                    {
                        break;
                    }
                    // get text
                    html = ((HtmlTextNode)node).Text;
                    // is it in fact a special closing node output as text?
                    if (HtmlNode.IsOverlappedClosingElement(html))
                    {
                        break;
                    }
                    // check the text is meaningful and not a bunch of whitespaces
                    if (html.Length == 0)
                    {
                        break;
                    }
                    if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace)
                    {
                        html = html.TrimStart();
                        if (html.Length == 0) { break; }
                        textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
                    }
                    outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
                    // the assignment inside the condition is intentional
                    if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1]))
                    {
                        outText.Write(' ');
                    }
                    break;
                case HtmlNodeType.Element:
                    string endElementString = null;
                    bool isInline;
                    bool skip = false;
                    int listIndex = 0;
                    switch (node.Name)
                    {
                        case "nav":
                            skip = true;
                            isInline = false;
                            break;
                        case "body":
                        case "section":
                        case "article":
                        case "aside":
                        case "h1":
                        case "h2":
                        case "header":
                        case "footer":
                        case "address":
                        case "main":
                        case "div":
                        case "p": // stylistic - adjust as you tend to use
                            if (textInfo.IsFirstTextOfDocWritten)
                            {
                                outText.Write("\r\n");
                            }
                            endElementString = "\r\n";
                            isInline = false;
                            break;
                        case "br":
                            outText.Write("\r\n");
                            skip = true;
                            textInfo.WritePrecedingWhiteSpace = false;
                            isInline = true;
                            break;
                        case "a":
                            if (node.Attributes.Contains("href"))
                            {
                                string href = node.Attributes["href"].Value.Trim();
                                if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase) == -1)
                                {
                                    endElementString = "<" + href + ">";
                                }
                            }
                            isInline = true;
                            break;
                        case "li":
                            if (textInfo.ListIndex > 0)
                            {
                                outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
                            }
                            else
                            {
                                outText.Write("\r\n*\t"); // using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
                            }
                            isInline = false;
                            break;
                        case "ol":
                            listIndex = 1;
                            goto case "ul";
                        case "ul": // not handling nested lists any differently at this stage - that is getting close to rendering problems
                            endElementString = "\r\n";
                            isInline = false;
                            break;
                        case "img": // inline-block in reality
                            if (node.Attributes.Contains("alt"))
                            {
                                outText.Write('[' + node.Attributes["alt"].Value);
                                endElementString = "]";
                            }
                            if (node.Attributes.Contains("src"))
                            {
                                outText.Write('<' + node.Attributes["src"].Value + '>');
                            }
                            isInline = true;
                            break;
                        default:
                            isInline = true;
                            break;
                    }
                    if (!skip && node.HasChildNodes)
                    {
                        ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten) { ListIndex = listIndex });
                    }
                    if (endElementString != null)
                    {
                        outText.Write(endElementString);
                    }
                    break;
            }
        }
    }

    internal class PreceedingDomTextInfo
    {
        public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten)
        {
            IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
        }
        public bool WritePrecedingWhiteSpace { get; set; }
        public bool LastCharWasSpace { get; set; }
        public readonly BoolWrapper IsFirstTextOfDocWritten;
        public int ListIndex { get; set; }
    }

    internal class BoolWrapper
    {
        public BoolWrapper() { }
        public bool Value { get; set; }
        public static implicit operator bool(BoolWrapper boolWrapper)
        {
            return boolWrapper.Value;
        }
        public static implicit operator BoolWrapper(bool boolWrapper)
        {
            return new BoolWrapper { Value = boolWrapper };
        }
    }

As an example, the following HTML code...

    <!DOCTYPE HTML>
    <html>
        <head>
        </head>
        <body>
            <header>
                Whatever Inc.
            </header>
            <main>
                <p>
                    Thanks for your enquiry. As this is the 1<sup>st</sup> time you have contacted us, we would like to clarify a few things:
                </p>
                <ol>
                    <li>
                        Please confirm this is your email by replying.
                    </li>
                    <li>
                        Then perform this step.
                    </li>
                </ol>
                <p>
                    Please solve this <img alt="complex equation" src="http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png"/>. Then, in any order, could you please:
                </p>
                <ul>
                    <li>
                        a point.
                    </li>
                    <li>
                        another point, with a <a href="http://en.wikipedia.org/wiki/Hyperlink">hyperlink</a>.
                    </li>
                </ul>
                <p>
                    Sincerely,
                </p>
                <p>
                    The whatever.com team
                </p>
            </main>
            <footer>
                Ph: 000 000 000<br/>
                mail: whatever st
            </footer>
        </body>
    </html>

...will be transformed into:

    Whatever Inc.


    Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:

    1.  Please confirm this is your email by replying.
    2.  Then perform this step.

    Please solve this [complex equation<http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png>]. Then, in any order, could you please:

    *   a point.
    *   another point, with a hyperlink<http://en.wikipedia.org/wiki/Hyperlink>.

    Sincerely,

    The whatever.com team


    Ph: 000 000 000
    mail: whatever st

...as opposed to:

            Whatever Inc.


                Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:

                    Please confirm this is your email by replying.

                    Then perform this step.


                Please solve this . Then, in any order, could you please:

                    a point.

                    another point, with a hyperlink.


                Sincerely,


                The whatever.com team

            Ph: 000 000 000
            mail: whatever st

User · Answer

This function converts "What You See in the browser" to plain text with line breaks. (If you want to see the result in the browser, just use the commented return value.)

    public string HtmlFileToText(string filePath)
    {
        using (var browser = new WebBrowser())
        {
            string text = File.ReadAllText(filePath);
            browser.ScriptErrorsSuppressed = true;
            browser.Navigate("about:blank");
            browser?.Document?.OpenNew(false);
            browser?.Document?.Write(text);
            return browser.Document?.Body?.InnerText;
            //return browser.Document?.Body?.InnerText.Replace(Environment.NewLine, "<br />");
        }
    }

User · Answer

Have you tried http://www.aaronsw.com/2002/html2text/? It's Python, but open source.

User · Answer

You can use the WebBrowser control to render your HTML content in memory. After the LoadCompleted event has fired:

    IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document;
    string innerHTML = htmlDoc.body.innerHTML;
    string innerText = htmlDoc.body.innerText;

User · Answer

I have recently blogged on a solution that worked for me by using a Markdown XSLT file to transform the HTML source. The HTML source will of course need to be valid XML first.

User · Answer

What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other text browsers. This is much harder to do than you would expect.

User · Answer

Here is the short sweet answer using HtmlAgilityPack. You can run this in LinqPad.

    var html = "<div>..whatever html</div>";
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    var plainText = doc.DocumentNode.InnerText;

I simply use HtmlAgilityPack in any .NET project that needs HTML parsing. It's simple, reliable, and fast.

User · Answer

I have used Detagger in the past. It does a pretty good job of formatting the HTML as text and is more than just a tag remover.

User · Answer

You could use this:

    public static string StripHTML(string HTMLText, bool decode = true)
    {
        Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
        var stripped = reg.Replace(HTMLText, "");
        return decode ? HttpUtility.HtmlDecode(stripped) : stripped;
    }

Updated: Thanks for the comments; I have updated to improve this function.

User · Answer

Another post suggests the HTML agility pack:

> This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

User · Answer

The easiest would probably be tag stripping combined with replacement of some tags with text layout elements, like dashes for list elements (li) and line breaks for br's and p's. It shouldn't be too hard to extend this to tables.
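A rough sketch of that idea, using only regular expressions from the base class library, might look like this. The class name `SimpleHtmlToText` and the particular tag-to-text choices (a dash for each li, a line break for br, a blank line after each p) are assumptions made for illustration; regex-based stripping will not cope with every kind of malformed HTML, and tables are not handled.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: tag stripping plus a few layout substitutions.
public static class SimpleHtmlToText
{
    public static string Convert(string html)
    {
        string text = html;

        // Remove script and style blocks entirely, including their content.
        text = Regex.Replace(text, @"<(script|style)\b[^>]*>.*?</\1\s*>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // <br> becomes a line break.
        text = Regex.Replace(text, @"<br\s*/?>", "\n", RegexOptions.IgnoreCase);

        // The end of a paragraph becomes a blank line.
        text = Regex.Replace(text, @"</p\s*>", "\n\n", RegexOptions.IgnoreCase);

        // Each list item starts on a new line with a dash.
        text = Regex.Replace(text, @"<li\b[^>]*>", "\n- ", RegexOptions.IgnoreCase);

        // Strip whatever tags remain.
        text = Regex.Replace(text, @"<[^>]+>", "");

        // Finally decode entities such as &amp; and trim the edges.
        return WebUtility.HtmlDecode(text).Trim();
    }
}
```

For example, `SimpleHtmlToText.Convert("<p>Hello<br>world</p><ul><li>one</li><li>two &amp; three</li></ul>")` yields the lines "Hello" and "world", a gap, then "- one" and "- two & three".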

User · Answer

In Genexus you can do this with a regex:

    &pattern = '<[^>]+>'
    &TSTRPNOT = &TSTRPNOT.ReplaceRegEx(&pattern, "")

User · Answer

Assuming you have well-formed HTML, you could also maybe try an XSL transform.

Here's an example:

    using System;
    using System.IO;
    using System.Xml.Linq;
    using System.Xml.XPath;
    using System.Xml.Xsl;

    class Html2TextExample
    {
        public static string Html2Text(XDocument source)
        {
            var writer = new StringWriter();
            Html2Text(source, writer);
            return writer.ToString();
        }

        public static void Html2Text(XDocument source, TextWriter output)
        {
            Transformer.Transform(source.CreateReader(), null, output);
        }

        public static XslCompiledTransform _transformer;
        public static XslCompiledTransform Transformer
        {
            get
            {
                if (_transformer == null)
                {
                    _transformer = new XslCompiledTransform();
                    var xsl = XDocument.Parse(@"<?xml version='1.0'?><xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"" exclude-result-prefixes=""xsl""><xsl:output method=""html"" indent=""yes"" version=""4.0"" omit-xml-declaration=""yes"" encoding=""UTF-8"" /><xsl:template match=""/""><xsl:value-of select=""."" /></xsl:template></xsl:stylesheet>");
                    _transformer.Load(xsl.CreateNavigator());
                }
                return _transformer;
            }
        }

        static void Main(string[] args)
        {
            var html = XDocument.Parse("<html><body><div>Hello world!</div></body></html>");
            var text = Html2Text(html);
            Console.WriteLine(text);
        }
    }

User · Answer

Because I wanted conversion to plain text with LF and bullets, I found this pretty solution on CodeProject, which covers many conversion use cases: Convert HTML to Plain Text

Yep, it looks big, but it works fine.

User · Answer

This is another solution to convert HTML to Text or RTF in C#:

    SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
    h.OutputFormat = HtmlToRtf.eOutputFormat.TextUnicode;
    string text = h.ConvertString(htmlString);

This library is not free; it is a commercial product, and it is my own product.

User · Answer

Try the easy and usable way: just call StripHTML(WebBrowserControl_name);

    public string StripHTML(WebBrowser webp)
    {
        try
        {
            // NOTE: 'doc' is not declared in the original snippet; it is presumably
            // the document of 'webp' obtained as an mshtml IHTMLDocument2.
            doc.execCommand("SelectAll", true, null);
            IHTMLSelectionObject currentSelection = doc.selection;

            if (currentSelection != null)
            {
                IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
                if (range != null)
                {
                    currentSelection.empty();
                    return range.text;
                }
            }
        }
        catch (Exception ep)
        {
            MessageBox.Show(ep.Message);
        }
        return "";
    }

User · Answer

I don't know C#, but there is a fairly small and easy to read Python html2txt script here: http://www.aaronsw.com/2002/html2text/
