What is the best way to parse html in C

Question

I m looking for a library method to parse an html file with more html specific features than generic xml parsing libraries

User · Answer

You could use a HTML DTD  and the generic XML parsing libraries

User · Answer

No 3rd party lib  WebBrowser class solution that can run on Console  and Asp net  using System  using System Collections Generic  using System Text  using System Windows Forms  using System Threading   class ParseHTML       public ParseHTML           private string ReturnString       public string doParsing string html                Thread t   new Thread TParseMain           t ApartmentState   ApartmentState STA          t Start  object html           t Join            return ReturnString             private void TParseMain object html                WebBrowser wbc   new WebBrowser            wbc DocumentText    feces of a dummy             magic words                 HtmlDocument doc   wbc Document OpenNew true           doc Write  string html           this ReturnString   doc Body InnerHtml     do here something           return            usage   string myhtml     lt HTML gt  lt BODY gt This is a new HTML document  lt  BODY gt  lt  HTML gt    Console WriteLine  before     myhtml   myhtml    new ParseHTML    doParsing myhtml   Console WriteLine  after     myhtml

User · Answer

I found a project called Fizzler that takes a jQuery Sizzler approach to selecting HTML elements   It s based on HTML Agility Pack   It s currently in beta and only supports a subset of CSS selectors  but it s pretty damn cool and refreshing to use CSS selectors over nasty XPath   http   code google com p fizzler

User · Answer

The trouble with parsing HTML is that it isn t an exact science  If it was XHTML that you were parsing  then things would be a lot easier  as you mention you could use a general XML parser   Because HTML isn t necessarily well-formed XML you will come into lots of problems trying to parse it  It almost needs to be done on a site-by-site basis

User · Answer

I think  Erlend s use of HTMLDocument is the best way to go  However  I have also had good luck using this simple library   SgmlReader

User · Answer

Html Agility Pack     This is an agile HTML parser that builds a read write DOM and supports plain XPATH or XSLT  you actually don t HAVE to understand XPATH nor XSLT to use it  don t worry      It is a  NET code library that allows you to parse  out of the web  HTML files  The parser is very tolerant with  real world  malformed HTML  The object model is very similar to what proposes System Xml  but for HTML documents  or streams

User · Answer

I ve written some code that provides  LINQ to HTML  functionality   I thought I would share it here   It is based on Majestic 12   It takes the Majestic-12 results and produces LINQ XML elements   At that point you can use all your LINQ to XML tools against the HTML   As an example           IEnumerable lt XNode gt  auctionNodes   Majestic12ToXml Majestic12ToXml ConvertNodesToXml byteArrayOfAuctionHtml            foreach  XElement anchorTag in auctionNodes OfType lt XElement gt    DescendantsAndSelf  a                   if  anchorTag Attribute  href      null                  continue               Console WriteLine anchorTag Attribute  href   Value               I wanted to use Majestic-12 because I know it has a lot of built-in knowledge with regards to HTML that is found in the wild   What I ve found though is that to map the Majestic-12 results to something that LINQ will accept as XML requires additional work   The code I m including does a lot of this cleansing  but as you use this you will find pages that are rejected   You ll need to fix up the code to address that   When an exception is thrown  check exception Data  source   as it is likely set to the HTML tag that caused the exception   Handling the HTML in a nice manner is at times not trivial     So now that expectations are realistically low  here s the code     using System  using System Collections Generic  using System Linq  using System Text  using Majestic12  using System IO  using System Xml Linq  using System Diagnostics  using System Text RegularExpressions   namespace Majestic12ToXml   public class Majestic12ToXml        static public IEnumerable lt XNode gt  ConvertNodesToXml byte   htmlAsBytes             HTMLparser parser   OpenParser            parser Init htmlAsBytes            XElement currentNode   new XElement  document             HTMLchunk m12chunk   null           int xmlnsAttributeIndex   0          string originalHtml                while   m12chunk   parser ParseNext       null                 try                    Debug Assert  m12chunk bHashMode       popular default for Majestic-12 setting                  XNode newNode   null                  XElement newNodesParent   null                   switch  m12chunk oType                        case HTMLchunkType OpenTag                              Tags are added as a child to the current tag                              except when the new tag implies the closure of                             some number of ancestor tags                           newNode   ParseTagNode m12chunk  originalHtml  ref xmlnsAttributeIndex                            if  newNode    null                                currentNode   FindParentOfNewNode m12chunk  originalHtml  currentNode                                newNodesParent   currentNode                               newNodesParent Add newNode                                currentNode   newNode as XElement                                                     break                       case HTMLchunkType CloseTag                           if  m12chunk bEndClosure                                 newNode   ParseTagNode m12chunk  originalHtml  ref xmlnsAttributeIndex                                if  newNode    null                                    currentNode   FindParentOfNewNode m12chunk  originalHtml  currentNode                                    newNodesParent   currentNode                                  newNodesParent Add newNode                                                                                   else                               XElement nodeToClose   currentNode                               string m12chunkCleanedTag   CleanupTagName m12chunk sTag  originalHtml                                while  nodeToClose    null  amp  amp  nodeToClose Name LocalName    m12chunkCleanedTag                                  nodeToClose   nodeToClose Parent                               if  nodeToClose    null                                  currentNode   nodeToClose Parent                               Debug Assert currentNode    null                                                      break                       case HTMLchunkType Script                           newNode   new XElement  script    REMOVED                            newNodesParent   currentNode                          newNodesParent Add newNode                           break                       case HTMLchunkType Comment                           newNodesParent   currentNode                           if  m12chunk sTag      --                               newNode   new XComment m12chunk oHTML                           else if  m12chunk sTag       CDATA                                newNode   new XCData m12chunk oHTML                           else                             throw new Exception  Unrecognized comment sTag                             newNodesParent Add newNode                            break                       case HTMLchunkType Text                           currentNode Add m12chunk oHTML                           break                       default                          break                                              catch  Exception e                    var wrappedE   new Exception  Error using Majestic12 HTMLChunk  reason      e Message  e                       the original html is copied for tracing debugging purposes                 originalHtml   new string htmlAsBytes Skip m12chunk iChunkOffset                       Take m12chunk iChunkLength                       Select B   gt   char B  ToArray                       wrappedE Data Add  source   originalHtml                    throw wrappedE                                   while  currentNode Parent    null              currentNode   currentNode Parent           return currentNode Nodes               static XElement FindParentOfNewNode Majestic12 HTMLchunk m12chunk  string originalHtml  XElement nextPotentialParent             string m12chunkCleanedTag   CleanupTagName m12chunk sTag  originalHtml            XElement discoveredParent   null              Get a list of all ancestors         List lt XElement gt  ancestors   new List lt XElement gt             XElement ancestor   nextPotentialParent          while  ancestor    null                ancestors Add ancestor               ancestor   ancestor Parent                        Check if the new tag implies a previous tag was closed          if   form     m12chunkCleanedTag                 discoveredParent   ancestors                  Where XE   gt  m12chunkCleanedTag    XE Name                   Take 1                   Select XE   gt  XE Parent                   FirstOrDefault                      else if   td     m12chunkCleanedTag                 discoveredParent   ancestors                  TakeWhile XE   gt   tr     XE Name                   Where XE   gt  m12chunkCleanedTag    XE Name                   Take 1                   Select XE   gt  XE Parent                   FirstOrDefault                      else if   tr     m12chunkCleanedTag                 discoveredParent   ancestors                  TakeWhile XE   gt     table     XE Name                                         thead     XE Name                                         tbody     XE Name                                         tfoot     XE Name                    Where XE   gt  m12chunkCleanedTag    XE Name                   Take 1                   Select XE   gt  XE Parent                   FirstOrDefault                      else if   thead     m12chunkCleanedTag                       tbody     m12chunkCleanedTag                       tfoot     m12chunkCleanedTag                  discoveredParent   ancestors                  TakeWhile XE   gt   table     XE Name                   Where XE   gt  m12chunkCleanedTag    XE Name                   Take 1                   Select XE   gt  XE Parent                   FirstOrDefault                       return discoveredParent    nextPotentialParent             static string CleanupTagName string originalName  string originalHtml             string tagName   originalName           tagName   tagName TrimStart new char                 for nodes  lt  xml  gt           if  tagName Contains                   tagName   tagName Substring tagName LastIndexOf        1            return tagName             static readonly Regex  startsAsNumeric   new Regex     0-9    RegexOptions Compiled        static bool TryCleanupAttributeName string originalName  ref int xmlnsIndex  out string result             result   null          string attributeName   originalName           if  string IsNullOrEmpty originalName               return false           if   startsAsNumeric IsMatch originalName               return false                         transform xmlns attributes so they don t actually create any XML namespaces                    if  attributeName ToLower   Equals  xmlns                   attributeName    xmlns     xmlnsIndex ToString                  xmlnsIndex                      else               if  attributeName ToLower   StartsWith  xmlns                       attributeName    xmlns     attributeName Substring  xmlns   Length                                                   trim trailing                               attributeName   attributeName TrimEnd new char                           attributeName   attributeName Replace                               result   attributeName           return true             static Regex  weirdTag   new Regex     lt         gt               matches   lt   if  supportEmptyParas  gt       static Regex  aspnetPrecompiled   new Regex     lt      gt         matches   lt          gt       static Regex  shortHtmlComment   new Regex     lt  -  - gt         matches   lt  -Extra Images- gt        static XElement ParseTagNode Majestic12 HTMLchunk m12chunk  string originalHtml  ref int xmlnsIndex             if  string IsNullOrEmpty m12chunk sTag                  if  m12chunk sParams Length  gt  0  amp  amp  m12chunk sParams 0  ToLower   Equals  doctype                    return new XElement  doctype                 if   weirdTag IsMatch originalHtml                   return new XElement  REMOVED weirdBlockParenthesisTag                 if   aspnetPrecompiled IsMatch originalHtml                   return new XElement  REMOVED ASPNET PrecompiledDirective                 if   shortHtmlComment IsMatch originalHtml                   return new XElement  REMOVED ShortHtmlComment                    Nodes like   lt br  lt br gt   will end up with a m12chunk sTag         We discard these nodes              return null                     string tagName   CleanupTagName m12chunk sTag  originalHtml            XElement result   new XElement tagName            List lt XAttribute gt  attributes   new List lt XAttribute gt              for  int i   0  i  lt  m12chunk iParams  i                   if  m12chunk sParams i       lt  --                         an HTML comment was embedded within a tag   This comment and its contents                    will be interpreted as attributes by Majestic-12    skip this attributes                 for    i  lt  m12chunk iParams  i                           if  m12chunk sTag     --     m12chunk sTag     -- gt                            break                                     continue                             if  m12chunk sParams i          amp  amp  string IsNullOrEmpty m12chunk sValues i                    continue               string attributeName   m12chunk sParams i                if   TryCleanupAttributeName attributeName  ref xmlnsIndex  out attributeName                   continue               attributes Add new XAttribute attributeName  m12chunk sValues i                           If attributes are duplicated with different values  we complain             If attributes are duplicated with the same value  we remove all but 1          var duplicatedAttributes   attributes GroupBy A   gt  A Name  Where G   gt  G Count    gt  1            foreach  var duplicatedAttribute in duplicatedAttributes                 if  duplicatedAttribute GroupBy DA   gt  DA Value  Count    gt  1                  throw new Exception  Attribute value was given different values                 attributes RemoveAll A   gt  A Name    duplicatedAttribute Key               attributes Add duplicatedAttribute First                        result Add attributes            return result             static HTMLparser OpenParser             HTMLparser oP   new HTMLparser                The code comments in this function are from the Majestic-12 sample documentation                              This is optional  but if you want high performance then you may            want to set chunk hash mode to FALSE  This would result in tag params            being added to string arrays in HTMLchunk object called sParams and sValues  with number            of actual params being in iParams  See code below for details                        When TRUE  and its default  tag params will be added to hashtable HTMLchunk  object  oParams         oP SetChunkHashMode false               if you set this to true then original parsed HTML for given chunk will be kept -             this will reduce performance somewhat  but may be desireable in some cases where            reconstruction of HTML may be necessary         oP bKeepRawHTML   false              if set to true  it is false by default   then entities will be decoded  this is essential            if you want to get strings that contain final representation of the data in HTML  however            you should be aware that if you want to use such strings into output HTML string then you will            need to do Entity encoding or same string may fail later         oP bDecodeEntities   true              we have option to keep most entities as is - only replace stuff like  amp nbsp              this is called Mini Entities mode - it is handy when HTML will need            to be re-created after it was parsed  though in this case really            entities should not be parsed at all         oP bDecodeMiniEntities   true           if   oP bDecodeEntities  amp  amp  oP bDecodeMiniEntities              oP InitMiniEntities                if set to true  then in case of Comments and SCRIPT tags the data set to oHTML will be            extracted BETWEEN those tags  rather than include complete RAW HTML that includes tags too            this only works if auto extraction is enabled         oP bAutoExtractBetweenTagsOnly   true              if true then comments will be extracted automatically         oP bAutoKeepComments   true              if true then scripts will be extracted automatically           oP bAutoKeepScripts   true              if this option is true then whitespace before start of tag will be compressed to single            space character in string       if false then full whitespace before tag will be returned  slower             you may only want to set it to false if you want exact whitespace between tags  otherwise it is just            a waste of CPU cycles         oP bCompressWhiteSpaceBeforeTag   true              if true  default  then tags with attributes marked as CLOSED    at the end  will be automatically            forced to be considered as open tags - this is no good for XML parsing  but I keep it for backwards            compatibility for my stuff as it makes it easier to avoid checking for same tag which is both closed            or open         oP bAutoMarkClosedTagsWithParamsAsOpen   false           return oP

User · Answer

You could use TidyNet Tidy to convert the HTML to XHTML  and then use an XML parser   Another alternative would be to use the builtin engine mshtml   using mshtml      object   oPageText     html    HTMLDocument doc   new HTMLDocumentClass    IHTMLDocument2 doc2    IHTMLDocument2 doc  doc2 write oPageText     This allows you to use javascript-like functions like getElementById

User · Answer

Depending on your needs you might  go for the more feature-rich libraries  I tried most all of the solutions suggested  but what stood out head  amp  shoulders was Html Agility Pack  It is a very forgiving and flexible parser

User · Answer

I wrote some classes for parsing HTML tags in C   They are nice and simple if they meet your particular needs   You can read an article about them and download the source code at http   www blackbeltcoder com Articles strings parsing-html-tags-in-c   There s also an article about a generic parsing helper class at http   www blackbeltcoder com Articles strings a-text-parsing-helper-class

User · Answer

I ve used ZetaHtmlTidy in the past to load random websites and then hit against various parts of the content with xpath  eg  html body  p  class  textblock     It worked well but there were some exceptional sites that it had problems with  so I don t know if it s the absolute best solution

User · Answer

You can do a lot without going nuts on 3rd-party products and mshtml  i e  interop    use the System Windows Forms WebBrowser   From there  you can do such things as  GetElementById  on an HtmlDocument or  GetElementsByTagName  on HtmlElements   If you want to actually inteface with the browser  simulate button clicks for example   you can use a little reflection  imo a lesser evil than Interop  to do it   var wb   new WebBrowser         tell the browser to navigate  tangential to this question   Then on the Document Completed event you can simulate clicks like this   var doc   wb Browser Document var elem   doc GetElementById elementId   object obj   elem DomElement  System Reflection MethodInfo mi   obj GetType   GetMethod  click    mi Invoke obj  new object 0      you can do similar reflection stuff to submit forms  etc   Enjoy

User · Answer

Try this script   http   www biterscripting com SS URLs html  When I use it with this url   script SS URLs txt URL  http   stackoverflow com questions 56107 what-is-the-best-way-to-parse-html-in-c     It shows me all the links on the page for this thread   http   sstatic net so all css http   sstatic net so favicon ico http   sstatic net so apple-touch-icon png         You can modify that script to check for images  variables  whatever

User · Answer

Use WatiN if you need to see the impact of JS on the page  and you re prepared to start a browser

User · Answer

The Html Agility Pack has been mentioned before - if you are going for speed  you might also want to check out the Majestic-12 HTML parser  Its handling is rather clunky  but it delivers a really fast parsing experience

[c#] What is the best way to parse html in C#?

Examples related to c#

Examples related to .net

Examples related to html

Examples related to parsing

Examples related to html-content-extraction