How do you parse and process HTML/XML in PHP?

Question

How can one parse HTML/XML and extract information from it?

User · Answer

This sounds like a good task description of W3C XPath technology. It's easy to express queries like "return all href attributes in img tags that are nested in <foo><bar><baz> elements". Not being a PHP buff, I can't tell you in what form XPath may be available. If you can call an external program to process the HTML file, you should be able to use a command-line version of XPath. For a quick intro, see http://en.wikipedia.org/wiki/XPath.
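For what it's worth, PHP exposes XPath through the DOMXPath class of the bundled DOM extension. Below is a minimal sketch of the query described above. The sample markup and URLs are made up for illustration; only the foo/bar/baz nesting and the img/href names come from the example query.

```php
<?php
// Hypothetical sample document; foo/bar/baz mirror the query in the text.
$xml = <<<XML
<root>
  <foo><bar><baz>
    <img href="http://example.com/a.png"/>
    <p><img href="http://example.com/b.png"/></p>
  </baz></bar></foo>
  <img href="http://example.com/not-nested.png"/>
</root>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);

// Select the href attribute of every img below foo > bar > baz.
$hrefs = [];
foreach ($xpath->query('//foo/bar/baz//img/@href') as $attr) {
    $hrefs[] = $attr->value;
}

print_r($hrefs);
```

The img element outside the foo/bar/baz chain is not matched, which is the point of expressing the nesting in the query rather than filtering afterwards.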

User · Answer

Yes, you can use Simple HTML DOM for the purpose. However, I have worked quite a lot with Simple HTML DOM, particularly for web scraping, and have found it to be too vulnerable. It does the basic job, but I wouldn't recommend it anyway. I have never used cURL for the purpose, but what I have learned is that cURL can do the job much more efficiently and is much more solid. Kindly check out this link: scraping-websites-with-curl

User · Answer

This is commonly referred to as screen scraping, by the way. The library I have used for this is Simple HTML Dom Parser.

User · Answer

Advanced Html Dom is a Simple HTML DOM replacement that offers the same interface, but it's DOM-based, which means none of the associated memory issues occur. It also has full CSS support, including jQuery extensions.

User · Answer

There are many ways to process HTML/XML DOM, most of which have already been mentioned. Hence, I won't make any attempt to list those myself.

I merely want to add that I personally prefer using the DOM extension, and why:

- it makes optimal use of the performance advantage of the underlying C code
- it's OO PHP (and allows me to subclass it)
- it's rather low level (which allows me to use it as a non-bloated foundation for more advanced behavior)
- it provides access to every part of the DOM (unlike e.g. SimpleXml, which ignores some of the lesser known XML features)
- it has a syntax used for DOM crawling that's similar to the syntax used in native JavaScript

And while I miss the ability to use CSS selectors for DOMDocument, there is a rather simple and convenient way to add this feature: subclassing DOMDocument and adding JS-like querySelectorAll and querySelector methods to your subclass.

For parsing the selectors, I recommend using the very minimalistic CssSelector component from the Symfony framework. This component just translates CSS selectors to XPath selectors, which can then be fed into a DOMXpath to retrieve the corresponding NodeList.

You can then use this (still very low level) subclass as a foundation for more high-level classes, intended to e.g. parse very specific types of XML or add more jQuery-like behavior.

The code below comes straight out of my DOM-Query library and uses the technique I described.

For HTML parsing:

    namespace PowerTools;

    use \Symfony\Component\CssSelector\CssSelector as CssSelector;

    class DOM_Document extends \DOMDocument {
        public function __construct($data = false, $doctype = 'html', $encoding = 'UTF-8', $version = '1.0') {
            parent::__construct($version, $encoding);
            if ($doctype && $doctype === 'html') {
                $this->loadHTML($data);
            } else {
                $this->loadXML($data);
            }
        }

        public function querySelectorAll($selector, $contextnode = null) {
            // Toggle HTML-specific selector handling based on the doctype
            if (isset($this->doctype->name) && $this->doctype->name == 'html') {
                CssSelector::enableHtmlExtension();
            } else {
                CssSelector::disableHtmlExtension();
            }
            // Translate the CSS selector to XPath and run it
            $xpath = new \DOMXpath($this);
            return $xpath->query(CssSelector::toXPath($selector, 'descendant::'), $contextnode);
        }

        public function loadHTMLFile($filename, $options = 0) {
            $this->loadHTML(file_get_contents($filename), $options);
        }

        public function loadHTML($source, $options = 0) {
            if ($source && $source != '') {
                $data = trim($source);
                $html5 = new HTML5(array('targetDocument' => $this, 'disableHtmlNsInDom' => true));
                $data_start = mb_substr($data, 0, 10);
                if (strpos($data_start, '<!DOCTYPE ') === 0 || strpos($data_start, '<html>') === 0) {
                    // Full document: parse it directly
                    $html5->loadHTML($data);
                } else {
                    // Fragment: create an empty document and append the fragment to its body.
                    // Note: $encoding is not defined in this method as excerpted here.
                    $this->loadHTML('<!DOCTYPE html><html><head><meta charset="' . $encoding . '" /></head><body></body></html>');
                    $t = $html5->loadHTMLFragment($data);
                    $docbody = $this->getElementsByTagName('body')->item(0);
                    while ($t->hasChildNodes()) {
                        $docbody->appendChild($t->firstChild);
                    }
                }
            }
        }
    }

See also Parsing XML documents with CSS selectors by Symfony's creator Fabien Potencier on his decision to create the CssSelector component for Symfony and how to use it.
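The subclassing idea itself can be tried without any third-party code. The sketch below is not the DOM-Query implementation: the class name SelectorDocument and its tiny hand-written translator (which only understands a bare tag name, a single .class or a single #id) are assumptions made for illustration. A real implementation would presumably delegate the translation to a full component such as Symfony's CssSelector.

```php
<?php
// Hypothetical DOMDocument subclass with JS-like query methods.
class SelectorDocument extends DOMDocument
{
    // Translate ONE simple selector (tag, .class or #id) to XPath.
    private function toXPath(string $selector): string
    {
        if (preg_match('/^#([\w-]+)$/', $selector, $m)) {
            return "//*[@id='{$m[1]}']";
        }
        if (preg_match('/^\.([\w-]+)$/', $selector, $m)) {
            // Match the class as a whole word inside the class attribute.
            return "//*[contains(concat(' ', normalize-space(@class), ' '), ' {$m[1]} ')]";
        }
        if (preg_match('/^[a-zA-Z][\w-]*$/', $selector)) {
            return '//' . $selector;
        }
        throw new InvalidArgumentException("Unsupported selector: $selector");
    }

    public function querySelectorAll(string $selector): DOMNodeList
    {
        $xpath = new DOMXPath($this);
        return $xpath->query($this->toXPath($selector));
    }

    public function querySelector(string $selector): ?DOMNode
    {
        return $this->querySelectorAll($selector)->item(0);
    }
}

libxml_use_internal_errors(true); // keep parse warnings out of the output

$doc = new SelectorDocument();
$doc->loadHTML('<div id="main"><p class="note big">Hi</p><p>Bye</p></div>');

$count    = $doc->querySelectorAll('p')->length;
$noteText = $doc->querySelector('.note')->textContent;
$mainName = $doc->querySelector('#main')->nodeName;

echo "$count $noteText $mainName\n";
```

Because the methods live on the document subclass, every node list they return is an ordinary DOMNodeList, so the rest of the DOM API keeps working unchanged.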

User · Answer

I recommend PHP Simple HTML DOM Parser. It really has nice features, like:

    foreach ($html->find('img') as $element)
        echo $element->src . '<br>';

User · Answer

I have written a general purpose XML parser that can easily handle GB files. It's based on XMLReader and it's very easy to use:

    $source = new XmlExtractor("path/to/tag", "/path/to/file.xml");
    foreach ($source as $tag) {
        echo $tag->field1;
        echo $tag->field2->subfield1;
    }

Here's the GitHub repo: XmlExtractor
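To give a rough idea of how an XMLReader-based extractor can keep memory usage low, here is a minimal sketch. It is not the XmlExtractor API: the function name streamElements and the sample XML are made up for illustration, and it reads from a string only to stay self-contained.

```php
<?php
/**
 * Yield each <$tagName> element of an XML string as a SimpleXMLElement.
 * Hypothetical helper for illustration; it is not the XmlExtractor API.
 */
function streamElements(string $xml, string $tagName): Generator
{
    $reader = new XMLReader();
    $reader->XML($xml); // use $reader->open($path) to stream from a file instead

    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $tagName) {
            // Only the current element is expanded into memory.
            yield simplexml_load_string($reader->readOuterXml());
        }
    }
    $reader->close();
}

$xml = '<list><item><name>a</name></item><item><name>b</name></item></list>';

$names = [];
foreach (streamElements($xml, 'item') as $item) {
    $names[] = (string) $item->name;
}

print_r($names);
```

The generator hands out one element at a time, so the caller can iterate with foreach much like in the usage shown above, without the whole document ever being held as a tree.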

User · Answer

Another option you can try is QueryPath. It's inspired by jQuery, but on the server in PHP, and is used in Drupal.

User · Answer

We have created quite a few crawlers for our needs before. At the end of the day, it is usually simple regular expressions that do the job best. While the libraries listed above are good for the purposes they were created for, if you know what you are looking for, regular expressions are a safer way to go, as you can also handle invalid HTML/XHTML structures, which would fail if loaded via most of the parsers.

User · Answer

With FluidXML you can query and iterate XML using XPath and CSS Selectors.

    $doc = fluidxml('<html>...</html>');

    $title = $doc->query('//head/title')[0]->nodeValue;

    $doc->query('//body/p', 'div.active', '#bgId')
        ->each(function($i, $node) {
            // $node is a DOMNode.
            $tag   = $node->nodeName;
            $text  = $node->nodeValue;
            $class = $node->getAttribute('class');
        });

https://github.com/servo-php/fluidxml

User · Answer

XML_HTMLSax is rather stable, even if it's not maintained any more. Another option could be to pipe your HTML through HTML Tidy and then parse it with standard XML tools.

User · Answer

The best method for parsing XML:

    $xml = 'http://www.example.com/rss.xml';
    // The variable holds a URL, so load it with simplexml_load_file()
    $rss = simplexml_load_file($xml);
    $i = 0;
    foreach ($rss->channel->item as $feedItem) {
        $i++;
        echo $title = $feedItem->title;
        echo '<br>';
        echo $link = $feedItem->link;
        echo '<br>';
        if ($feedItem->description) {
            $des = $feedItem->description;
        } else {
            $des = '';
        }
        echo $des;
        echo '<br>';
        if ($i > 5) break;
    }

User · Answer

You could try using something like HTML Tidy to clean up any "broken" HTML and convert the HTML to XHTML, which you can then parse with an XML parser.
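A minimal sketch of that pipeline, assuming PHP's optional tidy extension is installed. The helper name htmlToXhtml is made up, and the function simply returns null when the extension is missing, so treat it as an illustration rather than a drop-in solution.

```php
<?php
/**
 * Repair HTML with the tidy extension and return it as XHTML,
 * or null when the extension is not available.
 * Hypothetical helper name; the config keys are HTML Tidy options.
 */
function htmlToXhtml(string $html): ?string
{
    if (!extension_loaded('tidy')) {
        return null;
    }
    $config = [
        'output-xhtml'     => true, // emit well-formed XHTML
        'numeric-entities' => true, // avoid named entities an XML parser may not know
    ];
    $out = tidy_repair_string($html, $config, 'utf8');
    return $out === false ? null : $out;
}

$xhtml = htmlToXhtml('<title>Demo</title><p>Unclosed paragraph<br>with a break');

if ($xhtml !== null) {
    // The repaired markup can now go through a plain XML parser.
    $doc = new DOMDocument();
    $doc->loadXML($xhtml);
    $p = $doc->getElementsByTagName('p')->item(0);
    if ($p !== null) {
        echo $p->textContent, "\n";
    }
}
```

The numeric-entities option is there because a generic XML parser does not know HTML's named entities such as &nbsp; unless it loads the XHTML DTD.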

User · Answer

For 1a and 2: I would vote for the new Symfony Component class DOMCrawler (DomCrawler). This class allows queries similar to CSS selectors. Take a look at this presentation for real-world examples: news-of-the-symfony2-world. The component is designed to work standalone and can be used without Symfony. The only drawback is that it will only work with PHP 5.3 or newer.

User · Answer

Just use DOMDocument->loadHTML() and be done with it. libxml's HTML parsing algorithm is quite good and fast, and contrary to popular belief, does not choke on malformed HTML.
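A small sketch of what that looks like in practice. The broken markup is made up for illustration; libxml_use_internal_errors() is used so that the parser's complaints are collected instead of being emitted as PHP warnings.

```php
<?php
// Deliberately malformed markup: unclosed <p> and <a> tags, no <html>/<body>.
$html = '<p>First paragraph<p>Second with a <a href="/x">link</p>';

// Collect libxml's parse errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

$problems = libxml_get_errors(); // inspect these if you care what was repaired
libxml_clear_errors();

$paragraphs = $doc->getElementsByTagName('p')->length;
$href = $doc->getElementsByTagName('a')->item(0)->getAttribute('href');

echo "paragraphs: $paragraphs, href: $href\n";
```

The parser closes the dangling tags on its own, so the resulting tree can be queried like any well-formed document.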

User · Answer

One general approach I haven't seen mentioned here is to run HTML through Tidy, which can be set to spit out guaranteed-valid XHTML. Then you can use any old XML library on it. But to your specific problem, you should take a look at this project: http://fivefilters.org/content-only/ -- it's a modified version of the Readability algorithm, which is designed to extract just the textual content (not headers and footers) from a page.

User · Answer

Simple HTML DOM is a great open-source parser: simplehtmldom.sourceforge. It treats DOM elements in an object-oriented way, and the new iteration has a lot of coverage for non-compliant code. There are also some great functions like you'd see in JavaScript, such as the "find" function, which will return all instances of elements of that tag name. I've used this in a number of tools, testing it on many different types of web pages, and I think it works great.

User · Answer

Try Simple HTML DOM Parser.

- An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way!
- Requires PHP 5+.
- Supports invalid HTML.
- Find tags on an HTML page with selectors just like jQuery.
- Extract contents from HTML in a single line.

Examples:

How to get HTML elements:

    // Create DOM from URL or file
    $html = file_get_html('http://www.example.com/');

    // Find all images
    foreach ($html->find('img') as $element)
        echo $element->src . '<br>';

    // Find all links
    foreach ($html->find('a') as $element)
        echo $element->href . '<br>';

How to modify HTML elements:

    // Create DOM from string
    $html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

    $html->find('div', 1)->class = 'bar';

    $html->find('div[id=hello]', 0)->innertext = 'foo';

    echo $html;

Extract content from HTML:

    // Dump contents (without tags) from HTML
    echo file_get_html('http://www.google.com/')->plaintext;

Scraping Slashdot:

    // Create DOM from URL
    $html = file_get_html('http://slashdot.org/');

    // Find all article blocks
    foreach ($html->find('div.article') as $article) {
        $item['title']   = $article->find('div.title', 0)->plaintext;
        $item['intro']   = $article->find('div.intro', 0)->plaintext;
        $item['details'] = $article->find('div.details', 0)->plaintext;
        $articles[] = $item;
    }

    print_r($articles);

User · Answer

Why you shouldn't and when you should use regular expressions?

First off, a common misnomer: Regexps are not for "parsing" HTML. Regexes can however "extract" data. Extracting is what they're made for. The major drawback of regex HTML extraction over proper SGML toolkits or baseline XML parsers is the syntactic effort and varying reliability.

Consider that making a somewhat dependable HTML extraction regex:

    <a\s+class="?playbutton\d?[^>]+id="(\d+)".+?<a\s+class="[\w\s]*title[\w\s]*"[^>]+href="(http://[^">]+)"[^>]*>([^<>]+)</a>.+?

is way less readable than a simple phpQuery or QueryPath equivalent:

    $div->find(".stationcool a")->attr("title");

There are however specific use cases where they can help.

- Many DOM traversal frontends don't reveal HTML comments <!--, which however are sometimes the more useful anchors for extraction. In particular pseudo-HTML variations <$var> or SGML residues are easy to tame with regexps.
- Oftentimes regular expressions can save post-processing. However HTML entities often require manual caretaking.
- And lastly, for extremely simple tasks like extracting <img src= urls, they are in fact a probable tool. The speed advantage over SGML/XML parsers mostly just comes to play for these very basic extraction procedures.

It's sometimes even advisable to pre-extract a snippet of HTML using regular expressions /<!--CONTENT-->(.+?)<!--END-->/s and process the remainder using the simpler HTML parser frontends.

Note: I actually have this app, where I employ XML parsing and regular expressions alternatively. Just last week the PyQuery parsing broke, and the regex still worked. Yes weird, and I can't explain it myself. But so it happened. So please don't vote real-world considerations down, just because it doesn't match the regex=evil meme. But let's also not vote this up too much. It's just a sidenote for this topic.
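The "pre-extract with a regex, then parse the snippet" combination might look roughly like this. The page content is made up for illustration, and PHP's bundled DOM extension stands in for whichever parser frontend you prefer.

```php
<?php
// Hypothetical page; the CONTENT/END comment markers follow the example above.
$page = <<<HTML
<html><body>
<div id="nav"><img src="/logo.png"></div>
<!--CONTENT-->
<p>Hello <img src="/a.jpg" alt="A"> and <img src="/b.jpg"></p>
<!--END-->
<div id="footer"><img src="/tracker.gif"></div>
</body></html>
HTML;

$sources = [];

// Step 1: cut out the interesting snippet with a regular expression.
// The "s" modifier lets "." match newlines.
if (preg_match('/<!--CONTENT-->(.+?)<!--END-->/s', $page, $match)) {
    // Step 2: hand only that snippet to a real parser.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($match[1]);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('img') as $img) {
        $sources[] = $img->getAttribute('src');
    }
}

print_r($sources);
```

The regex only has to recognise two fixed comment markers, which is the kind of job it is reliable at, while attribute handling is left to the parser.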

User · Answer

For HTML5, html5lib has been abandoned for years now. The only HTML5 library I can find with recent updates and maintenance records is html5-php, which was just brought to beta 1.0 a little over a week ago.

User · Answer

If you're familiar with jQuery selectors, you can use ScarletsQuery for PHP.

    <?php
    include "ScarletsQuery.php";

    // Load the HTML content and parse it
    $html = file_get_contents('https://www.lipsum.com');
    $dom = Scarlets\Library\MarkupLanguage::parseText($html);

    // Select meta tag on the HTML header
    $description = $dom->selector('head meta[name="description"]')[0];

    // Get 'content' attribute value from meta tag
    print_r($description->attr('content'));

    $description = $dom->selector('#Content p');

    // Get element array
    print_r($description->view);

This library usually takes less than 1 second to process offline HTML. It also accepts invalid HTML or missing quotes on tag attributes.

User · Answer

I created a library named PHPPowertools/DOM-Query, which allows you to crawl HTML5 and XML documents just like you do with jQuery.

Under the hood, it uses symfony/DomCrawler for conversion of CSS selectors to XPath selectors. It always uses the same DomDocument, even when passing one object to another, to ensure decent performance.

Example use:

    namespace PowerTools;

    // Get file content
    $htmlcode = file_get_contents('https://github.com');

    // Define your DOMCrawler based on file string
    $H = new DOM_Query($htmlcode);

    // Define your DOMCrawler based on an existing DOM_Query instance
    $H = new DOM_Query($H->select('body'));

    // Passing a string (CSS selector)
    $s = $H->select('div.foo');

    // Passing an element object (DOM Element)
    $s = $H->select($documentBody);

    // Passing a DOM Query object
    $s = $H->select($H->select('p + p'));

    // Select the body tag
    $body = $H->select('body');

    // Combine different classes as one selector to get all site blocks
    $siteblocks = $body->select('.site-header, .masthead, .site-body, .site-footer');

    // Nest your methods just like you would with jQuery
    $siteblocks->select('button')->add('span')->addClass('icon icon-printer');

    // Use a lambda function to set the text of all site blocks
    $siteblocks->text(function ($i, $val) {
        return $i . " - " . $val->attr('class');
    });

    // Append the following HTML to all site blocks
    $siteblocks->append('<div class="site-center"></div>');

    // Use a descendant selector to select the site's footer
    $sitefooter = $body->select('.site-footer > .site-center');

    // Set some attributes for the site's footer
    $sitefooter->attr(array('id' => 'aweeesome', 'data-val' => 'see'));

    // Use a lambda function to set the attributes of all site blocks
    $siteblocks->attr('data-val', function ($i, $val) {
        return $i . " - " . $val->attr('class') . " - photo by Kelly Clark";
    });

    // Select the parent of the site's footer
    $sitefooterparent = $sitefooter->parent();

    // Remove the class of all i-tags within the site's footer's parent
    $sitefooterparent->select('i')->removeAttr('class');

    // Wrap the site's footer within two new selectors
    $sitefooter->wrap('<section><div class="footer-wrapper"></div></section>');

Supported methods:

- [x] $ (1)
- [x] $.parseHTML
- [x] $.parseXML
- [x] $.parseJSON
- [x] $selection.add
- [x] $selection.addClass
- [x] $selection.after
- [x] $selection.append
- [x] $selection.attr
- [x] $selection.before
- [x] $selection.children
- [x] $selection.closest
- [x] $selection.contents
- [x] $selection.detach
- [x] $selection.each
- [x] $selection.eq
- [x] $selection.empty (2)
- [x] $selection.find
- [x] $selection.first
- [x] $selection.get
- [x] $selection.insertAfter
- [x] $selection.insertBefore
- [x] $selection.last
- [x] $selection.parent
- [x] $selection.parents
- [x] $selection.remove
- [x] $selection.removeAttr
- [x] $selection.removeClass
- [x] $selection.text
- [x] $selection.wrap

(1) Renamed 'select', for obvious reasons.
(2) Renamed 'void', since 'empty' is a reserved word in PHP.

NOTE: The library also includes its own zero-configuration autoloader for PSR-0 compatible libraries. The example included should work out of the box without any additional configuration. Alternatively, you can use it with Composer.

User · Answer

Native XML Extensions

I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than all the 3rd party libs and give me all the control I need over the markup.

DOM

> The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C's Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.

DOM is capable of parsing and modifying real world (broken) HTML and it can do XPath queries. It is based on libxml.

It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language's DOM API then.

A basic usage example can be found in "Grabbing the href attribute of an A element" and a general conceptual overview can be found at "DOMDocument in php".

How to use the DOM extension has been covered extensively on Stack Overflow, so if you choose to use it, you can be sure most of the issues you run into can be solved by searching/browsing Stack Overflow.

XMLReader

> The XMLReader extension is an XML pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.

XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML Parser Module, so chances are using XMLReader for parsing broken HTML might be less robust than using DOM, where you can explicitly tell it to use libxml's HTML Parser Module.

A basic usage example can be found at "getting all values from h1 tags using php".

XML Parser

> This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust.

The XML Parser library is also based on libxml, and implements a SAX style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but will be more difficult to work with than the pull parser implemented by XMLReader.

SimpleXml

> The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.

SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXml because it will choke.

A basic usage example can be found at "A simple program to CRUD node and node values of xml file" and there are lots of additional examples in the PHP Manual.

3rd Party Libraries (libxml based)

If you prefer to use a 3rd-party lib, I'd suggest using a lib that actually uses DOM/libxml underneath instead of string parsing.

FluentDom

> FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in PHP. Selectors are written in XPath or CSS (using a CSS to XPath converter). Current versions extend the DOM implementing standard interfaces and add features from the DOM Living Standard. FluentDOM can load formats like JSON, CSV, JsonML, RabbitFish and others. Can be installed via Composer.

HtmlPageDom

> Wa72\HtmlPageDom is a PHP library for easy manipulation of HTML documents. It requires DomCrawler from Symfony2 components for traversing the DOM tree and extends it by adding methods for manipulating the DOM tree of HTML documents.

phpQuery (not updated for years)

> phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on the jQuery JavaScript Library, written in PHP5, and provides an additional Command Line Interface (CLI).

Also see: https://github.com/electrolinux/phpquery

Zend_Dom

> Zend_Dom provides tools for working with DOM documents and structures. Currently, we offer Zend_Dom_Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.

QueryPath

> QueryPath is a PHP library for manipulating XML and HTML. It is designed to work not only with local files, but also with web services and database resources. It implements much of the jQuery interface (including CSS-style selectors), but it is heavily tuned for server-side use. Can be installed via Composer.

fDOMDocument

> fDOMDocument extends the standard DOM to use exceptions at all occasions of errors instead of PHP warnings or notices. They also add various custom methods and shortcuts for convenience and to simplify the usage of DOM.

sabre/xml

> sabre/xml is a library that wraps and extends the XMLReader and XMLWriter classes to create a simple "xml to object/array" mapping system and design pattern. Writing and reading XML is single-pass and can therefore be fast and require low memory on large xml files.

FluidXML

> FluidXML is a PHP library for manipulating XML with a concise and fluent API. It leverages XPath and the fluent programming pattern to be fun and effective.

3rd-Party (not libxml-based)

The benefit of building upon DOM/libxml is that you get good performance out of the box because you are based on a native extension. However, not all 3rd-party libs go down this route. Some of them are listed below.

PHP Simple HTML DOM Parser

> - An HTML DOM parser written in PHP5+ that lets you manipulate HTML in a very easy way!
> - Requires PHP 5+.
> - Supports invalid HTML.
> - Find tags on an HTML page with selectors just like jQuery.
> - Extract contents from HTML in a single line.

I generally do not recommend this parser. The codebase is horrible and the parser itself is rather slow and memory hungry. Not all jQuery selectors (such as child selectors) are possible. Any of the libxml based libraries should outperform this easily.

PHP Html Parser

> PHPHtmlParser is a simple, flexible HTML parser which allows you to select tags using any CSS selector, like jQuery. The goal is to assist in the development of tools which require a quick, easy way to scrape HTML, whether it's valid or not. This project was originally supported by sunra/php-simple-html-dom-parser but the support seems to have stopped, so this project is my adaptation of his previous work.

Again, I would not recommend this parser. It is rather slow with high CPU usage. There is also no function to clear memory of created DOM objects. These problems scale particularly with nested loops. The documentation itself is inaccurate and misspelled, with no responses to fixes since 14 Apr 16.

Ganon

> - A universal tokenizer and HTML/XML/RSS DOM Parser
>   - Ability to manipulate elements and their attributes
>   - Supports invalid HTML and UTF8
> - Can perform advanced CSS3-like queries on elements (like jQuery -- namespaces supported)
> - A HTML beautifier (like HTML Tidy)
>   - Minify CSS and Javascript
>   - Sort attributes, change character case, correct indentation, etc.
> - Extensible
>   - Parsing documents using callbacks based on current character/token
>   - Operations separated in smaller functions for easy overriding
> - Fast and Easy

Never used it. Can't tell if it's any good.

HTML 5

You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you want to consider using a dedicated parser, like html5lib:

> Python and PHP implementations of an HTML parser based on the WHATWG HTML5 specification for maximum compatibility with major desktop web browsers.

We might see more dedicated parsers once HTML5 is finalized. There is also a W3 blog post titled "How-To for html 5 parsing" that is worth checking out.

WebServices

If you don't feel like programming PHP, you can also use web services. In general, I found very little utility for these, but that's just me and my use cases.

ScraperWiki

> ScraperWiki's external interface allows you to extract data in the form you want for use on the web or in your own applications. You can also extract information about the state of any scraper.

Regular Expressions

Last and least recommended, you can extract data from HTML with regular expressions. In general using regular expressions on HTML is discouraged.

Most of the snippets you will find on the web to match markup are brittle. In most cases they only work for a very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere, or adding or changing attributes in a tag, can make the RegEx fail when it's not properly written. You should know what you are doing before using RegEx on HTML.

HTML parsers already know the syntactical rules of HTML. Regular expressions have to be taught those rules for each new RegEx you write. RegEx are fine in some cases, but it really depends on your use case.

You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job on this.

Also see "Parsing Html The Cthulhu Way".

Books

If you want to spend some money, have a look at "PHP Architect's Guide to Webscraping with PHP". I am not affiliated with PHP Architect or the authors.
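To make the difference between the native parsers a bit more tangible, here is a minimal sketch of the SAX-style push parser (the XML Parser extension) described above. The sample feed and the variable names are made up for illustration. Note how the script has to track state itself, which is why this style is harder to work with than DOM or XMLReader.

```php
<?php
$xml = '<feed><entry><title>First</title></entry><entry><title>Second</title></entry></feed>';

$titles  = [];
$current = null; // text buffer while inside a <title>, null otherwise

$parser = xml_parser_create();
// Keep element names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Called for every opening tag.
    function ($parser, string $name, array $attrs) use (&$current) {
        if ($name === 'title') {
            $current = '';
        }
    },
    // Called for every closing tag.
    function ($parser, string $name) use (&$current, &$titles) {
        if ($name === 'title') {
            $titles[] = $current;
            $current  = null;
        }
    }
);

// Character data may arrive in several chunks, so append rather than assign.
xml_set_character_data_handler(
    $parser,
    function ($parser, string $data) use (&$current) {
        if ($current !== null) {
            $current .= $data;
        }
    }
);

xml_parse($parser, $xml, true);

print_r($titles);
```

Nothing is kept in memory except what the handlers choose to store, which is where the memory advantage over DOM and SimpleXML comes from.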

User · Answer

phpQuery and QueryPath are extremely similar in replicating the fluent jQuery API. That's also why they're two of the easiest approaches to properly parse HTML in PHP.

Examples for QueryPath

Basically you first create a queryable DOM tree from an HTML string:

    $qp = qp("<html><body><h1>title</h1>..."); // or give filename or URL

The resulting object contains a complete tree representation of the HTML document. It can be traversed using DOM methods. But the common approach is to use CSS selectors like in jQuery:

    $qp->find("div.classname")->children()->...;

    foreach ($qp->find("p img") as $img) {
        print qp($img)->attr("src");
    }

Mostly you want to use simple #id and .class or DIV tag selectors for ->find(). But you can also use XPath statements, which sometimes are faster. Also typical jQuery methods like ->children() and ->text() and particularly ->attr() simplify extracting the right HTML snippets. (And already have their SGML entities decoded.)

    $qp->xpath("//div/p[1]");  // get first paragraph in a div

QueryPath also allows injecting new tags into the stream (->append), and later output and prettify an updated document (->writeHTML). It can not only parse malformed HTML, but also various XML dialects (with namespaces), and even extract data from HTML microformats (XFN, vCard).

    $qp->find("a[target=_blank]")->toggleClass("usability-blunder");

phpQuery or QueryPath?

Generally QueryPath is better suited for manipulation of documents. While phpQuery also implements some pseudo AJAX methods (just HTTP requests) to more closely resemble jQuery. It is said that phpQuery is often faster than QueryPath (because of fewer overall features).

For further information on the differences see this comparison on the wayback machine from tagbyte.org. (Original source went missing, so here's an internet archive link. Yes, you can still locate missing pages, people.)

And here's a comprehensive QueryPath introduction.

Advantages

- Simplicity and Reliability
- Simple to use alternatives ->find("a img, a object, div a")
- Proper data unescaping (in comparison to regular expression grepping)

User · Answer

The Symfony framework has bundles which can parse HTML, and you can use CSS-style selectors to select DOM nodes instead of using XPath.

User · Answer

There are several reasons not to parse HTML with regular expressions. But if you have total control over what HTML will be generated, then you can get by with a simple regular expression.

Below is a function that parses HTML by regular expression. Note that this function is very sensitive and demands that the HTML obey certain rules, but it works very well in many scenarios. If you want a simple parser and don't want to install libraries, give this a shot:

    // Like array_combine(), but collects values of duplicate keys into arrays
    function array_combine_($keys, $values) {
        $result = array();
        foreach ($keys as $i => $k) {
            $result[$k][] = $values[$i];
        }
        // Unwrap single-element arrays (closure replaces the removed create_function())
        array_walk($result, function (&$v) {
            $v = (count($v) == 1) ? array_pop($v) : $v;
        });
        return $result;
    }

    // Recursively turn <tag>content</tag> pairs into a nested array
    function extract_data($str) {
        return (is_array($str))
            ? array_map('extract_data', $str)
            : ((!preg_match_all('#<([A-Za-z0-9_]*)[^>]*>(.*?)</\1>#s', $str, $matches))
                ? $str
                : array_map('extract_data', array_combine_($matches[1], $matches[2])));
    }

    print_r(extract_data(file_get_contents("http://www.google.com/")));

User · Answer

JSON and array from XML in three lines:

    $xml = simplexml_load_string($xml_string);
    $json = json_encode($xml);
    $array = json_decode($json, TRUE);

Ta da!
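For anyone wondering what the resulting array looks like, here is a self-contained run of the same three calls on a made-up XML string. Two things worth knowing: attributes end up under an "@attributes" key, and repeated child elements are collected into a list.

```php
<?php
// Hypothetical input, chosen to show an attribute and a repeated element.
$xml_string = '<book id="1"><title>PHP</title><tag>a</tag><tag>b</tag></book>';

$xml   = simplexml_load_string($xml_string);
$json  = json_encode($xml);
$array = json_decode($json, true);

print_r($array);
```

A consequence of the list behaviour is that an element occurring once yields a plain string while the same element occurring twice yields an array, so code consuming the result has to handle both shapes.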

User · Answer

I've created a library called HTML5DOMDocument that is freely available at https://github.com/ivopetkov/html5-dom-document-php

It supports query selectors too, which I think will be extremely helpful in your case. Here is some example code:

    $dom = new IvoPetkov\HTML5DOMDocument();
    $dom->loadHTML('<!DOCTYPE html><html><body><h1>Hello</h1><div class="content">This is some text</div></body></html>');
    echo $dom->querySelector('h1')->innerHTML;

User · Answer

QueryPath is good, but be careful of "tracking state", because if you don't realise what it means, you can waste a lot of debugging time trying to find out what happened and why the code doesn't work.

What it means is that each call on the result set modifies the result set in the object. It's not chainable like in jQuery, where each link is a new set: you have a single set, which is the results from your query, and each function call modifies that single set.

In order to get jQuery-like behaviour, you need to branch before you do a filter/modify-like operation. That means it'll mirror what happens in jQuery much more closely.

    $results = qp("div p");
    $forename = $results->find("input[name='forename']");

$results now contains the result set for input[name='forename'], NOT the original query "div p". This tripped me up a lot. What I found was that QueryPath tracks the filters and finds and everything which modifies your results, and stores them in the object. You need to do this instead:

    $forename = $results->branch()->find("input[name='forename']");

Then $results won't be modified and you can reuse the result set again and again. Perhaps somebody with much more knowledge can clear this up a bit, but it's basically like this from what I've found.

User · Answer

Third-party alternatives to SimpleHtmlDom that use DOM instead of string parsing: phpQuery, Zend_Dom, QueryPath and FluentDom.
