Converting HTML to XML

Question

I have hundreds of HTML files that need to be converted to XML. We use these HTML files to serve content for applications, but now we have to serve this content as XML. The HTML files contain tables, divs, images, p tags, b or strong tags, etc. I googled and found some applications, but I couldn't get them to work. Could you suggest a way to convert the contents of these files to XML?

User · Answer

Remember that HTML and XML are two distinct concepts in the tree of markup languages. You can't exactly replace HTML with XML. XML can be viewed as a generalized form of HTML, but even that is imprecise. You mainly use HTML to display data, and XML to carry (or store) the data.

This link is helpful: How to read HTML as XML?

More here - difference between HTML and XML

User · Answer

I did find a way to convert (even bad) HTML into well-formed XML. I started by basing this on the DOM loadHTML function. However, over time several issues occurred, and I optimized it and added patches to correct side effects.

    // Note: these functions use $this and "protected", so they are meant to
    // live inside a class; $this->encoding is a property of that class.

    function tryToXml($dom, $content) {
        if (!$content) return false;

        // well-formed XML content can be loaded directly as an XML node tree
        $fragment = $dom->createDocumentFragment();

        // appendXML adds an XML string directly into the node tree, but it
        // fails on an XML declaration, so skip the declaration when present
        if (substr($content, 0, 5) == '<?xml') {
            $content = substr($content, strpos($content, '>') + 1);
            if (strpos($content, '<')) {
                $content = substr($content, strpos($content, '<'));
            }
        }

        // if appendXML does not work, fall back to htmlToXml() below
        // for nasty HTML correction
        if (!@$fragment->appendXML($content)) {
            return $this->htmlToXml($dom, $content);
        }

        return $fragment;
    }

    // convert content into XML
    // $dom is only needed to prepare the XML which will be returned
    function htmlToXml($dom, $content, $needEncoding = false, $bodyOnly = true) {
        // no XML when the HTML is empty
        if (!$content) return false;

        // real content, and possibly it needs encoding
        if ($needEncoding) {
            // no need to convert the character encoding, as loadHTML
            // respects the content-type (only)
            $content = '<meta http-equiv="Content-Type" content="text/html;charset=' . $this->encoding . '">' . $content;
        }

        // build a DOM from the content
        $domInject = new DOMDocument("1.0", "UTF-8");
        $domInject->preserveWhiteSpace = false;
        $domInject->formatOutput = true;

        // HTML type
        try {
            @$domInject->loadHTML($content);
        } catch (Exception $e) {
            // do nothing and continue: warnings are normal on nasty HTML content
        }

        // to check the encoding: echo $dom->encoding
        $this->reworkDom($domInject);

        if ($bodyOnly) {
            $fragment = $dom->createDocumentFragment();

            // retrieve the nodes within /html/body
            foreach ($domInject->documentElement->childNodes as $elementLevel1) {
                if ($elementLevel1->nodeName == 'body' and $elementLevel1->nodeType == XML_ELEMENT_NODE) {
                    foreach ($elementLevel1->childNodes as $elementInject) {
                        $fragment->insertBefore($dom->importNode($elementInject, true));
                    }
                }
            }
        } else {
            $fragment = $dom->importNode($domInject->documentElement, true);
        }

        return $fragment;
    }

    protected function reworkDom($node, $level = 0) {
        // start with the first child node to iterate
        $nodeChild = $node->firstChild;

        while ($nodeChild) {
            // remember the sibling first, because the child may be removed below
            $nodeNextChild = $nodeChild->nextSibling;

            switch ($nodeChild->nodeType) {
                case XML_ELEMENT_NODE:
                    // iterate through child element nodes
                    $this->reworkDom($nodeChild, $level + 1);
                    break;
                case XML_TEXT_NODE:
                case XML_CDATA_SECTION_NODE:
                    // do nothing with text, cdata
                    break;
                case XML_COMMENT_NODE:
                    // remove the - sign from comments so that they
                    // follow the W3C guideline
                    $nodeChild->nodeValue = str_replace("-", "_", $nodeChild->nodeValue);
                    break;
                case XML_DOCUMENT_TYPE_NODE: // 10: needs to be removed
                case XML_PI_NODE:            // 7: remove PI
                    $node->removeChild($nodeChild);
                    $nodeChild = null; // make null to test later
                    break;
                case XML_DOCUMENT_NODE:
                case XML_HTML_DOCUMENT_NODE:
                    // should not appear as it's always the root; listed
                    // just to be complete, but generates an exception
                default:
                    throw new Exception("Engine: reworkDom type not declared [" . $nodeChild->nodeType . "]");
            }
            $nodeChild = $nodeNextChild;
        }
    }

This also allows adding several HTML pieces into one XML document, which I needed myself. In general it can be used like this:

    $c = '<p>test<font>two</p>';
    $dom = new DOMDocument('1.0', 'UTF-8');

    // make a root element
    $n = $dom->appendChild($dom->createElement('info'));

    // call tryToXml as a method when the functions above live in a class
    if ($valueXml = tryToXml($dom, $c)) {
        $n->appendChild($valueXml);
    }
    echo '<pre>' . htmlentities($dom->saveXml($n)) . '</pre>';

In this example '<p>test<font>two</p>' is output as well-formed XML: '<info><p>test<font>two</font></p></info>'. The info root tag is added because it also allows converting '<p>one</p><p>two</p>', which is not XML on its own as it does not have a single root element. However, if your HTML is sure to have one root element, the extra root <info> tag can be skipped.

With this I'm getting really nice XML out of unstructured and even corrupted HTML.

I hope it's reasonably clear and that it helps other people.

User · Answer

I was successful using the tidy command line utility. On Linux I installed it quickly with apt-get install tidy. Then the command:

    tidy -q -asxml --numeric-entities yes source.html > file.xml

gave an XML file, which I was able to process with an XSLT processor. However, I needed to set up the XHTML1 DTDs correctly.

This is their homepage: html-tidy.org (and the legacy one: HTML Tidy).
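Since the question is about hundreds of files, the tidy command above could be wrapped in a loop. The following is only a minimal sketch: the function name convert_dir, the TIDY_BIN variable, the tidy.log file name and the directory names are illustrative assumptions, not part of tidy. It relies on tidy's exit status (0 for success, 1 for warnings, 2 for errors) and treats only status 2 as a failure.

```shell
#!/bin/sh
# Batch-convert every .html file in one directory to XHTML with HTML Tidy.
# TIDY_BIN, convert_dir and tidy.log are hypothetical names for this sketch.
TIDY_BIN="${TIDY_BIN:-tidy}"

convert_dir() {
    src_dir="$1"
    out_dir="$2"
    failed=0
    mkdir -p "$out_dir" || return 1

    for src in "$src_dir"/*.html; do
        # skip the unexpanded pattern when the directory has no .html files
        [ -e "$src" ] || continue

        name=$(basename "$src" .html)
        out="$out_dir/$name.xml"

        # same options as the single-file command; warnings go to a log file
        status=0
        "$TIDY_BIN" -q -asxml --numeric-entities yes "$src" \
            > "$out" 2>> "$out_dir/tidy.log" || status=$?

        # tidy exits with 0 (clean), 1 (warnings) or 2 (errors)
        if [ "$status" -ge 2 ]; then
            echo "failed: $src" >&2
            rm -f "$out"
            failed=$((failed + 1))
        fi
    done

    # succeed only if no file failed
    [ "$failed" -eq 0 ]
}

# Example call (directory names are placeholders):
# convert_dir ./html ./xml
```

Files that tidy cannot repair are reported on standard error and their empty output is removed, so they can be fixed by hand or passed to another converter afterwards.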
