How do I parse a HTML page with Node.js?

Question

I need to parse large numbers of HTML pages on the server side. We all agree that regexp is not the way to go here. It seems to me that JavaScript is the native way of parsing an HTML page, but that assumption relies on the server-side code having all the DOM ability JavaScript has inside a browser. Does Node.js have that ability built in? Is there a better approach to this problem of parsing HTML on the server side?

User · Accepted Answer

You can use the npm modules jsdom and htmlparser to create and parse a DOM in Node.js.

Other options include:

BeautifulSoup for Python
Converting your HTML to XHTML and using XSLT
HTMLAgilityPack for .NET
CsQuery for .NET (my new favorite)
The SpiderMonkey and Rhino JS engines have native E4X support. This may be useful, but only if you convert your HTML to XHTML.

Out of all these options, I prefer the Node.js option, because it uses the standard W3C DOM accessor methods and I can reuse code on both the client and the server. I wish BeautifulSoup's methods were more similar to the W3C DOM, and I think converting your HTML to XHTML to write XSLT is just plain sadistic.
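To illustrate the code-reuse point, here is a minimal sketch, assuming jsdom is installed (`npm install jsdom`) and exports the `JSDOM` constructor. The helper names `extractLinks` and `extractLinksFromHtml` are hypothetical and not part of jsdom. The first function only touches standard DOM methods, so the same code could be handed `window.document` in a browser.

```javascript
// Hypothetical helper: uses only standard W3C DOM methods, so it works
// with a browser document or a jsdom document alike.
function extractLinks(document) {
  return Array.from(document.querySelectorAll('a[href]'))
    .map((a) => a.getAttribute('href'));
}

// Server-side entry point. jsdom is loaded lazily here so that the helper
// above carries no dependency on it.
function extractLinksFromHtml(html) {
  const { JSDOM } = require('jsdom'); // assumes `npm install jsdom`
  const dom = new JSDOM(html);
  return extractLinks(dom.window.document);
}
```

On the server you would call `extractLinksFromHtml(pageSource)`; in a browser, `extractLinks(document)` runs the same logic with no changes.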

User · Answer

jsdom is too strict to do any real screen-scraping sort of things, but BeautifulSoup doesn't choke on bad markup. node-soupselect is a port of Python's BeautifulSoup into Node.js, and it works beautifully.

User · Answer

Use htmlparser2; it's way faster and pretty straightforward. Consult this usage example: https://www.npmjs.org/package/htmlparser2#usage
And the live demo here: http://demos.forbeslindesay.co.uk/htmlparser2/
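For a sense of what the callback style looks like, here is a minimal sketch, assuming htmlparser2 is installed (`npm install htmlparser2`) and exports a `Parser` class that accepts a handler object with `onopentag`. The helper names `createLinkCollector` and `collectLinks` are hypothetical.

```javascript
// Hypothetical helper: builds a handler object with the callback name
// htmlparser2's Parser looks for, and collects every <a href="..."> value.
function createLinkCollector() {
  const links = [];
  const handler = {
    onopentag(name, attribs) {
      if (name === 'a' && attribs.href) links.push(attribs.href);
    },
  };
  return { handler, links };
}

// Feeds a complete HTML string through the parser and returns the links.
function collectLinks(html) {
  const { Parser } = require('htmlparser2'); // assumes `npm install htmlparser2`
  const { handler, links } = createLinkCollector();
  const parser = new Parser(handler);
  parser.write(html); // may be called repeatedly with chunks of a stream
  parser.end();
  return links;
}
```

Because the parser reports tags as it reads them instead of building a full DOM first, this style can suit large batches of pages where only a few values per page are needed.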

User · Answer

Use Cheerio. It isn't as strict as jsdom and is optimized for scraping. As a bonus, it uses the jQuery selectors you already know.

From the Cheerio README:

Familiar syntax: Cheerio implements a subset of core jQuery. Cheerio removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly gorgeous API.

Blazingly fast: Cheerio works with a very simple, consistent DOM model. As a result parsing, manipulating, and rendering are incredibly efficient. Preliminary end-to-end benchmarks suggest that cheerio is about 8x faster than JSDOM.

Insanely flexible: Cheerio wraps around FB55's forgiving htmlparser. Cheerio can parse nearly any HTML or XML document.
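As a rough illustration of the jQuery-style selectors, here is a minimal sketch, assuming Cheerio is installed (`npm install cheerio`) and exposes `load`. The helper names `listLinks` and `listLinksFromHtml` are hypothetical.

```javascript
// Hypothetical helper: takes a loaded Cheerio function ($) and returns the
// text and href of every link, using jQuery-style selectors.
function listLinks($) {
  return $('a[href]')
    .map((i, el) => ({ text: $(el).text().trim(), href: $(el).attr('href') }))
    .get(); // .get() converts the Cheerio collection into a plain array
}

// Parses an HTML string with Cheerio and extracts its links.
function listLinksFromHtml(html) {
  const cheerio = require('cheerio'); // assumes `npm install cheerio`
  return listLinks(cheerio.load(html));
}
```

Anyone who has written `$('a').map(...)` in a browser should find this familiar, which is the main appeal described above.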

User · Answer

htmlparser2, by FB55, seems to be a good alternative.

User · Answer

November 2020 Update

I searched for the top Node.js HTML parser libraries. Because my use cases didn't require a library with many features, I could focus on stability and performance. By stability I mean that the library has been used by the community long enough for bugs to be found, that it is still maintained, and that open issues get closed.

It's hard to predict the future of an open-source library, but I put together a small summary based on the top 10 libraries on Openbase. I divided them into two groups according to the last commit, and within each group the order is by GitHub stars.

Last commit within the last 6 months:

jsdom - Last commit: 3 months, Open issues: 331, GitHub stars: 14.9K
htmlparser2 - Last commit: 8 days, Open issues: 2, GitHub stars: 2.7K
parse5 - Last commit: 2 months, Open issues: 21, GitHub stars: 2.5K
swagger-parser - Last commit: 2 months, Open issues: 48, GitHub stars: 663
html-parse-stringify - Last commit: 4 months, Open issues: 3, GitHub stars: 215
node-html-parser - Last commit: 7 days, Open issues: 15, GitHub stars: 205

Last commit 6 months ago or more:

cheerio - Last commit: 1 year, Open issues: 174, GitHub stars: 22.9K
koa-bodyparser - Last commit: 6 months, Open issues: 9, GitHub stars: 1.1K
sax-js - Last commit: 3 years, Open issues: 65, GitHub stars: 941
draftjs-to-html - Last commit: 1 year, Open issues: 27, GitHub stars: 233

I picked node-html-parser because it seems quite fast and very active at this moment.

Openbase adds much more information about each library, such as the number of contributors (with 3+ commits), weekly downloads, monthly commits, version, etc.

The table above is a snapshot taken at a specific time and date. I would check the reference again and, as a first step, check the level of recent activity and then dive into the smaller details.
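For reference, basic use of node-html-parser might look like the sketch below. It assumes the package is installed (`npm install node-html-parser`) and exports a `parse` function whose result supports `querySelector`, `querySelectorAll`, `getAttribute`, and a `text` property. The helper names `summarize` and `summarizeHtml` are hypothetical.

```javascript
// Hypothetical helper: works on the root node returned by node-html-parser
// and pulls out the first <h1> text plus every link target.
function summarize(root) {
  const heading = root.querySelector('h1');
  return {
    heading: heading ? heading.text.trim() : null,
    links: root
      .querySelectorAll('a')
      .map((a) => a.getAttribute('href'))
      .filter(Boolean), // drop anchors without an href
  };
}

// Parses an HTML string and summarizes it.
function summarizeHtml(html) {
  const { parse } = require('node-html-parser'); // assumes `npm install node-html-parser`
  return summarize(parse(html));
}
```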
