[node.js] How do I parse a HTML page with Node.js

I need to parse (server side) big amounts of HTML pages.
We all agree that regexp is not the way to go here.
It seems to me that javascript is the native way of parsing a HTML page, but that assumption relies on the server side code having all the DOM ability javascript has inside a browser.

Does Node.js have that ability built in?
Is there a better approach to this problem, parsing HTML on the server side?

This question is related to node.js html-parsing server-side

The answer is


Use Cheerio. It isn't as strict as jsdom and is optimized for scraping. As a bonus, uses the jQuery selectors you already know.

? Familiar syntax: Cheerio implements a subset of core jQuery. Cheerio removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly gorgeous API.

? Blazingly fast: Cheerio works with a very simple, consistent DOM model. As a result parsing, manipulating, and rendering are incredibly efficient. Preliminary end-to-end benchmarks suggest that cheerio is about 8x faster than JSDOM.

? Insanely flexible: Cheerio wraps around @FB55's forgiving htmlparser. Cheerio can parse nearly any HTML or XML document.


Use htmlparser2, its way faster and pretty straightforward. Consult this usage example:

https://www.npmjs.org/package/htmlparser2#usage

And the live demo here:

http://demos.forbeslindesay.co.uk/htmlparser2/


November 2020 Update

I searched for the top NodeJS html parser libraries.

Because my use cases didn't require a library with many features, I could focus on stability and performance.

By stability I mean that I want the library to be used long enough by the community in order to find bugs and that it will be still maintained and that open issues will be closed.

Its hard to understand the future of an open source library, but I did a small summary based on the top 10 libraries in openbase.

I divided into 2 groups according to the last commit (and on each group the order is according to Github starts):

Last commit is in the last 6 months:

jsdom - Last commit: 3 Months, Open issues: 331, Github stars: 14.9K.

htmlparser2 - Last commit: 8 days, Open issues: 2, Github stars: 2.7K.

parse5 - Last commit: 2 Months, Open issues: 21, Github stars: 2.5K.

swagger-parser - Last commit: 2 Months, Open issues: 48, Github stars: 663.

html-parse-stringify - Last commit: 4 Months, Open issues: 3, Github stars: 215.

node-html-parser - Last commit: 7 days, Open issues: 15, Github stars: 205.

Last commit is 6 months and above:

cheerio - Last commit: 1 year, Open issues: 174, Github stars: 22.9K.

koa-bodyparser - Last commit: 6 months, Open issues: 9, Github stars: 1.1K.

sax-js - Last commit: 3 Years, Open issues: 65, Github stars: 941.

draftjs-to-html - Last commit: 1 Year, Open issues: 27, Github stars: 233.


I picked Node-html-parser because it seems quiet fast and very active at this moment.

(*) Openbase adds much more information regarding each library like the number of contributors (with +3 commits), weekly downloads, Monthly commits, Version etc'.

(**) The table above is a snapshot according to the specific time and date - I would check the reference again and as a first step check the level of recent activity and then dive into the smaller details.


Htmlparser2 by FB55 seems to be a good alternative.


jsdom is too strict to do any real screen scraping sort of things, but beautifulsoup doesn't choke on bad markup.

node-soupselect is a port of python's beautifulsoup into nodejs, and it works beautifully


Examples related to node.js

Hide Signs that Meteor.js was Used Querying date field in MongoDB with Mongoose SyntaxError: Cannot use import statement outside a module Server Discovery And Monitoring engine is deprecated How to fix ReferenceError: primordials is not defined in node UnhandledPromiseRejectionWarning: This error originated either by throwing inside of an async function without a catch block dyld: Library not loaded: /usr/local/opt/icu4c/lib/libicui18n.62.dylib error running php after installing node with brew on Mac internal/modules/cjs/loader.js:582 throw err DeprecationWarning: Buffer() is deprecated due to security and usability issues when I move my script to another server Please run `npm cache clean`

Examples related to html-parsing

PHP: HTML: send HTML select option attribute in POST Read a HTML file into a string variable in memory Parsing HTML using Python Parse an HTML string with JS HTML Text with tags to formatted text in an Excel cell How do I parse a HTML page with Node.js Regex select all text between tags How to extract string following a pattern with grep, regex or perl How to strip HTML tags from string in JavaScript? How do you parse and process HTML/XML in PHP?

Examples related to server-side

How do I parse a HTML page with Node.js Send message to specific client with socket.io and node.js Do copyright dates need to be updated? Using Excel OleDb to get sheet names IN SHEET ORDER