[html] HTML-parser on Node.js

Is there something like Ruby's nokogiri on nodejs? I mean a user-friendly HTML-parser.

I'd seen on Node.js modules page some parsers, but I can't find something pretty and fresh.

This question is related to html parsing node.js nokogiri

The answer is


If you want to build DOM you can use jsdom.

There's also cheerio, it has the jQuery interface and it's a lot faster than older versions of jsdom, although these days they are similar in performance.

You might wanna have a look at htmlparser2, which is a streaming parser, and according to its benchmark, it seems to be faster than others, and no DOM by default. It can also produce a DOM, as it is also bundled with a handler that creates a DOM. This is the parser that is used by cheerio.

parse5 also looks like a good solution. It's fairly active (11 days since the last commit as of this update), WHATWG-compliant, and is used in jsdom, Angular, and Polymer.

And if you want to parse HTML for web scraping, you can use YQL1. There is a node module for it. YQL I think would be the best solution if your HTML is from a static website, since you are relying on a service, not your own code and processing power. Though note that it won't work if the page is disallowed by the robot.txt of the website, YQL won't work with it.

If the website you're trying to scrape is dynamic then you should be using a headless browser like phantomjs. Also have a look at casperjs, if you're considering phantomjs. And you can control casperjs from node with SpookyJS.

Beside phantomjs there's zombiejs. Unlike phantomjs that cannot be embedded in nodejs, zombiejs is just a node module.

There's a nettuts+ toturial for the latter solutions.


1 Since Aug. 2014, YUI library, which is a requirement for YQL, is no longer actively maintained, source


You can also take a look at x-ray: https://github.com/lapwinglabs/x-ray


Try https://github.com/tmpvar/jsdom - you give it some HTML and it gives you a DOM.


Examples related to html

Embed ruby within URL : Middleman Blog Please help me convert this script to a simple image slider Generating a list of pages (not posts) without the index file Why there is this "clear" class before footer? Is it possible to change the content HTML5 alert messages? Getting all files in directory with ajax DevTools failed to load SourceMap: Could not load content for chrome-extension How to set width of mat-table column in angular? How to open a link in new tab using angular? ERROR Error: Uncaught (in promise), Cannot match any routes. URL Segment

Examples related to parsing

Got a NumberFormatException while trying to parse a text file for objects Uncaught SyntaxError: Unexpected end of JSON input at JSON.parse (<anonymous>) Python/Json:Expecting property name enclosed in double quotes Correctly Parsing JSON in Swift 3 How to get response as String using retrofit without using GSON or any other library in android UIButton action in table view cell "Expected BEGIN_OBJECT but was STRING at line 1 column 1" How to convert an XML file to nice pandas dataframe? How to extract multiple JSON objects from one file? How to sum digits of an integer in java?

Examples related to node.js

Hide Signs that Meteor.js was Used Querying date field in MongoDB with Mongoose SyntaxError: Cannot use import statement outside a module Server Discovery And Monitoring engine is deprecated How to fix ReferenceError: primordials is not defined in node UnhandledPromiseRejectionWarning: This error originated either by throwing inside of an async function without a catch block dyld: Library not loaded: /usr/local/opt/icu4c/lib/libicui18n.62.dylib error running php after installing node with brew on Mac internal/modules/cjs/loader.js:582 throw err DeprecationWarning: Buffer() is deprecated due to security and usability issues when I move my script to another server Please run `npm cache clean`

Examples related to nokogiri

HTML-parser on Node.js