Which HTML Parser is the best?

Question

I code a lot of parsers. Up until now, I was using the HtmlUnit headless browser for both parsing and browser automation. Now I want to separate the two tasks. As 80% of my work involves just parsing, I want to use a lightweight HTML parser, because in HtmlUnit it takes a long time to first load a page, then get the source, and then parse it. I want to know which HTML parser is the best. A parser that behaves much like the HtmlUnit parser would be preferable.

EDIT: By "best", I mean at least the following features:

- Speed
- Ease of locating any HtmlElement by its "id", "name" or "tag type"

It is fine if the parser doesn't clean up dirty HTML; I don't need to clean any HTML source. I just need the easiest way to move across HtmlElements and harvest data from them.

User · Answer

The best I've seen so far is HtmlCleaner:

"HtmlCleaner is open-source HTML parser written in Java. HTML found on Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring the order to tags, attributes and ordinary text. For the given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows similar rules that the most of web browsers use in order to create Document Object Model. However, user may provide custom tag and rule set for tag filtering and balancing."

With HtmlCleaner you can locate any element using XPath.

For other HTML parsers, see this SO question.
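As a rough illustration of the XPath lookup mentioned above, here is a minimal sketch. It assumes the HtmlCleaner jar is on the classpath; the class name `HtmlCleanerXPathSketch`, the helper `firstText`, and the sample markup are made up for this example. HtmlCleaner ships its own XPath evaluator, which covers only part of XPath, so more elaborate expressions may not work.

```java
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class HtmlCleanerXPathSketch {

    // Hypothetical helper: cleans the HTML, evaluates the XPath against the
    // resulting tree, and returns the trimmed text of the first matching
    // element, or null when nothing matches.
    static String firstText(String html, String xpath) throws Exception {
        TagNode root = new HtmlCleaner().clean(html);
        Object[] matches = root.evaluateXPath(xpath);
        if (matches.length == 0 || !(matches[0] instanceof TagNode)) {
            return null;
        }
        return ((TagNode) matches[0]).getText().toString().trim();
    }

    public static void main(String[] args) throws Exception {
        // Sample input invented for this sketch; the trailing <p> is left unclosed.
        String html = "<html><body><div id='intro'>Hello</div><p>unclosed paragraph</body></html>";
        System.out.println(firstText(html, "//div[@id='intro']"));
    }
}
```

The helper returns null instead of throwing when nothing matches, which tends to be convenient when harvesting optional fields from many pages.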

User · Answer

I suggest Validator.nu's parser, which is based on the HTML5 parsing algorithm. It is the parser used in Mozilla since 2010-05-03.

User · Answer

Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.

Its party trick is a CSS selector syntax to find elements, e.g.:

    String html = "<html><head><title>First parse</title></head>"
      + "<body><p>Parsed HTML into a doc.</p></body></html>";
    Document doc = Jsoup.parse(html);
    Elements links = doc.select("a");
    Element head = doc.select("head").first();

See the Selector javadoc for more info.

This is a new project, so any ideas for improvement are very welcome!
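The question asks specifically about locating elements by id, name, or tag type. A minimal sketch of how those three lookups might look with jsoup follows, assuming the jsoup jar is on the classpath. The class name `JsoupLookupSketch`, its helper methods, and the sample markup are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupLookupSketch {

    // Sample markup invented for this sketch.
    static final String HTML = "<html><body>"
            + "<form><input id='q' name='query' value='jsoup'></form>"
            + "<p>first</p><p>second</p>"
            + "</body></html>";

    static Document sample() {
        return Jsoup.parse(HTML);
    }

    // Lookup by id: returns the element's "value" attribute, or null if the id is absent.
    static String valueById(Document doc, String id) {
        Element el = doc.getElementById(id);
        return el == null ? null : el.attr("value");
    }

    // Lookup by name attribute via a CSS attribute selector: returns the match count.
    static int countByName(Document doc, String name) {
        return doc.select("[name=" + name + "]").size();
    }

    // Lookup by tag type: returns the text of the first element with that tag, or null.
    static String firstTextByTag(Document doc, String tag) {
        Element el = doc.getElementsByTag(tag).first();
        return el == null ? null : el.text();
    }

    public static void main(String[] args) {
        Document doc = sample();
        System.out.println(valueById(doc, "q"));
        System.out.println(countByName(doc, "query"));
        System.out.println(firstTextByTag(doc, "p"));
    }
}
```

The name lookup concatenates the name straight into the selector, which is fine for simple identifiers; names containing spaces or quotes would need quoting inside the selector.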
