Web scraping with Java

Question

I'm not able to find any good Java-based web scraping API. The site I need to scrape does not provide an API either. I want to iterate over all web pages using some pageID and extract the HTML titles and other content from their DOM trees. Are there ways other than web scraping?

User · Answer

Your best bet is to use Selenium WebDriver, since it:

Provides visual feedback to the coder (you can see your scraping in action and see where it stops).
Is accurate and consistent, as it directly controls the browser you use.
Is slow: it doesn't hit web pages as fast as HtmlUnit does, but sometimes you don't want to hit a site too fast.

HtmlUnit is fast but is horrible at handling JavaScript and AJAX.

User · Answer

jsoup

Extracting the title is not difficult, and you have many options; search here on Stack Overflow for "Java HTML parsers". One of them is jsoup. You can navigate the page using the DOM if you know the page structure; see http://jsoup.org/cookbook/extracting-data/dom-navigation. It's a good library and I've used it in my last projects.
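
As a rough illustration of the task in the question (looping over page IDs and pulling out each title), here is a minimal sketch that uses only JDK classes (Java 11+), so it runs without extra dependencies. It is not jsoup's API: the regular expression is a crude stand-in for a real HTML parser such as jsoup, and the URL template, class name, and method names are all hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TitleScraper {

    // Matches the first <title>...</title> pair, ignoring case and spanning line breaks.
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Builds the URL for one page ID from a format template (the template is an assumption). */
    static String pageUrl(String template, int pageId) {
        return String.format(template, pageId);
    }

    /** Returns the trimmed title text, or an empty Optional if no title element is found. */
    static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    /** Downloads one page and returns its body as a string. */
    static String fetch(HttpClient client, String url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        String template = "https://example.com/page?id=%d"; // hypothetical site layout
        HttpClient client = HttpClient.newHttpClient();
        for (int id = 1; id <= 3; id++) {
            String html = fetch(client, pageUrl(template, id));
            System.out.println(id + ": " + extractTitle(html).orElse("(no title)"));
            Thread.sleep(1000); // pause between requests so the site is not hit too fast
        }
    }
}
```

For anything beyond the title (nested elements, attributes, malformed markup), a regex quickly becomes fragile, which is where a DOM-based parser like jsoup earns its place.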

User · Answer

Normally I use Selenium, which is software for test automation. You can control a browser through a WebDriver, so you will not have problems with JavaScript, and it is usually not easily detected if you use the full version. Headless browsers can be identified more easily.

User · Answer

You might look into jwht-scrapper. This is a complete scraping framework that has all the features a developer could expect from a web scraper:

Proxy support
Warning sign support to detect captchas and more
Complex link-following features
Multithreading
Various scraping delays when required
Rotating User-Agent
Request auto-retry and HTTP redirection support
HTTP headers, cookies and more
GET and POST support
Annotation configuration
Detailed scraping metrics
Async handling of the scraper client
jwht-htmltopojo, a fully featured framework to map HTML to POJOs
Custom input format handling and built-in JSON -> POJO mapping
Full exception handling control
Detailed logging with log4j
POJO injection
Custom processing hooks
Easy-to-use and well-documented API

It works with the jwht-htmltopojo (https://github.com/whimtrip/jwht-htmltopojo) lib, which itself uses jsoup, mentioned by several other people here. Together they will help you build awesome scrapers that map HTML directly to POJOs and bypass any classical scraping problems in only a matter of minutes.

Hope this might help some people here.

Disclaimer: I am the one who developed it; feel free to let me know your remarks.

User · Answer

There is also Jaunt Java Web Scraping & JSON Querying - http://jaunt-api.com

User · Answer

mechanize for Java would be a good fit for this, and as Wadjy Essam mentioned, it uses jsoup for the HTML. mechanize is a stateful HTTP/HTML client that supports navigation, form submissions, and page scraping: http://gistlabs.com/software/mechanize-for-java/ (and the GitHub here: https://github.com/GistLabs/mechanize).

User · Answer

Look at an HTML parser such as TagSoup, HTMLCleaner or NekoHTML.

User · Answer

HtmlUnit can be used to do web scraping; it supports invoking pages and filling & submitting forms. I have used this in my project. It is a good Java library for web scraping. Read here for more.

User · Answer

For tasks of this type I usually use crawler4j + jsoup. With crawler4j I download the pages from a domain; you can specify which URLs to visit with a regular expression. With jsoup, I "parse" the HTML data that was found and downloaded with crawler4j. Normally you can also download data with jsoup, but crawler4j makes it easier to find links. Another advantage of using crawler4j is that it is multithreaded and you can configure the number of concurrent threads. https://github.com/yasserg/crawler4j/wiki
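
The idea of choosing which URLs to crawl with a regular expression can be sketched in plain Java as below. This is not crawler4j's API; the class `UrlFilter`, its method names, and the example patterns are hypothetical, and the sketch only shows the filtering decision a crawler would make for each discovered link.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class UrlFilter {

    // Static resources that are usually not worth downloading when scraping text.
    private static final Pattern STATIC_RESOURCE =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|pdf|zip)$", Pattern.CASE_INSENSITIVE);

    private final Pattern allowed;

    /** @param allowedRegex regular expression that a whole URL must match to be visited */
    UrlFilter(String allowedRegex) {
        this.allowed = Pattern.compile(allowedRegex);
    }

    /** True if the URL matches the allowed pattern and is not a static resource. */
    boolean shouldVisit(String url) {
        return allowed.matcher(url).matches() && !STATIC_RESOURCE.matcher(url).matches();
    }

    /** Keeps only the URLs that should be visited, preserving their order. */
    List<String> filter(List<String> urls) {
        return urls.stream().filter(this::shouldVisit).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        UrlFilter filter = new UrlFilter("https://example\\.com/articles/.*");
        List<String> found = List.of(
                "https://example.com/articles/42",
                "https://example.com/articles/logo.png",
                "https://example.com/about");
        System.out.println(filter.filter(found));
    }
}
```

In a real crawler the same check would typically run on every link extracted from a downloaded page, before that link is queued for a worker thread.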

User · Answer

If you wish to automate scraping of a large number of pages or a large amount of data, then you could try Gotz ETL. It is completely model driven, like a real ETL tool. Data structure, task workflow and pages to scrape are defined with a set of XML definition files, and no coding is required. Queries can be written using either selectors with jsoup or XPath with HtmlUnit.
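
For readers unfamiliar with the XPath style of query mentioned above, here is a small sketch using only the JDK's built-in XML and XPath classes. It is not Gotz ETL or HtmlUnit code, and it assumes the input is well-formed XHTML, which real-world pages often are not; the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class XPathQuery {

    /**
     * Parses well-formed XHTML and returns the string value of the first node
     * selected by the XPath expression (an empty string if nothing matches).
     */
    static String evaluate(String xhtml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            // Wrap parser and XPath errors so callers need no checked exceptions.
            throw new IllegalArgumentException("Could not parse input or evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Demo</title></head>"
                + "<body><h1 id=\"main\">Hi</h1></body></html>";
        System.out.println(evaluate(page, "/html/head/title"));
        System.out.println(evaluate(page, "//h1[@id='main']"));
    }
}
```

A CSS-selector query for the same heading would look like `h1#main`; which style reads better depends mostly on how deeply nested the target elements are.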
