[java] Web scraping with Java

I'm not able to find any good web scraping Java based API. The site which I need to scrape does not provide any API as well; I want to iterate over all web pages using some pageID and extract the HTML titles / other stuff in their DOM trees.

Are there ways other than web scraping?

This question is related to java web-scraping frameworks

The answer is


mechanize for Java would be a good fit for this, and as Wadjy Essam mentioned it uses JSoup for the HMLT. mechanize is a stageful HTTP/HTML client that supports navigation, form submissions, and page scraping.

http://gistlabs.com/software/mechanize-for-java/ (and the GitHub here https://github.com/GistLabs/mechanize)


If you wish to automate scraping of large amount pages or data, then you could try Gotz ETL.

It is completely model driven like a real ETL tool. Data structure, task workflow and pages to scrape are defined with a set of XML definition files and no coding is required. Query can be written either using Selectors with JSoup or XPath with HtmlUnit.


You might look into jwht-scrapper!

This is a complete scrapping framework that has all the features a developper could expect from a web scrapper:

It works with (jwht-htmltopojo)[https://github.com/whimtrip/jwht-htmltopojo) lib which itsef uses Jsoup mentionned by several other people here.

Together they will help you built awesome scrappers mapping directly HTML to POJOs and bypassing any classical scrapping problems in only a matter of minutes!

Hope this might help some people here!

Disclaimer, I am the one who developed it, feel free to let me know your remarks!


HTMLUnit can be used to do web scraping, it supports invoking pages, filling & submitting forms. I have used this in my project. It is good java library for web scraping. read here for more


Your best bet is to use Selenium Web Driver since it

  1. Provides visual feedback to the coder (see your scraping in action, see where it stops)

  2. Accurate and Consistent as it directly controls the browser you use.

  3. Slow. Doesn't hit web pages like HtmlUnit does but sometimes you don't want to hit too fast.

    Htmlunit is fast but is horrible at handling Javascript and AJAX.


Normally I use selenium, which is software for testing automation. You can control a browser through a webdriver, so you will not have problems with javascripts and it is usually not very detected if you use the full version. Headless browsers can be more identified.


Look at an HTML parser such as TagSoup, HTMLCleaner or NekoHTML.


There is also Jaunt Java Web Scraping & JSON Querying - http://jaunt-api.com


For tasks of this type I usually use Crawller4j + Jsoup.

With crawler4j I download the pages from a domain, you can specify which ULR with a regular expression.

With jsoup, I "parsed" the html data you have searched for and downloaded with crawler4j.

Normally you can also download data with jsoup, but Crawler4J makes it easier to find links. Another advantage of using crawler4j is that it is multithreaded and you can configure the number of concurrent threads

https://github.com/yasserg/crawler4j/wiki


Examples related to java

Under what circumstances can I call findViewById with an Options Menu / Action Bar item? How much should a function trust another function How to implement a simple scenario the OO way Two constructors How do I get some variable from another class in Java? this in equals method How to split a string in two and store it in a field How to do perspective fixing? String index out of range: 4 My eclipse won't open, i download the bundle pack it keeps saying error log

Examples related to web-scraping

Scraping: SSL: CERTIFICATE_VERIFY_FAILED error for http://en.wikipedia.org How to print an exception in Python 3? What should I use to open a url instead of urlopen in urllib3 Use Excel VBA to click on a button in Internet Explorer, when the button has no "name" associated How to use Python requests to fake a browser visit a.k.a and generate User Agent? Scraping data from website using vba Using BeautifulSoup to extract text without tags Is it ok to scrape data from Google results? What's the best way of scraping data from a website? Use getElementById on HTMLElement instead of HTMLDocument

Examples related to frameworks

Undefined Symbols error when integrating Apptentive iOS SDK via Cocoapods Vue.js—Difference between v-model and v-bind OS X Framework Library not loaded: 'Image not found' How to get root directory in yii2 Laravel blank white screen Could not load file or assembly 'System, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089' or one of its dependencies How to filter (key, value) with ng-repeat in AngularJs? Microsoft .NET 3.5 Full download Using CSS in Laravel views? Difference between framework vs Library vs IDE vs API vs SDK vs Toolkits?