For tasks of this type I usually use crawler4j + Jsoup.
With crawler4j I download the pages from a domain; you can restrict which URLs get visited with a regular expression.
With Jsoup, I then parse the HTML that crawler4j downloaded and pull out the data I'm looking for, for example like this:
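A rough sketch of that combination (the domain pattern, the `h1` selector, and the class names are just placeholders; the exact `shouldVisit` signature can differ slightly between crawler4j versions):

```java
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Regular expression that limits the crawl to the pages you care about
    private static final Pattern PAGE_PATTERN =
            Pattern.compile("https?://www\\.example\\.com/.*");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        return PAGE_PATTERN.matcher(url.getURL()).matches();
    }

    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData parseData = (HtmlParseData) page.getParseData();
            // Hand the raw HTML downloaded by crawler4j over to Jsoup
            Document doc = Jsoup.parse(parseData.getHtml(), page.getWebURL().getURL());
            String title = doc.select("h1").text();  // placeholder selector
            System.out.println(page.getWebURL().getURL() + " -> " + title);
        }
    }
}
```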
You can also download pages with Jsoup alone, but crawler4j makes it much easier to discover and follow links. Another advantage of crawler4j is that it is multithreaded, and you can configure the number of concurrent crawler threads (see the controller sketch below).
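For completeness, this is roughly how the controller side looks; the storage folder, seed URL, and thread count are placeholders you'd adapt to your setup:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class CrawlerMain {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j"); // intermediate crawl data (placeholder path)
        config.setPolitenessDelay(200);                 // ms between requests to the same host

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("https://www.example.com/"); // placeholder seed URL

        int numberOfCrawlers = 8; // number of concurrent crawler threads
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}
```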