[java] How to "scan" a website (or page) for info, and bring it into my program?

My answer won't probably be useful to the writer of this question (I am 8 months late so not the right timing I guess) but I think it will probably be useful for many other developers that might come across this answer.

Today, I just released (in the name of my company) an HTML to POJO complete framework that you can use to map HTML to any POJO class with simply some annotations. The library itself is quite handy and features many other things all the while being very pluggable. You can have a look to it right here : https://github.com/whimtrip/jwht-htmltopojo

How to use : Basics

Imagine we need to parse the following html page :

<html>
    <head>
        <title>A Simple HTML Document</title>
    </head>
    <body>
        <div class="restaurant">
            <h1>A la bonne Franquette</h1>
            <p>French cuisine restaurant for gourmet of fellow french people</p>
            <div class="location">
                <p>in <span>London</span></p>
            </div>
            <p>Restaurant n*18,190. Ranked 113 out of 1,550 restaurants</p>  
            <div class="meals">
                <div class="meal">
                    <p>Veal Cutlet</p>
                    <p rating-color="green">4.5/5 stars</p>
                    <p>Chef Mr. Frenchie</p>
                </div>

                <div class="meal">
                    <p>Ratatouille</p>
                    <p rating-color="orange">3.6/5 stars</p>
                    <p>Chef Mr. Frenchie and Mme. French-Cuisine</p>
                </div>

            </div> 
        </div>    
    </body>
</html>

Let's create the POJOs we want to map it to :

public class Restaurant {

    @Selector( value = "div.restaurant > h1")
    private String name;

    @Selector( value = "div.restaurant > p:nth-child(2)")
    private String description;

    @Selector( value = "div.restaurant > div:nth-child(3) > p > span")    
    private String location;    

    @Selector( 
        value = "div.restaurant > p:nth-child(4)"
        format = "^Restaurant n\*([0-9,]+). Ranked ([0-9,]+) out of ([0-9,]+) restaurants$",
        indexForRegexPattern = 1,
        useDeserializer = true,
        deserializer = ReplacerDeserializer.class,
        preConvert = true,
        postConvert = false
    )
    // so that the number becomes a valid number as they are shown in this format : 18,190
    @ReplaceWith(value = ",", with = "")
    private Long id;

    @Selector( 
        value = "div.restaurant > p:nth-child(4)"
        format = "^Restaurant n\*([0-9,]+). Ranked ([0-9,]+) out of ([0-9,]+) restaurants$",
        // This time, we want the second regex group and not the first one anymore
        indexForRegexPattern = 2,
        useDeserializer = true,
        deserializer = ReplacerDeserializer.class,
        preConvert = true,
        postConvert = false
    )
    // so that the number becomes a valid number as they are shown in this format : 18,190
    @ReplaceWith(value = ",", with = "")
    private Integer rank;

    @Selector(value = ".meal")    
    private List<Meal> meals;

    // getters and setters

}

And now the Meal class as well :

public class Meal {

    @Selector(value = "p:nth-child(1)")
    private String name;

    @Selector(
        value = "p:nth-child(2)",
        format = "^([0-9.]+)\/5 stars$",
        indexForRegexPattern = 1
    )
    private Float stars;

    @Selector(
        value = "p:nth-child(2)",
        // rating-color custom attribute can be used as well
        attr = "rating-color"
    )
    private String ratingColor;

    @Selector(
        value = "p:nth-child(3)"
    )
    private String chefs;

    // getters and setters.
}

We provided some more explanations on the above code on our github page.

For the moment, let's see how to scrap this.

private static final String MY_HTML_FILE = "my-html-file.html";

public static void main(String[] args) {


    HtmlToPojoEngine htmlToPojoEngine = HtmlToPojoEngine.create();

    HtmlAdapter<Restaurant> adapter = htmlToPojoEngine.adapter(Restaurant.class);

    // If they were several restaurants in the same page, 
    // you would need to create a parent POJO containing
    // a list of Restaurants as shown with the meals here
    Restaurant restaurant = adapter.fromHtml(getHtmlBody());

    // That's it, do some magic now!

}


private static String getHtmlBody() throws IOException {
    byte[] encoded = Files.readAllBytes(Paths.get(MY_HTML_FILE));
    return new String(encoded, Charset.forName("UTF-8"));

}

Another short example can be found here

Hope this will help someone out there!

Examples related to java

Under what circumstances can I call findViewById with an Options Menu / Action Bar item? How much should a function trust another function How to implement a simple scenario the OO way Two constructors How do I get some variable from another class in Java? this in equals method How to split a string in two and store it in a field How to do perspective fixing? String index out of range: 4 My eclipse won't open, i download the bundle pack it keeps saying error log

Examples related to html

Embed ruby within URL : Middleman Blog Please help me convert this script to a simple image slider Generating a list of pages (not posts) without the index file Why there is this "clear" class before footer? Is it possible to change the content HTML5 alert messages? Getting all files in directory with ajax DevTools failed to load SourceMap: Could not load content for chrome-extension How to set width of mat-table column in angular? How to open a link in new tab using angular? ERROR Error: Uncaught (in promise), Cannot match any routes. URL Segment

Examples related to web-scraping

Scraping: SSL: CERTIFICATE_VERIFY_FAILED error for http://en.wikipedia.org How to print an exception in Python 3? What should I use to open a url instead of urlopen in urllib3 Use Excel VBA to click on a button in Internet Explorer, when the button has no "name" associated How to use Python requests to fake a browser visit a.k.a and generate User Agent? Scraping data from website using vba Using BeautifulSoup to extract text without tags Is it ok to scrape data from Google results? What's the best way of scraping data from a website? Use getElementById on HTMLElement instead of HTMLDocument

Examples related to jsoup

How to "scan" a website (or page) for info, and bring it into my program?