My answer won't probably be useful to the writer of this question (I am 8 months late so not the right timing I guess) but I think it will probably be useful for many other developers that might come across this answer.
Today, I just released (in the name of my company) an HTML to POJO complete framework that you can use to map HTML to any POJO class with simply some annotations. The library itself is quite handy and features many other things all the while being very pluggable. You can have a look to it right here : https://github.com/whimtrip/jwht-htmltopojo
Imagine we need to parse the following html page :
<html>
<head>
<title>A Simple HTML Document</title>
</head>
<body>
<div class="restaurant">
<h1>A la bonne Franquette</h1>
<p>French cuisine restaurant for gourmet of fellow french people</p>
<div class="location">
<p>in <span>London</span></p>
</div>
<p>Restaurant n*18,190. Ranked 113 out of 1,550 restaurants</p>
<div class="meals">
<div class="meal">
<p>Veal Cutlet</p>
<p rating-color="green">4.5/5 stars</p>
<p>Chef Mr. Frenchie</p>
</div>
<div class="meal">
<p>Ratatouille</p>
<p rating-color="orange">3.6/5 stars</p>
<p>Chef Mr. Frenchie and Mme. French-Cuisine</p>
</div>
</div>
</div>
</body>
</html>
Let's create the POJOs we want to map it to :
public class Restaurant {
@Selector( value = "div.restaurant > h1")
private String name;
@Selector( value = "div.restaurant > p:nth-child(2)")
private String description;
@Selector( value = "div.restaurant > div:nth-child(3) > p > span")
private String location;
@Selector(
value = "div.restaurant > p:nth-child(4)"
format = "^Restaurant n\*([0-9,]+). Ranked ([0-9,]+) out of ([0-9,]+) restaurants$",
indexForRegexPattern = 1,
useDeserializer = true,
deserializer = ReplacerDeserializer.class,
preConvert = true,
postConvert = false
)
// so that the number becomes a valid number as they are shown in this format : 18,190
@ReplaceWith(value = ",", with = "")
private Long id;
@Selector(
value = "div.restaurant > p:nth-child(4)"
format = "^Restaurant n\*([0-9,]+). Ranked ([0-9,]+) out of ([0-9,]+) restaurants$",
// This time, we want the second regex group and not the first one anymore
indexForRegexPattern = 2,
useDeserializer = true,
deserializer = ReplacerDeserializer.class,
preConvert = true,
postConvert = false
)
// so that the number becomes a valid number as they are shown in this format : 18,190
@ReplaceWith(value = ",", with = "")
private Integer rank;
@Selector(value = ".meal")
private List<Meal> meals;
// getters and setters
}
And now the Meal
class as well :
public class Meal {
@Selector(value = "p:nth-child(1)")
private String name;
@Selector(
value = "p:nth-child(2)",
format = "^([0-9.]+)\/5 stars$",
indexForRegexPattern = 1
)
private Float stars;
@Selector(
value = "p:nth-child(2)",
// rating-color custom attribute can be used as well
attr = "rating-color"
)
private String ratingColor;
@Selector(
value = "p:nth-child(3)"
)
private String chefs;
// getters and setters.
}
We provided some more explanations on the above code on our github page.
For the moment, let's see how to scrap this.
private static final String MY_HTML_FILE = "my-html-file.html";
public static void main(String[] args) {
HtmlToPojoEngine htmlToPojoEngine = HtmlToPojoEngine.create();
HtmlAdapter<Restaurant> adapter = htmlToPojoEngine.adapter(Restaurant.class);
// If they were several restaurants in the same page,
// you would need to create a parent POJO containing
// a list of Restaurants as shown with the meals here
Restaurant restaurant = adapter.fromHtml(getHtmlBody());
// That's it, do some magic now!
}
private static String getHtmlBody() throws IOException {
byte[] encoded = Files.readAllBytes(Paths.get(MY_HTML_FILE));
return new String(encoded, Charset.forName("UTF-8"));
}
Another short example can be found here
Hope this will help someone out there!