[java] Java HTML Parsing

I'm working on an app which scrapes data from a website and I was wondering how I should go about getting the data. Specifically I need data contained in a number of div tags which use a specific CSS class - Currently (for testing purposes) I'm just checking for

div class = "classname"

in each line of HTML - This works, but I can't help but feel there is a better solution out there.

Is there any nice way where I could give a class a line of HTML and have some nice methods like:

boolean usesClass(String CSSClassname);
String getText();
String getLink();

This question is related to java html parsing web-scraping

The answer is


You might be interested by TagSoup, a Java HTML parser able to handle malformed HTML. XML parsers would work only on well formed XHTML.


If your HTML is well-formed, you can easily employ an XML parser to do the job for you... If you're only reading, SAX would be ideal.


You might be interested by TagSoup, a Java HTML parser able to handle malformed HTML. XML parsers would work only on well formed XHTML.


If your HTML is well-formed, you can easily employ an XML parser to do the job for you... If you're only reading, SAX would be ideal.


The main problem as stated by preceding coments is malformed HTML, so an html cleaner or HTML-XML converter is a must. Once you get the XML code (XHTML) there are plenty of tools to handle it. You could get it with a simple SAX handler that extracts only the data you need or any tree-based method (DOM, JDOM, etc.) that let you even modify original code.

Here is a sample code that uses HTML cleaner to get all DIVs that use a certain class and print out all Text content inside it.

import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

/**
 * @author Fernando Miguélez Palomo <fernandoDOTmiguelezATgmailDOTcom>
 */
public class TestHtmlParse
{
    static final String className = "tags";
    static final String url = "http://www.stackoverflow.com";

    TagNode rootNode;

    public TestHtmlParse(URL htmlPage) throws IOException
    {
        HtmlCleaner cleaner = new HtmlCleaner();
        rootNode = cleaner.clean(htmlPage);
    }

    List getDivsByClass(String CSSClassname)
    {
        List divList = new ArrayList();

        TagNode divElements[] = rootNode.getElementsByName("div", true);
        for (int i = 0; divElements != null && i < divElements.length; i++)
        {
            String classType = divElements[i].getAttributeByName("class");
            if (classType != null && classType.equals(CSSClassname))
            {
                divList.add(divElements[i]);
            }
        }

        return divList;
    }

    public static void main(String[] args)
    {
        try
        {
            TestHtmlParse thp = new TestHtmlParse(new URL(url));

            List divs = thp.getDivsByClass(className);
            System.out.println("*** Text of DIVs with class '"+className+"' at '"+url+"' ***");
            for (Iterator iterator = divs.iterator(); iterator.hasNext();)
            {
                TagNode divElement = (TagNode) iterator.next();
                System.out.println("Text child nodes of DIV: " + divElement.getText().toString());
            }
        }
        catch(Exception e)
        {
            e.printStackTrace();
        }
    }
}

The main problem as stated by preceding coments is malformed HTML, so an html cleaner or HTML-XML converter is a must. Once you get the XML code (XHTML) there are plenty of tools to handle it. You could get it with a simple SAX handler that extracts only the data you need or any tree-based method (DOM, JDOM, etc.) that let you even modify original code.

Here is a sample code that uses HTML cleaner to get all DIVs that use a certain class and print out all Text content inside it.

import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

/**
 * @author Fernando Miguélez Palomo <fernandoDOTmiguelezATgmailDOTcom>
 */
public class TestHtmlParse
{
    static final String className = "tags";
    static final String url = "http://www.stackoverflow.com";

    TagNode rootNode;

    public TestHtmlParse(URL htmlPage) throws IOException
    {
        HtmlCleaner cleaner = new HtmlCleaner();
        rootNode = cleaner.clean(htmlPage);
    }

    List getDivsByClass(String CSSClassname)
    {
        List divList = new ArrayList();

        TagNode divElements[] = rootNode.getElementsByName("div", true);
        for (int i = 0; divElements != null && i < divElements.length; i++)
        {
            String classType = divElements[i].getAttributeByName("class");
            if (classType != null && classType.equals(CSSClassname))
            {
                divList.add(divElements[i]);
            }
        }

        return divList;
    }

    public static void main(String[] args)
    {
        try
        {
            TestHtmlParse thp = new TestHtmlParse(new URL(url));

            List divs = thp.getDivsByClass(className);
            System.out.println("*** Text of DIVs with class '"+className+"' at '"+url+"' ***");
            for (Iterator iterator = divs.iterator(); iterator.hasNext();)
            {
                TagNode divElement = (TagNode) iterator.next();
                System.out.println("Text child nodes of DIV: " + divElement.getText().toString());
            }
        }
        catch(Exception e)
        {
            e.printStackTrace();
        }
    }
}

HTMLUnit might be of help. It does a lot more stuff too.

http://htmlunit.sourceforge.net/1


Another library that might be useful for HTML processing is jsoup. Jsoup tries to clean malformed HTML and allows html parsing in Java using jQuery like tag selector syntax.

http://jsoup.org/


You might be interested by TagSoup, a Java HTML parser able to handle malformed HTML. XML parsers would work only on well formed XHTML.


The HTMLParser project (http://htmlparser.sourceforge.net/) might be a possibility. It seems to be pretty decent at handling malformed HTML. The following snippet should do what you need:

Parser parser = new Parser(htmlInput);
CssSelectorNodeFilter cssFilter = 
    new CssSelectorNodeFilter("DIV.targetClassName");
NodeList nodes = parser.parse(cssFilter);

You might be interested by TagSoup, a Java HTML parser able to handle malformed HTML. XML parsers would work only on well formed XHTML.


The nu.validator project is an excellent, high performance HTML parser that doesn't cut corners correctness-wise.

The Validator.nu HTML Parser is an implementation of the HTML5 parsing algorithm in Java. The parser is designed to work as a drop-in replacement for the XML parser in applications that already support XHTML 1.x content with an XML parser and use SAX, DOM or XOM to interface with the parser. Low-level functionality is provided for applications that wish to perform their own IO and support document.write() with scripting. The parser core compiles on Google Web Toolkit and can be automatically translated into C++. (The C++ translation capability is currently used for porting the parser for use in Gecko.)


HTMLUnit might be of help. It does a lot more stuff too.

http://htmlunit.sourceforge.net/1


The main problem as stated by preceding coments is malformed HTML, so an html cleaner or HTML-XML converter is a must. Once you get the XML code (XHTML) there are plenty of tools to handle it. You could get it with a simple SAX handler that extracts only the data you need or any tree-based method (DOM, JDOM, etc.) that let you even modify original code.

Here is a sample code that uses HTML cleaner to get all DIVs that use a certain class and print out all Text content inside it.

import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

/**
 * @author Fernando Miguélez Palomo <fernandoDOTmiguelezATgmailDOTcom>
 */
public class TestHtmlParse
{
    static final String className = "tags";
    static final String url = "http://www.stackoverflow.com";

    TagNode rootNode;

    public TestHtmlParse(URL htmlPage) throws IOException
    {
        HtmlCleaner cleaner = new HtmlCleaner();
        rootNode = cleaner.clean(htmlPage);
    }

    List getDivsByClass(String CSSClassname)
    {
        List divList = new ArrayList();

        TagNode divElements[] = rootNode.getElementsByName("div", true);
        for (int i = 0; divElements != null && i < divElements.length; i++)
        {
            String classType = divElements[i].getAttributeByName("class");
            if (classType != null && classType.equals(CSSClassname))
            {
                divList.add(divElements[i]);
            }
        }

        return divList;
    }

    public static void main(String[] args)
    {
        try
        {
            TestHtmlParse thp = new TestHtmlParse(new URL(url));

            List divs = thp.getDivsByClass(className);
            System.out.println("*** Text of DIVs with class '"+className+"' at '"+url+"' ***");
            for (Iterator iterator = divs.iterator(); iterator.hasNext();)
            {
                TagNode divElement = (TagNode) iterator.next();
                System.out.println("Text child nodes of DIV: " + divElement.getText().toString());
            }
        }
        catch(Exception e)
        {
            e.printStackTrace();
        }
    }
}

If your HTML is well-formed, you can easily employ an XML parser to do the job for you... If you're only reading, SAX would be ideal.


HTMLUnit might be of help. It does a lot more stuff too.

http://htmlunit.sourceforge.net/1


You can also use XWiki HTML Cleaner:

It uses HTMLCleaner and extends it to generate valid XHTML 1.1 content.


The HTMLParser project (http://htmlparser.sourceforge.net/) might be a possibility. It seems to be pretty decent at handling malformed HTML. The following snippet should do what you need:

Parser parser = new Parser(htmlInput);
CssSelectorNodeFilter cssFilter = 
    new CssSelectorNodeFilter("DIV.targetClassName");
NodeList nodes = parser.parse(cssFilter);

Another library that might be useful for HTML processing is jsoup. Jsoup tries to clean malformed HTML and allows html parsing in Java using jQuery like tag selector syntax.

http://jsoup.org/


You can also use XWiki HTML Cleaner:

It uses HTMLCleaner and extends it to generate valid XHTML 1.1 content.


The nu.validator project is an excellent, high performance HTML parser that doesn't cut corners correctness-wise.

The Validator.nu HTML Parser is an implementation of the HTML5 parsing algorithm in Java. The parser is designed to work as a drop-in replacement for the XML parser in applications that already support XHTML 1.x content with an XML parser and use SAX, DOM or XOM to interface with the parser. Low-level functionality is provided for applications that wish to perform their own IO and support document.write() with scripting. The parser core compiles on Google Web Toolkit and can be automatically translated into C++. (The C++ translation capability is currently used for porting the parser for use in Gecko.)


The main problem as stated by preceding coments is malformed HTML, so an html cleaner or HTML-XML converter is a must. Once you get the XML code (XHTML) there are plenty of tools to handle it. You could get it with a simple SAX handler that extracts only the data you need or any tree-based method (DOM, JDOM, etc.) that let you even modify original code.

Here is a sample code that uses HTML cleaner to get all DIVs that use a certain class and print out all Text content inside it.

import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

/**
 * @author Fernando Miguélez Palomo <fernandoDOTmiguelezATgmailDOTcom>
 */
public class TestHtmlParse
{
    static final String className = "tags";
    static final String url = "http://www.stackoverflow.com";

    TagNode rootNode;

    public TestHtmlParse(URL htmlPage) throws IOException
    {
        HtmlCleaner cleaner = new HtmlCleaner();
        rootNode = cleaner.clean(htmlPage);
    }

    List getDivsByClass(String CSSClassname)
    {
        List divList = new ArrayList();

        TagNode divElements[] = rootNode.getElementsByName("div", true);
        for (int i = 0; divElements != null && i < divElements.length; i++)
        {
            String classType = divElements[i].getAttributeByName("class");
            if (classType != null && classType.equals(CSSClassname))
            {
                divList.add(divElements[i]);
            }
        }

        return divList;
    }

    public static void main(String[] args)
    {
        try
        {
            TestHtmlParse thp = new TestHtmlParse(new URL(url));

            List divs = thp.getDivsByClass(className);
            System.out.println("*** Text of DIVs with class '"+className+"' at '"+url+"' ***");
            for (Iterator iterator = divs.iterator(); iterator.hasNext();)
            {
                TagNode divElement = (TagNode) iterator.next();
                System.out.println("Text child nodes of DIV: " + divElement.getText().toString());
            }
        }
        catch(Exception e)
        {
            e.printStackTrace();
        }
    }
}

Let's not forget Jerry, its jQuery in java: a fast and concise Java Library that simplifies HTML document parsing, traversing and manipulating; includes usage of css3 selectors.

Example:

Jerry doc = jerry(html);
doc.$("div#jodd p.neat").css("color", "red").addClass("ohmy");

Example:

doc.form("#myform", new JerryFormHandler() {
    public void onForm(Jerry form, Map<String, String[]> parameters) {
        // process form and parameters
    }
});

Of course, these are just some quick examples to get the feeling how it all looks like.


Jericho: http://jericho.htmlparser.net/docs/index.html

Easy to use, supports not well formed HTML, a lot of examples.


The HTMLParser project (http://htmlparser.sourceforge.net/) might be a possibility. It seems to be pretty decent at handling malformed HTML. The following snippet should do what you need:

Parser parser = new Parser(htmlInput);
CssSelectorNodeFilter cssFilter = 
    new CssSelectorNodeFilter("DIV.targetClassName");
NodeList nodes = parser.parse(cssFilter);

Jericho: http://jericho.htmlparser.net/docs/index.html

Easy to use, supports not well formed HTML, a lot of examples.


Let's not forget Jerry, its jQuery in java: a fast and concise Java Library that simplifies HTML document parsing, traversing and manipulating; includes usage of css3 selectors.

Example:

Jerry doc = jerry(html);
doc.$("div#jodd p.neat").css("color", "red").addClass("ohmy");

Example:

doc.form("#myform", new JerryFormHandler() {
    public void onForm(Jerry form, Map<String, String[]> parameters) {
        // process form and parameters
    }
});

Of course, these are just some quick examples to get the feeling how it all looks like.


If your HTML is well-formed, you can easily employ an XML parser to do the job for you... If you're only reading, SAX would be ideal.


HTMLUnit might be of help. It does a lot more stuff too.

http://htmlunit.sourceforge.net/1


Examples related to java

Under what circumstances can I call findViewById with an Options Menu / Action Bar item? How much should a function trust another function How to implement a simple scenario the OO way Two constructors How do I get some variable from another class in Java? this in equals method How to split a string in two and store it in a field How to do perspective fixing? String index out of range: 4 My eclipse won't open, i download the bundle pack it keeps saying error log

Examples related to html

Embed ruby within URL : Middleman Blog Please help me convert this script to a simple image slider Generating a list of pages (not posts) without the index file Why there is this "clear" class before footer? Is it possible to change the content HTML5 alert messages? Getting all files in directory with ajax DevTools failed to load SourceMap: Could not load content for chrome-extension How to set width of mat-table column in angular? How to open a link in new tab using angular? ERROR Error: Uncaught (in promise), Cannot match any routes. URL Segment

Examples related to parsing

Got a NumberFormatException while trying to parse a text file for objects Uncaught SyntaxError: Unexpected end of JSON input at JSON.parse (<anonymous>) Python/Json:Expecting property name enclosed in double quotes Correctly Parsing JSON in Swift 3 How to get response as String using retrofit without using GSON or any other library in android UIButton action in table view cell "Expected BEGIN_OBJECT but was STRING at line 1 column 1" How to convert an XML file to nice pandas dataframe? How to extract multiple JSON objects from one file? How to sum digits of an integer in java?

Examples related to web-scraping

Scraping: SSL: CERTIFICATE_VERIFY_FAILED error for http://en.wikipedia.org How to print an exception in Python 3? What should I use to open a url instead of urlopen in urllib3 Use Excel VBA to click on a button in Internet Explorer, when the button has no "name" associated How to use Python requests to fake a browser visit a.k.a and generate User Agent? Scraping data from website using vba Using BeautifulSoup to extract text without tags Is it ok to scrape data from Google results? What's the best way of scraping data from a website? Use getElementById on HTMLElement instead of HTMLDocument