Java HTML Parsing

Question

I'm working on an app which scrapes data from a website, and I was wondering how I should go about getting the data. Specifically, I need data contained in a number of div tags which use a specific CSS class. Currently (for testing purposes) I'm just checking for <div class="classname" in each line of HTML. This works, but I can't help but feel there is a better solution out there.

Is there any nice way where I could give a class a line of HTML and have some nice methods like:

    boolean usesClass(String CSSClassname);
    String getText();
    String getLink();

User · Answer

Another library that might be useful for HTML processing is jsoup. Jsoup tries to clean malformed HTML and allows HTML parsing in Java using jQuery-like tag selector syntax.

http://jsoup.org/

User · Answer

Jericho: http://jericho.htmlparser.net/docs/index.html

Easy to use, supports HTML that is not well-formed, and comes with a lot of examples.

User · Answer

The HTMLParser project (http://htmlparser.sourceforge.net/) might be a possibility. It seems to be pretty decent at handling malformed HTML. The following snippet should do what you need:

    Parser parser = new Parser(htmlInput);
    CssSelectorNodeFilter cssFilter =
        new CssSelectorNodeFilter("DIV." + targetClassName);
    NodeList nodes = parser.parse(cssFilter);

User · Answer

The nu.validator project is an excellent, high-performance HTML parser that doesn't cut corners correctness-wise.

"The Validator.nu HTML Parser is an implementation of the HTML5 parsing algorithm in Java. The parser is designed to work as a drop-in replacement for the XML parser in applications that already support XHTML 1.x content with an XML parser and use SAX, DOM or XOM to interface with the parser. Low-level functionality is provided for applications that wish to perform their own IO and support document.write() with scripting. The parser core compiles on Google Web Toolkit and can be automatically translated into C++. (The C++ translation capability is currently used for porting the parser for use in Gecko.)"

User · Answer

HTMLUnit might be of help. It does a lot more stuff too.

http://htmlunit.sourceforge.net/

User · Answer

If your HTML is well-formed, you can easily employ an XML parser to do the job for you. If you're only reading, SAX would be ideal.
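To give a feel for the SAX route, here is a minimal sketch, assuming the page is already well-formed XHTML. The names SaxDivExtractor and textOfDivs are made up for illustration, and only classes shipped with the JDK are used. It collects the text inside every div whose class attribute contains the target class:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library mentioned in this thread.
public class SaxDivExtractor {

    /** Collects the text of every div whose class attribute contains the target class. */
    private static final class DivHandler extends DefaultHandler {
        private final String targetClass;
        private final List<String> texts = new ArrayList<>();
        private StringBuilder current; // non-null only while inside a matching div
        private int depth;             // divs nested inside the matching div

        DivHandler(String targetClass) {
            this.targetClass = targetClass;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if (!"div".equals(qName)) {
                return;
            }
            if (current != null) { // a div nested inside a div we are already collecting
                depth++;
                return;
            }
            String classAttr = attrs.getValue("class");
            // The class attribute is a whitespace-separated list, so compare token by token.
            if (classAttr != null
                    && Arrays.asList(classAttr.trim().split("\\s+")).contains(targetClass)) {
                current = new StringBuilder();
                depth = 0;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current == null || !"div".equals(qName)) {
                return;
            }
            if (depth > 0) { // closing a nested div, keep collecting
                depth--;
                return;
            }
            texts.add(current.toString().trim());
            current = null;
        }
    }

    /** Returns the trimmed text of each matching div, in document order. */
    public static List<String> textOfDivs(String xhtml, String cssClass) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            DivHandler handler = new DivHandler(cssClass);
            parser.parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.texts;
        } catch (Exception e) {
            // A SAX parser rejects anything that is not well-formed XML.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body>"
                + "<div class=\"post tags\">java <b>html</b></div>"
                + "<div class=\"other\">ignored</div>"
                + "</body></html>";
        System.out.println(textOfDivs(xhtml, "tags"));
    }
}
```

Real-world pages are rarely well-formed, so in practice this only works after the markup has been cleaned up or when you control the source.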

User · Answer

You can also use XWiki HTML Cleaner. It uses HTMLCleaner and extends it to generate valid XHTML 1.1 content.

User · Answer

You might be interested in TagSoup, a Java HTML parser able to handle malformed HTML. XML parsers would work only on well-formed XHTML.

User · Answer

Several years ago I used JTidy for the same purpose:

http://jtidy.sourceforge.net/

"JTidy is a Java port of HTML Tidy, an HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.

JTidy was written by Andy Quick, who later stepped down from the maintainer position. Now JTidy is maintained by a group of volunteers.

More information on JTidy can be found on the JTidy SourceForge project page."

User · Answer

Let's not forget Jerry. It's jQuery in Java: a fast and concise Java library that simplifies HTML document parsing, traversing and manipulating; it includes support for CSS3 selectors.

Example:

    Jerry doc = jerry(html);
    doc.$("div#jodd p.neat").css("color", "red").addClass("ohmy");

Example:

    doc.form("#myform", new JerryFormHandler() {
        public void onForm(Jerry form, Map<String, String[]> parameters) {
            // process form and parameters
        }
    });

Of course, these are just some quick examples to give a feel for how it all looks.

User · Answer

The main problem, as stated by preceding comments, is malformed HTML, so an HTML cleaner or HTML-to-XML converter is a must. Once you get the XML code (XHTML), there are plenty of tools to handle it. You could process it with a simple SAX handler that extracts only the data you need, or with any tree-based method (DOM, JDOM, etc.) that even lets you modify the original code.

Here is some sample code that uses HtmlCleaner to get all DIVs that use a certain class and print out all the text content inside them:

    import java.io.IOException;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.TagNode;

    /**
     * @author Fernando Miguélez Palomo <fernandoDOTmiguelezATgmailDOTcom>
     */
    public class TestHtmlParse
    {
        static final String className = "tags";
        static final String url = "http://www.stackoverflow.com";

        TagNode rootNode;

        public TestHtmlParse(URL htmlPage) throws IOException
        {
            // Download the page and turn it into a cleaned-up node tree
            HtmlCleaner cleaner = new HtmlCleaner();
            rootNode = cleaner.clean(htmlPage);
        }

        List<TagNode> getDivsByClass(String CSSClassname)
        {
            List<TagNode> divList = new ArrayList<TagNode>();

            // "true" searches the whole tree, not just direct children
            TagNode[] divElements = rootNode.getElementsByName("div", true);
            for (int i = 0; divElements != null && i < divElements.length; i++)
            {
                // Note: this only matches when the class attribute is exactly
                // CSSClassname, not when it lists several classes
                String classType = divElements[i].getAttributeByName("class");
                if (classType != null && classType.equals(CSSClassname))
                {
                    divList.add(divElements[i]);
                }
            }

            return divList;
        }

        public static void main(String[] args)
        {
            try
            {
                TestHtmlParse thp = new TestHtmlParse(new URL(url));

                List<TagNode> divs = thp.getDivsByClass(className);
                System.out.println("*** Text of DIVs with class '" + className + "' at '" + url + "' ***");
                for (TagNode divElement : divs)
                {
                    System.out.println("Text child nodes of DIV: " + divElement.getText().toString());
                }
            }
            catch (Exception e)
            {
                e.printStackTrace();
            }
        }
    }
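For the tree-based route mentioned above, the same lookup can be written against a plain DOM once the markup is well-formed. The following is only a sketch under that assumption: XPathDivExtractor and textOfDivs are hypothetical names, it uses nothing but JDK classes, and it assumes the document does not declare an XML namespace. Unlike an exact string comparison, the XPath expression matches the class as one token of the class attribute:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; assumes well-formed, namespace-free XHTML.
public class XPathDivExtractor {

    /** Returns the trimmed text content of every div carrying the given class. */
    public static List<String> textOfDivs(String xhtml, String cssClass) {
        // The class name is spliced into the XPath string, so restrict it to safe characters.
        if (!cssClass.matches("[A-Za-z0-9_-]+")) {
            throw new IllegalArgumentException("Unsupported class name: " + cssClass);
        }
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Pad the normalized class attribute with spaces so that "tags"
            // matches "post tags" but not "tagsx".
            String expr = "//div[contains(concat(' ', normalize-space(@class), ' '), ' "
                    + cssClass + " ')]";
            NodeList divs = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc, XPathConstants.NODESET);

            List<String> texts = new ArrayList<>();
            for (int i = 0; i < divs.getLength(); i++) {
                texts.add(divs.item(i).getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query the input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body>"
                + "<div class=\"tags\">java</div>"
                + "<div class=\"post tags\"><p>html</p></div>"
                + "<div class=\"tagsx\">ignored</div>"
                + "</body></html>";
        System.out.println(textOfDivs(xhtml, "tags"));
    }
}
```

Because the whole tree is kept in memory, the matched nodes could also be modified or serialized again, which a streaming SAX handler does not allow.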
