How to "scan" a website (or page) for info, and bring it into my program?

Question

Well, I'm pretty much trying to figure out how to pull information from a webpage and bring it into my program (in Java).

For example, if I know the exact page I want info from, for the sake of simplicity a Best Buy item page, how would I get the appropriate info I need off of that page? Like the title, price, description?

What would this process even be called? I have no idea where to even begin researching this.

Edit: Okay, I'm running a test for the JSoup (the one posted by BalusC), but I keep getting this error:

    Exception in thread "main" java.lang.NoSuchMethodError: java.util.LinkedList.peekFirst()Ljava/lang/Object;
        at org.jsoup.parser.TokenQueue.consumeWord(TokenQueue.java:209)
        at org.jsoup.parser.Parser.parseStartTag(Parser.java:117)
        at org.jsoup.parser.Parser.parse(Parser.java:76)
        at org.jsoup.parser.Parser.parse(Parser.java:51)
        at org.jsoup.Jsoup.parse(Jsoup.java:28)
        at org.jsoup.Jsoup.parse(Jsoup.java:56)
        at test.main(test.java:12)

I do have Apache Commons.

User · Answer

jsoup supports Java 1.5: https://github.com/tburch/jsoup/commit/d8ea84f46e009a7f144ee414a9fa73ea187019a3

Looks like that stack trace was a bug, and it has been fixed.

User · Answer

You could also try jARVEST.

It is based on a JRuby DSL over a pure-Java engine to spider-scrape-transform web sites.

Example:

Find all links inside a web page (wget and xpath are constructs of jARVEST's language):

    wget | xpath('//a/@href')

Inside a Java program:

    Jarvest jarvest = new Jarvest();
    String[] results = jarvest.exec(
        "wget | xpath('//a/@href')", // robot
        "http://www.google.com"      // inputs
    );
    for (String s : results) {
        System.out.println(s);
    }

User · Answer

Use an HTML parser like Jsoup. This has my preference above the other HTML parsers available in Java since it supports jQuery-like CSS selectors. Also, its class representing a list of nodes, Elements, implements Iterable so that you can iterate over it in an enhanced for loop (so there's no need to hassle with verbose Node and NodeList-like classes in the average Java DOM parser).

Here's a basic kickoff example (just put the latest Jsoup JAR file in the classpath):

    package com.stackoverflow.q2835505;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class Test {

        public static void main(String[] args) throws Exception {
            String url = "https://stackoverflow.com/questions/2835505";
            Document document = Jsoup.connect(url).get();

            String question = document.select("#question .post-text").text();
            System.out.println("Question: " + question);

            Elements answerers = document.select("#answers .user-details a");
            for (Element answerer : answerers) {
                System.out.println("Answerer: " + answerer.text());
            }
        }

    }

As you might have guessed, this prints your own question and the names of all answerers.

User · Answer

Look into the cURL library. I've never used it in Java, but I'm sure there must be bindings for it. Basically, what you'll do is send a cURL request to whatever page you want to "scrape". The request will return a string with the source code of the page. From there, you will use regex to parse whatever data you want from the source code. That's generally how you are going to do it.
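For what it's worth, here is a rough sketch of that fetch-then-regex flow without cURL bindings. It swaps in the JDK's own `java.net.http.HttpClient` (Java 11+) for the request, which is my substitution and not something the answer above prescribes. The class name `FetchAndRegex`, the helper names, and the sample markup are all made up for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FetchAndRegex {

    // Captures the contents of the first <title> element,
    // ignoring case and allowing line breaks inside the element.
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Downloads the page source as a string (needs Java 11+ and network access). */
    public static String fetch(String url) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    /** Returns the trimmed page title, or empty if no <title> element is found. */
    public static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) throws Exception {
        // With a URL argument the page is downloaded; otherwise a
        // hypothetical inline snippet is used so the demo runs offline.
        String html = args.length > 0
                ? fetch(args[0])
                : "<html><head><title> Example item </title></head><body></body></html>";
        System.out.println(extractTitle(html).orElse("(no title found)"));
    }
}
```

Regex tends to hold up for a single, well-delimited value like this; for anything nested, a real HTML parser is usually the safer route.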

User · Answer

This is referred to as screen scraping; Wikipedia has this article on the more specific web scraping. It can be a major challenge because there's some ugly, messed-up, broken-if-not-for-browser-cleverness HTML out there, so good luck!

User · Answer

The JSoup solution is great, but if you need to extract just something really simple it may be easier to use regex or String.indexOf. As others have already mentioned, the process is called scraping.
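To illustrate the String.indexOf route, here is a minimal sketch that pulls out the text between two known markers. The class name `BetweenMarkers`, the helper `between`, and the sample markup are invented for the example; a real item page will use different tags and class names.

```java
import java.util.Optional;

public class BetweenMarkers {

    /**
     * Returns the trimmed text between the first occurrence of start and the
     * next occurrence of end after it, or empty if either marker is missing.
     */
    public static Optional<String> between(String html, String start, String end) {
        int from = html.indexOf(start);
        if (from < 0) {
            return Optional.empty();
        }
        from += start.length();
        int to = html.indexOf(end, from);
        if (to < 0) {
            return Optional.empty();
        }
        return Optional.of(html.substring(from, to).trim());
    }

    public static void main(String[] args) {
        // Hypothetical markup standing in for a downloaded page.
        String html = "<div class=\"item\"><span class=\"price\"> $19.99 </span></div>";
        System.out.println(between(html, "<span class=\"price\">", "</span>").orElse("(not found)"));
        // prints "$19.99"
    }
}
```

This only works while the markers stay exactly the same, so it is best kept for pages whose markup you control or can check regularly.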

User · Answer

My answer probably won't be useful to the writer of this question (I am 8 months late, so not the right timing I guess), but I think it will probably be useful for many other developers who might come across this answer.

Today, I just released (in the name of my company) a complete HTML-to-POJO framework that you can use to map HTML to any POJO class with just some annotations. The library itself is quite handy and features many other things, all the while being very pluggable. You can have a look at it right here: https://github.com/whimtrip/jwht-htmltopojo

How to use: Basics

Imagine we need to parse the following HTML page:

    <html>
        <head>
            <title>A Simple HTML Document</title>
        </head>
        <body>
            <div class="restaurant">
                <h1>A la bonne Franquette</h1>
                <p>French cuisine restaurant for gourmet of fellow french people</p>
                <div class="location">
                    <p>in <span>London</span></p>
                </div>
                <p>Restaurant n*18,190. Ranked 113 out of 1,550 restaurants</p>
                <div class="meals">
                    <div class="meal">
                        <p>Veal Cutlet</p>
                        <p rating-color="green">4.5/5 stars</p>
                        <p>Chef Mr. Frenchie</p>
                    </div>

                    <div class="meal">
                        <p>Ratatouille</p>
                        <p rating-color="orange">3.6/5 stars</p>
                        <p>Chef Mr. Frenchie and Mme. French-Cuisine</p>
                    </div>
                </div>
            </div>
        </body>
    </html>

Let's create the POJOs we want to map it to:

    public class Restaurant {

        @Selector(value = "div.restaurant > h1")
        private String name;

        @Selector(value = "div.restaurant > p:nth-child(2)")
        private String description;

        @Selector(value = "div.restaurant > div:nth-child(3) > p > span")
        private String location;

        @Selector(
            value = "div.restaurant > p:nth-child(4)",
            format = "^Restaurant n\\*([0-9,]+). Ranked ([0-9,]+) out of ([0-9,]+) restaurants$",
            indexForRegexPattern = 1,
            useDeserializer = true,
            deserializer = ReplacerDeserializer.class,
            preConvert = true,
            postConvert = false
        )
        // so that the number becomes a valid number, as they are shown in this format: 18,190
        @ReplaceWith(value = ",", with = "")
        private Long id;

        @Selector(
            value = "div.restaurant > p:nth-child(4)",
            format = "^Restaurant n\\*([0-9,]+). Ranked ([0-9,]+) out of ([0-9,]+) restaurants$",
            // This time, we want the second regex group and not the first one anymore
            indexForRegexPattern = 2,
            useDeserializer = true,
            deserializer = ReplacerDeserializer.class,
            preConvert = true,
            postConvert = false
        )
        // so that the number becomes a valid number, as they are shown in this format: 18,190
        @ReplaceWith(value = ",", with = "")
        private Integer rank;

        @Selector(value = ".meal")
        private List<Meal> meals;

        // getters and setters

    }

And now the Meal class as well:

    public class Meal {

        @Selector(value = "p:nth-child(1)")
        private String name;

        @Selector(
            value = "p:nth-child(2)",
            format = "^([0-9.]+)/5 stars$",
            indexForRegexPattern = 1
        )
        private Float stars;

        @Selector(
            value = "p:nth-child(2)",
            // rating-color custom attribute can be used as well
            attr = "rating-color"
        )
        private String ratingColor;

        @Selector(
            value = "p:nth-child(3)"
        )
        private String chefs;

        // getters and setters

    }

We provided some more explanations of the above code on our GitHub page.

For the moment, let's see how to scrape this:

    private static final String MY_HTML_FILE = "my-html-file.html";

    public static void main(String[] args) throws IOException {

        HtmlToPojoEngine htmlToPojoEngine = HtmlToPojoEngine.create();

        HtmlAdapter<Restaurant> adapter = htmlToPojoEngine.adapter(Restaurant.class);

        // If there were several restaurants in the same page,
        // you would need to create a parent POJO containing
        // a list of Restaurants, as shown with the meals here
        Restaurant restaurant = adapter.fromHtml(getHtmlBody());

        // That's it, do some magic now!

    }

    private static String getHtmlBody() throws IOException {
        byte[] encoded = Files.readAllBytes(Paths.get(MY_HTML_FILE));
        return new String(encoded, Charset.forName("UTF-8"));
    }

Another short example can be found here.

Hope this will help someone out there!

User · Answer

I would use JTidy - it is similar to JSoup, but I don't know JSoup well. JTidy handles broken HTML and returns a w3c Document, so you can use this as a source to XSLT to extract the content you are really interested in. If you don't know XSLT, then you might as well go with JSoup, as the Document model is nicer to work with than w3c.

EDIT: A quick look on the JSoup website shows that JSoup may indeed be the better choice. It seems to support CSS selectors out of the box for extracting stuff from the document. This may be a lot easier to work with than getting into XSLT.
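As a rough idea of what working with a w3c Document looks like, the sketch below queries one with XPath (the selection language XSLT builds on) using only JDK classes. To keep it self-contained it does not call JTidy: the Document is built from already well-formed markup with `DocumentBuilder`, as a stand-in for whatever JTidy would return after cleaning a real page. The class name `W3cXPathDemo`, its helpers, and the sample markup are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class W3cXPathDemo {

    /** Parses well-formed markup into a w3c Document (stand-in for a tidied page). */
    public static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    /** Evaluates an XPath expression and returns the trimmed text of every matching node. */
    public static List<String> select(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent().trim());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical, already well-formed markup.
        String xhtml = "<html><body>"
                + "<h1 class=\"title\">Example item</h1>"
                + "<span class=\"price\">$19.99</span>"
                + "</body></html>";
        Document doc = parse(xhtml);
        System.out.println(select(doc, "//h1[@class='title']"));
        System.out.println(select(doc, "//span[@class='price']"));
    }
}
```

The NodeList loop is the kind of verbosity the other answers mention; CSS selectors in JSoup cover the same ground with less ceremony.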

User · Answer

You may use an HTML parser (many useful links here: java html parser).

The process is called "grabbing website content". Search "grab website content java" for further investigation.

User · Answer

You'd probably want to look at the HTML to see if you can find strings that are unique and near your text; then you can use line/char offsets to get to the data.

Could be awkward in Java, if there aren't any XML classes similar to the ones found in System.XML.Linq in C#.
