How do you Programmatically Download a Webpage in Java

Question

I would like to be able to fetch a web page's HTML and save it to a String, so I can do some processing on it. Also, how could I handle various types of compression? How would I go about doing that using Java?

User · Answer

This class may help: it downloads the page source and filters out some information.

    public class MainActivity extends AppCompatActivity {

        EditText url;

        @Override
        protected void onCreate(Bundle savedInstanceState) {
            super.onCreate(savedInstanceState);
            setContentView(R.layout.activity_main);

            url = (EditText) findViewById(R.id.editText);
            DownloadCode obj = new DownloadCode();

            try {
                String tag1 = "<div class=\"description\">";
                // get() blocks until the background download has finished
                String l = obj.execute("http://www.nu.edu.pk/Campus/Chiniot-Faisalabad/Faculty").get();

                // keep only the text between the opening tag and the next </div>
                String[] t1 = l.split(tag1);
                if (t1.length > 1) {
                    String[] t2 = t1[1].split("</div>");
                    url.setText(t2[0]);
                }
            } catch (Exception e) {
                Toast.makeText(this, e.toString(), Toast.LENGTH_SHORT).show();
            }
        }

        // AsyncTask<input, progress, output>
        class DownloadCode extends AsyncTask<String, Void, String> {

            @Override
            protected String doInBackground(String... WebAddress) { // web addresses, passed as varargs
                StringBuilder htmlcontent = new StringBuilder();
                try {
                    URL url = new URL(WebAddress[0]);
                    HttpURLConnection c = (HttpURLConnection) url.openConnection();
                    c.connect();
                    InputStream input = c.getInputStream();
                    InputStreamReader reader = new InputStreamReader(input);

                    // read the response one character at a time
                    int data = reader.read();
                    while (data != -1) {
                        htmlcontent.append((char) data);
                        data = reader.read();
                    }
                } catch (Exception e) {
                    Log.i("Status : ", e.toString());
                }
                return htmlcontent.toString();
            }
        }
    }
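The filtering step in this answer can also be done without `split`, which treats its argument as a regular expression. The following is only a minimal sketch in plain Java; the class name `HtmlSnippets` and the method `between` are hypothetical names chosen for illustration, not part of the answer.

```java
final class HtmlSnippets {
    private HtmlSnippets() {}

    /**
     * Returns the text between the first occurrence of startMarker and the
     * next occurrence of endMarker, or null if either marker is missing.
     * (Hypothetical helper; plain substring search, no HTML parsing.)
     */
    static String between(String html, String startMarker, String endMarker) {
        int start = html.indexOf(startMarker);
        if (start < 0) {
            return null;
        }
        start += startMarker.length();
        int end = html.indexOf(endMarker, start);
        if (end < 0) {
            return null;
        }
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        // Sample markup made up for the demonstration
        String page = "<html><div class=\"description\">Faculty list</div></html>";
        System.out.println(between(page, "<div class=\"description\">", "</div>"));
    }
}
```

This works only for simple, predictable markup; nested `div` elements would need a real HTML parser.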

User · Answer

Well, you could go with the built-in libraries such as URL and URLConnection, but they don't give very much control. Personally I'd go with the Apache HTTPClient library.

Edit: HTTPClient has been set to end of life by Apache. The replacement is: HTTP Components

User · Answer

Here's some tested code using Java's URL class. I'd recommend doing a better job than I do here of handling the exceptions or passing them up the call stack, though.

    public static void main(String[] args) {
        URL url;
        InputStream is = null;
        BufferedReader br;
        String line;

        try {
            url = new URL("http://stackoverflow.com/");
            is = url.openStream();  // throws an IOException
            br = new BufferedReader(new InputStreamReader(is));

            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        } catch (MalformedURLException mue) {
            mue.printStackTrace();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        } finally {
            try {
                if (is != null) is.close();
            } catch (IOException ioe) {
                // nothing to see here
            }
        }
    }
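Since the question asks for the page as a String rather than printed output, the same approach can be adapted to collect the bytes and decode them once. This is a rough sketch, not the answer's own code: the class `PageReader` and its methods are hypothetical names, and UTF-8 is an assumed encoding that will not be right for every page.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class PageReader {
    private PageReader() {}

    /** Reads the whole stream and decodes it with the given charset. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }

    /** Fetches a page into a String; UTF-8 is an assumption, not detected. */
    static String fetch(String address) throws IOException {
        // try-with-resources closes the stream even if reading fails
        try (InputStream in = URI.create(address).toURL().openStream()) {
            return readAll(in, StandardCharsets.UTF_8);
        }
    }
}
```

Calling `PageReader.fetch("http://stackoverflow.com/")` would then return the raw HTML, with any `IOException` passed up the call stack instead of being swallowed.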

User · Answer

You'd most likely need to extract code from a secure web page (https protocol). In the following example, the HTML file is saved to c:\temp\filename.html. Enjoy!

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.URL;

    import javax.net.ssl.HttpsURLConnection;

    /**
     * <b>Get the Html source from the secure url </b>
     */
    public class HttpsClientUtil {
        public static void main(String[] args) throws Exception {
            String httpsURL = "https://stackoverflow.com";
            String FILENAME = "c:\\temp\\filename.html";
            BufferedWriter bw = new BufferedWriter(new FileWriter(FILENAME));
            URL myurl = new URL(httpsURL);
            HttpsURLConnection con = (HttpsURLConnection) myurl.openConnection();
            con.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0");
            InputStream ins = con.getInputStream();
            InputStreamReader isr = new InputStreamReader(ins, "Windows-1252");
            BufferedReader in = new BufferedReader(isr);
            String inputLine;

            // Write each line into the file
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
                bw.write(inputLine);
                bw.newLine();  // readLine() strips the line terminator
            }
            in.close();
            bw.close();
        }
    }
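The example above decodes the response as Windows-1252 regardless of what the server sends. One possible refinement is to read the charset from the Content-Type header when it is present. The sketch below is an illustration under that assumption; `ContentTypeCharset` and `charsetOf` are hypothetical names, and the parsing is deliberately simple rather than a full header parser.

```java
import java.nio.charset.Charset;

final class ContentTypeCharset {
    private ContentTypeCharset() {}

    /**
     * Extracts the charset parameter from a Content-Type value such as
     * "text/html; charset=UTF-8". Returns the fallback when the header is
     * missing, has no charset parameter, or names an unknown charset.
     */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // case-insensitive check for the "charset=" prefix
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }
}
```

With a `URLConnection`, the header value is available from `con.getContentType()`, so the reader could be built as `new InputStreamReader(ins, ContentTypeCharset.charsetOf(con.getContentType(), StandardCharsets.UTF_8))`.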

User · Answer

I'd use a decent HTML parser like Jsoup. It's then as easy as:

    String html = Jsoup.connect("http://stackoverflow.com").get().html();

It handles GZIP and chunked responses and character encoding fully transparently. It offers more advantages as well, like HTML traversal and manipulation by CSS selectors, as jQuery can do. You only have to grab it as a Document, not as a String.

    Document document = Jsoup.connect("http://google.com").get();

You really don't want to run basic String methods or even regex on HTML to process it.

See also: What are the pros and cons of leading HTML parsers in Java?

User · Answer

Bill's answer is very good, but you may want to do some things with the request, like compression or user-agents. The following code shows how you can add various types of compression to your requests.

    URL url = new URL(urlStr);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection(); // Cast shouldn't fail
    HttpURLConnection.setFollowRedirects(true);
    // allow both GZip and Deflate (ZLib) encodings
    conn.setRequestProperty("Accept-Encoding", "gzip, deflate");
    String encoding = conn.getContentEncoding();
    InputStream inStr = null;

    // create the appropriate stream wrapper based on
    // the encoding type
    if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
        inStr = new GZIPInputStream(conn.getInputStream());
    } else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
        inStr = new InflaterInputStream(conn.getInputStream(),
            new Inflater(true));
    } else {
        inStr = conn.getInputStream();
    }

To also set the user-agent, add the following code:

    conn.setRequestProperty("User-agent", "my agent name");
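The stream-selection logic from this answer can be pulled into a small helper so it can be exercised without a network connection. This is a sketch only: `ContentDecoding` and `decode` are hypothetical names, and the "deflate" branch follows the answer's choice of a raw (headerless) inflater, which is an assumption about what the server sends.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

final class ContentDecoding {
    private ContentDecoding() {}

    /**
     * Wraps the raw response stream according to the Content-Encoding
     * header value. Unknown or missing encodings are passed through as-is.
     */
    static InputStream decode(String contentEncoding, InputStream raw) throws IOException {
        if (contentEncoding == null) {
            return raw;
        }
        switch (contentEncoding.trim().toLowerCase(Locale.ROOT)) {
            case "gzip":
                return new GZIPInputStream(raw);
            case "deflate":
                // Inflater(true) expects raw deflate data, as in the answer above
                return new InflaterInputStream(raw, new Inflater(true));
            default:
                return raw;
        }
    }
}
```

With the connection from the answer, the call would be `ContentDecoding.decode(conn.getContentEncoding(), conn.getInputStream())`.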

User · Answer

Jetty has an HTTP client which can be used to download a web page.

    package com.zetcode;

    import org.eclipse.jetty.client.HttpClient;
    import org.eclipse.jetty.client.api.ContentResponse;

    public class ReadWebPageEx5 {

        public static void main(String[] args) throws Exception {

            HttpClient client = null;

            try {
                client = new HttpClient();
                client.start();

                String url = "http://www.something.com";

                ContentResponse res = client.GET(url);

                System.out.println(res.getContentAsString());
            } finally {
                if (client != null) {
                    client.stop();
                }
            }
        }
    }

The example prints the contents of a simple web page.

In a Reading a web page in Java tutorial I have written six examples of downloading a web page programmatically in Java using URL, JSoup, HtmlCleaner, Apache HttpClient, Jetty HttpClient, and HtmlUnit.

User · Answer

On a Unix/Linux box you could just run 'wget', but this is not really an option if you're writing a cross-platform client. Of course this assumes that you don't really want to do much with the data you download between the point of downloading it and it hitting the disk.

User · Answer

None of the approaches mentioned above downloads the web page text as it looks in the browser. These days a lot of data is loaded into browsers through scripts in HTML pages, and none of the techniques above supports scripts; they download the HTML text only. HtmlUnit supports JavaScript, so if you want to download the web page text as it looks in the browser, you should use HtmlUnit.

User · Answer

I used the actual answer to this post (url) and wrote the output into a file.

    package test;

    import java.net.*;
    import java.io.*;

    public class PDFTest {
        public static void main(String[] args) throws Exception {
            URL oracle = new URL("http://www.fetagracollege.org");
            String fileName = "D:\\a_01\\output.txt";

            // try-with-resources closes both streams, which also flushes the writer
            try (BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));
                 PrintWriter writer = new PrintWriter(fileName, "UTF-8")) {
                String inputLine;
                while ((inputLine = in.readLine()) != null) {
                    System.out.println(inputLine);
                    writer.println(inputLine);
                }
            }
        }
    }

User · Answer

To do so using NIO.2's powerful Files.copy(InputStream in, Path target):

    URL url = new URL("http://download.me/");
    Files.copy(url.openStream(), Paths.get("downloaded.html"));
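Two details are worth handling around this call: `Files.copy` throws `FileAlreadyExistsException` if the target exists unless `REPLACE_EXISTING` is passed, and the stream returned by `openStream()` should be closed. The following is a minimal sketch under those assumptions; `PageSaver`, `save`, and `download` are hypothetical names used for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class PageSaver {
    private PageSaver() {}

    /** Copies the stream to the target file, overwriting it; returns bytes written. */
    static long save(InputStream in, Path target) throws IOException {
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Downloads the address to the target file and closes the connection stream. */
    static long download(String address, Path target) throws IOException {
        try (InputStream in = URI.create(address).toURL().openStream()) {
            return save(in, target);
        }
    }
}
```

For example, `PageSaver.download("http://download.me/", Paths.get("downloaded.html"))` would save the page and return the number of bytes written.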
