Python Web Crawlers and getting html source code

Question

So my brother wanted me to write a web crawler in Python (self-taught), and I know C++, Java, and a bit of HTML. I'm using version 2.7 and reading the Python library docs, but I have a few problems:

1. The httplib.HTTPConnection and request concept is new to me, and I don't understand whether it downloads an HTML script, like a cookie, or an instance. If you do both of those, do you get the source for a website page?

And what are some terms I would need to know to modify the page and return the modified page?

Just for background, I need to download a page and replace any img with ones I have. And it would be nice if you guys could tell me your opinion of 2.7 and 3.1.

User · Answer

The first thing you need to do is read the HTTP spec which will explain what you can expect to receive over the wire. The data returned inside the content will be the "rendered" web page, not the source. The source could be a JSP, a servlet, a CGI script, in short, just about anything, and you have no access to that. You only get the HTML that the server sent you. In the case of a static HTML page, then yes, you will be seeing the "source". But for anything else you see the generated HTML, not the source.
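To make the request/response idea concrete, here is a minimal sketch, not a definitive implementation. It uses Python 3, where the httplib module from the question is named http.client. So that it runs without internet access, it starts a throwaway server on 127.0.0.1; the names PAGE, DemoHandler, fetch, and demo are made up for this illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page served by the throwaway local server below.
PAGE = b"<html><body><h1>Hello</h1></body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The server decides what bytes to send; the client never sees
        # whatever script or template produced them.
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch(host, port, path="/"):
    """Send one GET request and return (status, content type, body bytes)."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)   # the request: method + path (+ headers)
        resp = conn.getresponse()   # the response: status, headers, body
        return resp.status, resp.getheader("Content-Type"), resp.read()
    finally:
        conn.close()


def demo():
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch("127.0.0.1", server.server_port)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, content_type, body = demo()
    print(status, content_type)
    print(body.decode("utf-8"))
```

The body returned by fetch is exactly the HTML the server chose to send, which is the point made above: for a static page that is the source, for anything generated it is only the output.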

When you say "modify the page and return the modified page", what do you mean?

User · Answer

If you are using Python 3.x or later, you don't need to install any libraries; this is built directly into the standard library. The old urllib2 package was folded into urllib (as urllib.request and urllib.error):

    from urllib import request

    response = request.urlopen("https://www.google.com")
    # set the correct charset below
    page_source = response.read().decode('utf-8')
    print(page_source)
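On the "set the correct charset" comment: rather than hard-coding utf-8, one option is to ask the response which charset the server declared. The sketch below assumes Python 3.4 or later and uses a data: URL as a stand-in for a real page so that it runs offline; the helper name read_text and the utf-8 fallback are choices made for this illustration.

```python
from urllib import request


def read_text(url, fallback="utf-8"):
    """Fetch url and decode the body with the charset named in Content-Type."""
    response = request.urlopen(url)
    try:
        raw = response.read()
        # response.headers is an email.message.Message; get_content_charset()
        # returns None when the server did not declare a charset.
        charset = response.headers.get_content_charset() or fallback
    finally:
        response.close()
    return raw.decode(charset)


if __name__ == "__main__":
    # A data: URL stands in for a page served as ISO-8859-1 (%E9 is one byte).
    print(read_text("data:text/html;charset=iso-8859-1,caf%E9"))
```

Pages can also declare their encoding in a meta tag, which this sketch does not inspect.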

User · Answer

An example with Python 3 and the requests library, as mentioned by @leoluk:

    pip install requests

Script req.py:

    import requests

    url = 'http://localhost'

    # in case you need a session
    cd = {'sessionid': '123'}

    r = requests.get(url, cookies=cd)
    # or without a session:
    # r = requests.get(url)

    # r.content holds the raw bytes; r.text is the decoded string
    print(r.text)

Now execute it and you will get the HTML source of localhost:

    python3 req.py

User · Answer

Use Python 2.7; it has more 3rd-party libs at the moment. (Edit: see below.)

I recommend using the stdlib module urllib2; it will allow you to comfortably get web resources. Example:

    import urllib2

    response = urllib2.urlopen("http://google.de")
    page_source = response.read()

For parsing the code, have a look at BeautifulSoup.

BTW, what exactly do you want to do?

    "Just for background, I need to download a page and replace any img with ones I have"

Edit: It's 2014 now, most of the important libraries have been ported, and you should definitely use Python 3 if you can. python-requests is a very nice high-level library which is easier to use than urllib2.
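For the stated goal of replacing every img, here is a rough sketch that uses only the standard library's html.parser instead of BeautifulSoup, so nothing needs to be installed. It assumes Python 3; the names ImgSwapper and swap_images and the replacement path mine.png are invented for the example. It re-emits the page as it parses and rewrites the src attribute of each img tag. It does not preserve the original attribute quoting or processing instructions, so treat it as a starting point.

```python
from html import escape
from html.parser import HTMLParser


class ImgSwapper(HTMLParser):
    """Re-emit HTML as parsed, replacing the src of every <img> tag."""

    def __init__(self, new_src):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.new_src = new_src
        self.out = []

    def _emit_tag(self, tag, attrs, closer=">"):
        if tag == "img":
            attrs = [(k, self.new_src if k == "src" else v) for k, v in attrs]
        parts = [tag]
        for name, value in attrs:
            if value is None:  # bare attribute such as "disabled"
                parts.append(name)
            else:  # the parser unescaped the value, so escape it again
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + closer)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, closer=" />")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def swap_images(html_text, new_src):
    """Return html_text with every <img> pointing at new_src."""
    parser = ImgSwapper(new_src)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    page = '<p>Hi &amp; bye</p><img src="a.png" alt="x">'
    print(swap_images(page, "mine.png"))
```

The string passed to swap_images would be the decoded page source obtained with urllib or requests as in the answers above.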
