[python] How to read html from a url in python 3

I looked at previous similar questions and got only more confused.

In python 3.4, I want to read an html page as a string, given the url.

In perl I do this with LWP::Simple, using get().

A matplotlib 1.3.1 example says: import urllib; u1=urllib.urlretrieve(url). python3 can't find urlretrieve.

I tried u1 = urllib.request.urlopen(url), which appears to get an HTTPResponse object, but I can't print it or get a length on it or index it.

u1.body doesn't exist. I can't find a description of the HTTPResponse in python3.

Is there an attribute in the HTTPResponse object which will give me the raw bytes of the html page?

(Irrelevant stuff from other questions include urllib2, which doesn't exist in my python, csv parsers, etc.)

Edit:

I found something in a prior question which partially (mostly) does the job:

u2 = urllib.request.urlopen('http://finance.yahoo.com/q?s=aapl&ql=1')

for lines in u2.readlines():
    print (lines)

I say 'partially' because I don't want to read separate lines, but just one big string.

I could just concatenate the lines, but every line printed has a character 'b' prepended to it.

Where does that come from?

Again, I suppose I could delete the first character before concatenating, but that does get to be a kloodge.

This question is related to python html url

The answer is


urllib.request.urlopen(url).read() should return you the raw HTML page as a string.


Note that Python3 does not read the html code as a string but as a bytearray, so you need to convert it to one with decode.

import urllib.request

fp = urllib.request.urlopen("http://www.python.org")
mybytes = fp.read()

mystr = mybytes.decode("utf8")
fp.close()

print(mystr)

Reading an html page with urllib is fairly simple to do. Since you want to read it as a single string I will show you.

Import urllib.request:

#!/usr/bin/python3.5

import urllib.request

Prepare our request

request = urllib.request.Request('http://www.w3schools.com')

Always use a "try/except" when requesting a web page as things can easily go wrong. urlopen() requests the page.

try:
    response = urllib.request.urlopen(request)
except:
    print("something wrong")

Type is a great function that will tell us what 'type' a variable is. Here, response is a http.response object.

print(type(response))

The read function for our response object will store the html as bytes to our variable. Again type() will verify this.

htmlBytes = response.read()

print(type(htmlBytes))

Now we use the decode function for our bytes variable to get a single string.

htmlStr = htmlBytes.decode("utf8")

print(type(htmlStr))

If you do want to split up this string into separate lines, you can do so with the split() function. In this form we can easily iterate through to print out the entire page or do any other processing.

htmlSplit = htmlStr.split('\n')

print(type(htmlSplit))

for line in htmlSplit:
    print(line)

Hopefully this provides a little more detailed of an answer. Python documentation and tutorials are great, I would use that as a reference because it will answer most questions you might have.


For python 2

import urllib
some_url = 'https://docs.python.org/2/library/urllib.html'
filehandle = urllib.urlopen(some_url)
print filehandle.read()

Try the 'requests' module, it's much simpler.

#pip install requests for installation

import requests

url = 'https://www.google.com/'
r = requests.get(url)
r.text

more info here > http://docs.python-requests.org/en/master/


import requests

url = requests.get("http://yahoo.com")
htmltext = url.text
print(htmltext)

This will work similar to urllib.urlopen.


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to html

Embed ruby within URL : Middleman Blog Please help me convert this script to a simple image slider Generating a list of pages (not posts) without the index file Why there is this "clear" class before footer? Is it possible to change the content HTML5 alert messages? Getting all files in directory with ajax DevTools failed to load SourceMap: Could not load content for chrome-extension How to set width of mat-table column in angular? How to open a link in new tab using angular? ERROR Error: Uncaught (in promise), Cannot match any routes. URL Segment

Examples related to url

What is the difference between URL parameters and query strings? Allow Access-Control-Allow-Origin header using HTML5 fetch API File URL "Not allowed to load local resource" in the Internet Browser Slack URL to open a channel from browser Getting absolute URLs using ASP.NET Core How do I load an HTTP URL with App Transport Security enabled in iOS 9? Adding form action in html in laravel React-router urls don't work when refreshing or writing manually URL for public Amazon S3 bucket How can I append a query parameter to an existing URL?