I was trying to scrap a website for practice, but I kept on getting the HTTP Error 403 (does it think I'm a bot)?
Here is my code:
#import requests
import urllib.request
from bs4 import BeautifulSoup
#from urllib import urlopen
import re
webpage = urllib.request.urlopen('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1').read
findrows = re.compile('<tr class="- banding(?:On|Off)>(.*?)</tr>')
findlink = re.compile('<a href =">(.*)</a>')
row_array = re.findall(findrows, webpage)
links = re.finall(findlink, webpate)
print(len(row_array))
iterator = []
The error I get is:
File "C:\Python33\lib\urllib\request.py", line 160, in urlopen
return opener.open(url, data, timeout)
File "C:\Python33\lib\urllib\request.py", line 479, in open
response = meth(req, response)
File "C:\Python33\lib\urllib\request.py", line 591, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python33\lib\urllib\request.py", line 517, in error
return self._call_chain(*args)
File "C:\Python33\lib\urllib\request.py", line 451, in _call_chain
result = func(*args)
File "C:\Python33\lib\urllib\request.py", line 599, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
This question is related to
python
http
web
http-status-code-403
Based on previous answers this has worked for me with Python 3.7
from urllib.request import Request, urlopen
req = Request('Url_Link', headers={'User-Agent': 'XYZ/3.0'})
webpage = urlopen(req, timeout=10).read()
print(webpage)
If you feel guilty about faking the user-agent as Mozilla (comment in the top answer from Stefano), it could work with a non-urllib User-Agent as well. This worked for the sites I reference:
req = urlrequest.Request(link, headers={'User-Agent': 'XYZ/3.0'})
urlrequest.urlopen(req, timeout=10).read()
My application is to test validity by scraping specific links that I refer to, in my articles. Not a generic scraper.
"This is probably because of mod_security or some similar server security feature which blocks known
spider/bot
user agents (urllib uses something like python urllib/3.3.0, it's easily detected)" - as already mentioned by Stefano Sanfilippo
from urllib.request import Request, urlopen
url="https://stackoverflow.com/search?q=html+error+403"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()
webpage = web_byte.decode('utf-8')
The web_byte is a byte object returned by the server and the content type present in webpage is mostly utf-8. Therefore you need to decode web_byte using decode method.
This solves complete problem while I was having trying to scrap from a website using PyCharm
P.S -> I use python 3.4
You can try in two ways. The detail is in this link.
1) Via pip
pip install --upgrade certifi
2) If it doesn't work, try to run a Cerificates.command that comes bundled with Python 3.* for Mac:(Go to your python installation location and double click the file)
open /Applications/Python\ 3.*/Install\ Certificates.command
Based on the previous answer,
from urllib.request import Request, urlopen
#specify url
url = 'https://xyz/xyz'
req = Request(url, headers={'User-Agent': 'XYZ/3.0'})
response = urlopen(req, timeout=20).read()
This worked for me by extending the timeout.
Definitely it's blocking because of your use of urllib based on the user agent. This same thing is happening to me with OfferUp. You can create a new class called AppURLopener which overrides the user-agent with Mozilla.
import urllib.request
class AppURLopener(urllib.request.FancyURLopener):
version = "Mozilla/5.0"
opener = AppURLopener()
response = opener.open('http://httpbin.org/user-agent')
Since the page works in browser and not when calling within python program, it seems that the web app that serves that url recognizes that you request the content not by the browser.
Demonstration:
curl --dump-header r.txt http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1
...
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access ...
</HTML>
and the content in r.txt has status line:
HTTP/1.1 403 Forbidden
Try posting header 'User-Agent' which fakes web client.
NOTE: The page contains Ajax call that creates the table you probably want to parse. You'll need to check the javascript logic of the page or simply using browser debugger (like Firebug / Net tab) to see which url you need to call to get the table's content.
Source: Stackoverflow.com