HTTP error 403 in Python 3 Web Scraping

Question

I was trying to scrape a website for practice, but I kept getting HTTP Error 403. Does it think I'm a bot?

Here is my code:

    #import requests
    import urllib.request
    from bs4 import BeautifulSoup
    #from urllib import urlopen
    import re

    webpage = urllib.request.urlopen('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1').read
    findrows = re.compile('<tr class="- banding(?:On|Off)>(.*?)</tr>')
    findlink = re.compile('<a href =">(.*)</a>')

    row_array = re.findall(findrows, webpage)
    links = re.findall(findlink, webpage)

    print(len(row_array))

    iterator = []

The error I get is:

      File "C:\Python33\lib\urllib\request.py", line 160, in urlopen
        return opener.open(url, data, timeout)
      File "C:\Python33\lib\urllib\request.py", line 479, in open
        response = meth(req, response)
      File "C:\Python33\lib\urllib\request.py", line 591, in http_response
        'http', request, response, code, msg, hdrs)
      File "C:\Python33\lib\urllib\request.py", line 517, in error
        return self._call_chain(*args)
      File "C:\Python33\lib\urllib\request.py", line 451, in _call_chain
        result = func(*args)
      File "C:\Python33\lib\urllib\request.py", line 599, in http_error_default
        raise HTTPError(req.full_url, code, msg, hdrs, fp)
    urllib.error.HTTPError: HTTP Error 403: Forbidden

User · Answer

Based on the previous answer:

    from urllib.request import Request, urlopen

    # specify url
    url = 'https://xyz/xyz'
    req = Request(url, headers={'User-Agent': 'XYZ/3.0'})
    response = urlopen(req, timeout=20).read()

This worked for me by extending the timeout.

User · Answer

Since the page works in a browser but not when requested from a Python program, it seems that the web app serving that URL recognizes that the request does not come from a browser.

Demonstration:

    curl --dump-header r.txt 'http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1'

    ...
    <HTML><HEAD>
    <TITLE>Access Denied</TITLE>
    </HEAD><BODY>
    <H1>Access Denied</H1>
    You don't have permission to access ...
    </HTML>

and the content in r.txt has the status line:

    HTTP/1.1 403 Forbidden

Try sending a User-Agent header that imitates a web client.

NOTE: The page contains an Ajax call that creates the table you probably want to parse. You'll need to check the JavaScript logic of the page, or simply use a browser debugger (like Firebug / Net tab), to see which URL you need to call to get the table's content.
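The same information that curl dumps into r.txt can be read from Python by catching the exception instead of letting it propagate. The following is only a sketch: summarize_http_error is a made-up helper name, and the error object is built by hand (with an example.invalid URL) so the snippet runs without touching the network.

```python
import io
from email.message import Message
from urllib.error import HTTPError


def summarize_http_error(err, max_body=200):
    """Collect the status, headers and the start of the body from an HTTPError."""
    body = err.read(max_body).decode("utf-8", errors="replace")
    return {
        "status": f"{err.code} {err.reason}",
        "headers": dict(err.headers.items()),
        "body": body,
    }


# Hand-built error that imitates the "Access Denied" response shown above.
fake_headers = Message()
fake_headers["Content-Type"] = "text/html"
fake_error = HTTPError(
    "http://example.invalid/",  # placeholder URL, never requested
    403,
    "Forbidden",
    fake_headers,
    io.BytesIO(b"<H1>Access Denied</H1>"),
)
summary = summarize_http_error(fake_error)
print(summary["status"])
```

In real code the error would come from a try/except around urlopen(...), with `except HTTPError as err:` passing err to the helper.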

User · Answer

It's definitely blocking your use of urllib based on the user agent. The same thing is happening to me with OfferUp. You can create a new class called AppURLopener which overrides the user-agent with Mozilla:

    import urllib.request

    # Note: FancyURLopener has been deprecated since Python 3.3.
    class AppURLopener(urllib.request.FancyURLopener):
        version = "Mozilla/5.0"

    opener = AppURLopener()
    response = opener.open('http://httpbin.org/user-agent')
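Since FancyURLopener has been deprecated since Python 3.3, a similar effect can probably be had with urllib.request.build_opener and its addheaders list. This is a rough sketch, not a drop-in replacement; make_opener is just an illustrative name, and the network call is left commented out.

```python
import urllib.request


def make_opener(user_agent="Mozilla/5.0"):
    """Build an opener that sends the given User-Agent on every request."""
    opener = urllib.request.build_opener()
    # Replace the default ("User-agent", "Python-urllib/x.y") entry.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


opener = make_opener()
# response = opener.open("http://httpbin.org/user-agent")  # needs network access
# urllib.request.install_opener(opener)  # optional: plain urlopen() then uses it too
```

install_opener changes the behaviour of urlopen for the whole process, so it may be better to keep the opener local when only one site needs the header.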

User · Answer

This is probably because of mod_security or some similar server security feature which blocks known spider/bot user agents (urllib uses something like python urllib/3.3.0, it's easily detected). Try setting a known browser user agent with:

    from urllib.request import Request, urlopen

    req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1', headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req).read()

This works for me.

By the way, in your code you are missing the () after .read in the urlopen line, but I think that it's a typo.

TIP: since this is an exercise, choose a different, non-restrictive site. Maybe they are blocking urllib for some reason...
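To see the mechanism without hitting a real site, here is a toy reproduction on localhost. It is only a sketch under assumptions: PickyHandler, status_for and run_demo are invented names, and the "filter" is a simple prefix check on the User-Agent, which is far cruder than whatever mod_security or the real site does.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.request import ProxyHandler, Request, build_opener


class PickyHandler(BaseHTTPRequestHandler):
    """Toy stand-in for a server-side filter that rejects urllib's default agent."""

    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        status = 403 if agent.startswith("Python-urllib") else 200
        body = b"Access Denied" if status == 403 else b"OK"
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def status_for(url, headers=None):
    """Return the HTTP status for url, whether or not urllib raises."""
    opener = build_opener(ProxyHandler({}))  # ignore any proxy settings for localhost
    try:
        with opener.open(Request(url, headers=headers or {}), timeout=5) as response:
            return response.status
    except HTTPError as err:
        return err.code


def run_demo():
    """Start the toy server, request it with and without a browser-like agent."""
    server = HTTPServer(("127.0.0.1", 0), PickyHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/"
    try:
        return status_for(url), status_for(url, {"User-Agent": "Mozilla/5.0"})
    finally:
        server.shutdown()
        server.server_close()
```

Calling run_demo() returns a pair of status codes: the first for urllib's default User-Agent, the second for the request that sets Mozilla/5.0.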

User · Answer

"This is probably because of mod_security or some similar server security feature which blocks known spider/bot user agents (urllib uses something like python urllib/3.3.0, it's easily detected)" - as already mentioned by Stefano Sanfilippo.

    from urllib.request import Request, urlopen

    url = "https://stackoverflow.com/search?q=html+error+403"
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

    web_byte = urlopen(req).read()

    webpage = web_byte.decode('utf-8')

web_byte is a bytes object returned by the server, and the content it carries is mostly encoded as utf-8. Therefore you need to decode web_byte using the decode method.

This solved the whole problem I was having while trying to scrape a website using PyCharm.

P.S. I use Python 3.4.
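If the page is not utf-8, the charset declared in the Content-Type header can be used instead of hard-coding it. A small sketch, assuming the header value is available as a string; decode_body is a hypothetical helper, not part of urllib.

```python
from email.message import Message


def decode_body(raw, content_type, fallback="utf-8"):
    """Decode raw bytes using the charset named in a Content-Type header value."""
    header = Message()
    header["Content-Type"] = content_type
    charset = header.get_content_charset() or fallback
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The server named an encoding Python does not know; use the fallback.
        return raw.decode(fallback, errors="replace")


print(decode_body("café".encode("latin-1"), "text/html; charset=ISO-8859-1"))
```

With a real response object from urlopen, response.headers.get_content_charset() gives the same charset lookup directly.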

User · Answer

If you feel guilty about faking the user-agent as Mozilla (see the comment on the top answer from Stefano), it could work with a non-urllib User-Agent as well. This worked for the sites I reference:

    import urllib.request as urlrequest

    # link holds the URL to check
    req = urlrequest.Request(link, headers={'User-Agent': 'XYZ/3.0'})
    urlrequest.urlopen(req, timeout=10).read()

My application tests the validity of specific links that I refer to in my articles by scraping them. It is not a generic scraper.
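For a link-validity check like this, it may be more useful to get the status code back than to have an exception thrown. A minimal sketch along those lines; check_link and its opener parameter are my own naming, and the parameter exists only so the function can be tried without network access.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def check_link(link, user_agent="XYZ/3.0", timeout=10, opener=urlopen):
    """Return the HTTP status for link, or None if the server was unreachable."""
    req = Request(link, headers={"User-Agent": user_agent})
    try:
        with opener(req, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # Must come before URLError: HTTPError is a subclass of it.
        return err.code
    except URLError:
        return None
```

A 403 result then means "the site refused this client" rather than "the link is dead", which is worth distinguishing when validating article links.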

User · Answer

You can try two ways.

1) Via pip:

    pip install --upgrade certifi

2) If that doesn't work, try running the Certificates.command file that comes bundled with Python 3.* for Mac (go to your Python installation location and double-click the file):

    open /Applications/Python\ 3.*/Install\ Certificates.command
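If the underlying failure is a certificate verification error, the certifi bundle can also be handed to urllib explicitly. This is a sketch that assumes certifi may or may not be installed; make_ssl_context is an illustrative name, and the request itself is left commented out.

```python
import ssl

try:
    import certifi  # third-party; installed by the pip command above
except ImportError:
    certifi = None


def make_ssl_context():
    """Return an SSL context that trusts certifi's CA bundle when it is available."""
    if certifi is not None:
        return ssl.create_default_context(cafile=certifi.where())
    return ssl.create_default_context()  # fall back to the system trust store


context = make_ssl_context()
# from urllib.request import urlopen
# urlopen("https://example.org/", context=context)  # placeholder URL, needs network
```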

User · Answer

Based on previous answers, this has worked for me with Python 3.7:

    from urllib.request import Request, urlopen

    req = Request('Url_Link', headers={'User-Agent': 'XYZ/3.0'})
    webpage = urlopen(req, timeout=10).read()

    print(webpage)
