How to avoid HTTP error 429 (Too Many Requests) python

Question

I am trying to use Python to login to a website and gather information from several webpages and I get the following error:

    Traceback (most recent call last):
      File "extract_test.py", line 43, in <module>
        response=br.open(v)
      File "/usr/local/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 203, in open
        return self._mech_open(url, data, timeout=timeout)
      File "/usr/local/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 255, in _mech_open
        raise response
    mechanize._response.httperror_seek_wrapper: HTTP Error 429: Unknown Response Code

I used time.sleep() and it works, but it seems unintelligent and unreliable. Is there any other way to dodge this error?

Here's my code:

    import mechanize
    import cookielib
    import re

    first = ("example.com/page1")
    second = ("example.com/page2")
    third = ("example.com/page3")
    fourth = ("example.com/page4")
    ## I have seven URL's I want to open

    urls_list = [first, second, third, fourth]

    br = mechanize.Browser()
    # Cookie Jar
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)

    # Browser options
    br.set_handle_equiv(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)

    # Log in credentials
    br.open("example.com")
    br.select_form(nr=0)
    br["username"] = "username"
    br["password"] = "password"
    br.submit()

    for url in urls_list:
        br.open(url)
        print re.findall("Some String")

User · Answer

I've found a nice workaround to IP blocking when scraping sites. It lets you run a scraper indefinitely by running it from Google App Engine and redeploying it automatically when you get a 429. Check out this article.

User · Answer

Receiving a status 429 is not an error; it is the other server "kindly" asking you to please stop spamming requests. Obviously, your rate of requests has been too high and the server is not willing to accept this.

You should not seek to "dodge" this, or even try to circumvent server security settings by trying to spoof your IP. You should simply respect the server's answer by not sending too many requests.

If everything is set up properly, you will also have received a "Retry-after" header along with the 429 response. This header specifies the number of seconds you should wait before making another call. The proper way to deal with this "problem" is to read this header and to sleep your process for that many seconds.

You can find more information on status 429 here: http://tools.ietf.org/html/rfc6585#page-3
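A minimal sketch of reading that header, using only the standard library. The function name `retry_after_seconds` and its `now` parameter are hypothetical and not part of the answer above. It also assumes the server may send an HTTP date instead of a number of seconds, which the HTTP specification permits for this header.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Convert a Retry-After header value into a wait time in seconds.

    Returns None when the header is missing or cannot be parsed, so the
    caller can fall back to its own default pause.
    """
    if value is None:
        return None
    value = value.strip()
    # Common case: a plain number of seconds, e.g. "120".
    if value.isdigit():
        return float(value)
    # Otherwise try an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means there is nothing left to wait for.
    return max(0.0, (when - now).total_seconds())
```

A caller would pass the header value from its HTTP library's response object, then hand the result to `time.sleep()` when it is not None.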

User · Answer

As MRA said, you shouldn't try to dodge a 429 Too Many Requests but instead handle it accordingly. You have several options depending on your use-case:

1. Sleep your process. The server usually includes a Retry-after header in the response with the number of seconds you are supposed to wait before retrying. Keep in mind that sleeping a process might cause problems, e.g. in a task queue, where you should instead retry the task at a later time to free up the worker for other things.

2. Exponential backoff. If the server does not tell you how long to wait, you can retry your request using increasing pauses in between. The popular task queue Celery has this feature built in.

3. Token bucket. This technique is useful if you know in advance how many requests you are able to make in a given time. Each time you access the API you first fetch a token from the bucket. The bucket is refilled at a constant rate. If the bucket is empty, you know you'll have to wait before hitting the API again. Token buckets are usually implemented on the other end (the API), but you can also use them as a proxy to avoid ever getting a 429 Too Many Requests. Celery's rate_limit feature uses a token bucket algorithm.

Here is an example of a Python/Celery app using exponential backoff and rate-limiting/token bucket:

    import requests
    from requests.exceptions import ConnectTimeout

    # `task` is assumed to be a Celery task decorator (for example app.task).

    class TooManyRequests(Exception):
        """Too many requests"""

    @task(
        rate_limit='10/s',                                  # token bucket: at most 10 calls per second
        autoretry_for=(ConnectTimeout, TooManyRequests,),   # retry automatically on these exceptions
        retry_backoff=True)                                 # exponential backoff between retries
    def api(*args, **kwargs):
        r = requests.get('placeholder-external-api')

        if r.status_code == 429:
            raise TooManyRequests()
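For readers not using Celery, options 2 and 3 can be sketched in plain Python. This is an illustration under stated assumptions, not how Celery implements either feature. The names `backoff_delays` and `TokenBucket`, and all of their parameters, are hypothetical.

```python
import random
import time


def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=5, jitter=False):
    """Yield the pause before each retry: base, base*factor, ... up to cap."""
    for attempt in range(attempts):
        delay = min(cap, base * factor ** attempt)
        if jitter:
            # Randomizing the pause spreads out clients throttled together.
            delay = random.uniform(0, delay)
        yield delay


class TokenBucket:
    """Allow bursts of up to `capacity` calls, refilled at `rate` tokens/second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock              # injectable so the bucket can be tested
        self.tokens = float(capacity)   # start with a full bucket
        self.updated = clock()

    def _refill(self):
        # Add the tokens earned since the last update, never above capacity.
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self):
        """Take one token if available and report whether that succeeded."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def wait_time(self):
        """Seconds until the next token is available (0.0 if one is ready)."""
        self._refill()
        if self.tokens >= 1.0:
            return 0.0
        return (1.0 - self.tokens) / self.rate
```

In a scraper loop you would call `try_acquire()` before each request and sleep for `wait_time()` when it returns False. `backoff_delays()` supplies the pauses to use after a 429 that carries no Retry-after header.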

User · Answer

Another workaround would be to spoof your IP using some sort of public VPN or the Tor network. This assumes the rate limiting on the server is applied at the IP level.

There is a brief blog post demonstrating a way to use Tor along with urllib2:

http://blog.flip-edesign.com/?p=119

User · Answer

    if response.status_code == 429:
        time.sleep(int(response.headers["Retry-After"]))
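The snippet above raises a KeyError when the server omits the header and handles only a single response. A hedged sketch of the same idea as a retry loop follows. `get_with_retry` and its `fetch` argument are hypothetical: `fetch` is assumed to be a zero-argument callable that returns a `(status, headers, body)` tuple, wrapping whichever HTTP library you use.

```python
import time


def get_with_retry(fetch, max_attempts=5, default_wait=1, sleep=time.sleep):
    """Call fetch() until it stops answering 429 or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        status, headers, body = fetch()
        if status != 429:
            return status, body
        if attempt == max_attempts:
            break  # no point sleeping after the final attempt
        try:
            # Use the server's hint when present, else a default pause.
            wait = int(headers.get("Retry-After", default_wait))
        except ValueError:
            # The header was not a plain number of seconds.
            wait = default_wait
        sleep(wait)
    raise RuntimeError("still rate limited after %d attempts" % max_attempts)
```

Passing `sleep` as a parameter keeps the loop testable without real pauses. In a task queue you would reschedule the task at this point instead of sleeping.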

User · Answer

Writing this piece of code fixed my problem:

    requests.get(link, headers={'User-agent': 'your bot 0.1'})

User · Answer

In many cases, continuing to scrape data from a website even when the server is requesting you not to is unethical. However, in the cases where it isn't, you can utilize a list of public proxies in order to scrape a website with many different IP addresses.
