Can scrapy be used to scrape dynamic content from websites that are using AJAX?

Question

I have recently been learning Python and am dipping my hand into building a web-scraper. It's nothing fancy at all; its only purpose is to get the data off of a betting website and have this data put into Excel.

Most of the issues are solvable and I'm having a good little mess around. However, I'm hitting a massive hurdle over one issue. If a site loads a table of horses and lists current betting prices, this information is not in any source file. The clue is that this data is live sometimes, with the numbers being updated obviously from some remote server. The HTML on my PC simply has a hole where their servers are pushing through all the interesting data that I need.

Now my experience with dynamic web content is low, so this thing is something I'm having trouble getting my head around.

I think Java or Javascript is a key; this pops up often.

The scraper is simply an odds comparison engine. Some sites have APIs but I need this for those that don't. I'm using the scrapy library with Python 2.7.

I do apologize if this question is too open-ended. In short, my question is: how can scrapy be used to scrape this dynamic data so that I can use it? So that I can scrape this betting odds data in real-time?

User · Accepted Answer

Webkit-based browsers (like Google Chrome or Safari) have built-in developer tools. In Chrome you can open them via Menu->Tools->Developer Tools. The Network tab allows you to see all information about every request and response.

In the bottom of the picture you can see that I've filtered requests down to XHR: these are requests made by javascript code.

Tip: the log is cleared every time you load a page; at the bottom of the picture, the black dot button will preserve the log.

After analyzing requests and responses you can simulate these requests from your web-crawler and extract valuable data. In many cases it will be easier to get your data this way than by parsing HTML, because that data does not contain presentation logic and is formatted to be accessed by javascript code.

Firefox has a similar extension, called Firebug. Some will argue that Firebug is even more powerful, but I like the simplicity of webkit.
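As a rough illustration of "simulating" a request seen in the Network tab, here is a minimal sketch using only the Python 3 standard library. The endpoint URL, the `runners`/`name`/`price` field names and the sample body are all hypothetical; a real site will have its own URL and JSON layout, which you read off the developer tools.

```python
import json
from urllib.request import Request


def build_xhr_request(url, referer):
    """Build (but do not send) a request that mimics a browser XHR call."""
    return Request(url, headers={
        # Header that many JavaScript libraries add to AJAX calls.
        "X-Requested-With": "XMLHttpRequest",
        "Accept": "application/json",
        "Referer": referer,
    })


def parse_odds(body):
    """Turn a JSON response body into (runner, price) rows."""
    payload = json.loads(body)
    return [(r["name"], r["price"]) for r in payload.get("runners", [])]


# Hypothetical endpoint and response body, standing in for what the Network tab shows.
req = build_xhr_request("http://example.com/api/odds", "http://example.com/")
sample_body = '{"runners": [{"name": "Red Rum", "price": 4.5}, {"name": "Arkle", "price": 2.0}]}'
rows = parse_odds(sample_body)
```

In a Scrapy spider the same idea would be a request to the XHR URL whose callback runs `json.loads` on the response body, instead of XPath on HTML.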

User · Answer

Here is a simple example of scrapy with an AJAX request. Let's look at the site rubin-kazan.ru.

All messages are loaded with an AJAX request. My goal is to fetch these messages with all their attributes (author, date, ...).

When I analyze the source code of the page I can't see all these messages because the web page uses AJAX technology. But I can use Firebug from Mozilla Firefox (or an equivalent tool in other browsers) to analyze the HTTP request that generates the messages on the web page.

It doesn't reload the whole page but only the parts of the page that contain messages. For this purpose I click an arbitrary page number at the bottom, and I observe the HTTP request that is responsible for the message body.

After that, I analyze the headers of the request (I must note that this URL is one I'll extract from the source page, from the var section; see the code below), the form data content of the request (the HTTP method is "Post"), and the content of the response, which is a JSON file that presents all the information I'm looking for.

From now on, I must implement all this knowledge in scrapy. Let's define the spider for this purpose:

    import re

    from scrapy.spider import BaseSpider
    from scrapy.http import FormRequest


    class spider(BaseSpider):
        name = 'RubiGuesst'
        start_urls = ['http://www.rubin-kazan.ru/guestbook.html']

        def parse(self, response):
            # Pull the AJAX endpoint out of the page's var section.
            # Assumed pattern: the URL is assigned as url_list_gb_messages="..."
            url_list_gb_messages = re.search(r'url_list_gb_messages="(.*)"', response.body).group(1)
            page = 0  # assumed: request the first page of messages
            # Replay the POST request the browser sends when a page number is clicked.
            yield FormRequest('http://www.rubin-kazan.ru' + url_list_gb_messages, callback=self.RubiGuessItem,
                              formdata={'page': str(page + 1), 'uid': ''})

        def RubiGuessItem(self, response):
            # The response body is the JSON document containing the messages.
            json_file = response.body

In the parse function I have the response for the first request. In RubiGuessItem I have the JSON file with all the information.
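The two steps described here (pull the endpoint out of the page source, then build the POST form data) can be tried without Scrapy or network access. This is only a sketch in Python 3: the sample page source and the `/guestbook/list.json` path are made up for illustration, and the real page may declare the variable differently.

```python
import re
from urllib.parse import urlencode, urljoin


def extract_endpoint(page_source, var_name):
    """Find a JavaScript assignment like var_name="..." and return the quoted value."""
    match = re.search(r'%s\s*=\s*"([^"]*)"' % re.escape(var_name), page_source)
    return match.group(1) if match else None


def build_post(base_url, endpoint, page):
    """Return the absolute URL and the urlencoded form body for one page of messages."""
    url = urljoin(base_url, endpoint)
    body = urlencode({"page": str(page), "uid": ""})
    return url, body


# Hypothetical fragment of the page's var section.
sample_source = 'var url_list_gb_messages="/guestbook/list.json";'
endpoint = extract_endpoint(sample_source, "url_list_gb_messages")
url, body = build_post("http://www.rubin-kazan.ru", endpoint, 2)
```

The `url` and `body` pair corresponds to what `FormRequest` assembles from its `formdata` argument.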

User · Answer

Yes, Scrapy can scrape dynamic websites (websites that are rendered through JavaScript).

There are two approaches to scraping these kinds of websites.

First, you can use splash to render the Javascript code and then parse the rendered HTML. You can find the doc and project here: Scrapy splash, git.

Second, as everyone is stating, by monitoring the network calls you can find the API call that fetches the data, and mocking that call in your scrapy spider might help you get the desired data.
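For the first approach, Splash exposes an HTTP API, and its `render.html` endpoint returns the HTML of a page after JavaScript has run. The sketch below (Python 3, standard library only) just builds such a URL; it assumes a Splash instance on its default port 8050 on localhost, and the target page URL is a placeholder.

```python
from urllib.parse import urlencode


def splash_render_url(page_url, splash_base="http://localhost:8050", wait=0.5):
    """Build a Splash render.html URL that renders page_url, waiting `wait` seconds for scripts."""
    query = urlencode({"url": page_url, "wait": wait})
    return "%s/render.html?%s" % (splash_base, query)


# Placeholder target page; fetching render_url returns the rendered HTML.
render_url = splash_render_url("http://example.com/odds")
```

A spider can request `render_url` instead of the original page and parse the response as ordinary HTML; the scrapy-splash project wraps this in a dedicated request class.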

User · Answer

Many times when crawling we run into problems where content that is rendered on the page is generated with Javascript and therefore scrapy is unable to crawl for it (e.g. ajax requests, jQuery craziness).

However, if you use Scrapy along with the web testing framework Selenium then you are able to crawl anything displayed in a normal web browser.

Some things to note:

You must have the Python version of Selenium RC installed for this to work, and you must have set up Selenium properly. Also this is just a template crawler. You could get much crazier and more advanced with things but I just wanted to show the basic idea. As the code stands now you will be doing two requests for any given url. One request is made by Scrapy and the other is made by Selenium. I am sure there are ways around this so that you could possibly just make Selenium do the one and only request, but I did not bother to implement that, and by doing two requests you get to crawl the page with Scrapy too.

This is quite powerful because now you have the entire rendered DOM available for you to crawl and you can still use all the nice crawling features in Scrapy. This will make for slower crawling of course, but depending on how much you need the rendered DOM it might be worth the wait.

    import time

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request
    from scrapy.item import Item

    from selenium import selenium

    class SeleniumSpider(CrawlSpider):
        name = "SeleniumSpider"
        start_urls = ["http://www.domain.com"]

        rules = (
            Rule(SgmlLinkExtractor(allow=(r'\.html', )), callback='parse_page', follow=True),
        )

        def __init__(self):
            CrawlSpider.__init__(self)
            self.verificationErrors = []
            # Connect to a Selenium RC server running locally on port 4444
            self.selenium = selenium("localhost", 4444, "*chrome", "http://www.domain.com")
            self.selenium.start()

        def __del__(self):
            self.selenium.stop()
            print self.verificationErrors
            CrawlSpider.__del__(self)

        def parse_page(self, response):
            item = Item()

            hxs = HtmlXPathSelector(response)
            # Do some XPath selection with Scrapy
            hxs.select('//div').extract()

            sel = self.selenium
            sel.open(response.url)

            # Wait for javascript to load in Selenium
            time.sleep(2.5)

            # Do some crawling of javascript created content with Selenium
            sel.get_text("//div")
            yield item

    # Snippet imported from snippets.scrapy.org (which no longer works)
    # author: wynbennett
    # date  : Jun 21, 2011

Reference: http://snipplr.com/view/66998
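The fixed `time.sleep(2.5)` in the template either waits too long or not long enough. One common alternative, sketched here in Python 3 with the standard library only, is to poll a condition until it holds or a timeout expires. The `wait_until` helper and the `content_loaded` stand-in are hypothetical names for this illustration; in a real spider the condition would be a check against the Selenium session.

```python
import time


def wait_until(condition, timeout=2.5, interval=0.1):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)


# Stand-in for a browser check: becomes truthy on the third poll.
calls = {"count": 0}


def content_loaded():
    calls["count"] += 1
    return "rendered" if calls["count"] >= 3 else None


result = wait_until(content_loaded, timeout=2.5, interval=0.01)
```

The crawl then proceeds as soon as the content appears, and fails loudly if it never does.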

User · Answer

I handle the ajax request by using Selenium and the Firefox web driver. It is not that fast if you need the crawler as a daemon, but much better than any manual solution. I wrote a short tutorial here for reference.

User · Answer

Another solution would be to implement a download handler or download handler middleware (see the scrapy docs for more information on downloader middleware). The following is an example class using selenium with the headless phantomjs webdriver:

1) Define the class within the middlewares.py script.

    from selenium import webdriver
    from scrapy.http import HtmlResponse

    class JsDownload(object):

        @check_spider_middleware
        def process_request(self, request, spider):
            # Render the page in PhantomJS and hand the resulting HTML back to Scrapy
            driver = webdriver.PhantomJS(executable_path=r'D:\phantomjs.exe')
            driver.get(request.url)
            return HtmlResponse(request.url, encoding='utf-8', body=driver.page_source.encode('utf-8'))

2) Add the JsDownload() class to the variable DOWNLOADER_MIDDLEWARES within settings.py:

    DOWNLOADER_MIDDLEWARES = {'MyProj.middleware.MiddleWareModule.MiddleWareClass': 500}

3) Integrate the HtmlResponse within your spider.py. Decoding the response body will get you the desired output.

    class Spider(CrawlSpider):
        # define unique name of spider
        name = "spider"

        start_urls = ["https://www.url.de"]

        def parse(self, response):
            # initialize items
            item = CrawlerItem()

            # store data as items
            item["js_enabled"] = response.body.decode("utf-8")

Optional Addon:

I wanted the ability to tell different spiders which middleware to use, so I implemented this wrapper:

    def check_spider_middleware(method):
        @functools.wraps(method)
        def wrapper(self, request, spider):
            msg = '%%s %s middleware step' % (self.__class__.__name__,)
            if self.__class__ in spider.middleware:
                spider.log(msg % 'executing', level=log.DEBUG)
                return method(self, request, spider)
            else:
                spider.log(msg % 'skipping', level=log.DEBUG)
                return None

        return wrapper

For the wrapper to work, all spiders must have at minimum:

    middleware = set([])

To include a middleware:

    middleware = set([MyProj.middleware.ModuleName.ClassName])

Advantage:

The main advantage of implementing it this way rather than in the spider is that you only end up making one request. In A T's solution, for example, the download handler processes the request and then hands off the response to the spider. The spider then makes a brand new request in its parse_page function. That's two requests for the same content.
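The per-spider opt-in can be exercised without Scrapy or PhantomJS. The sketch below (Python 3) keeps only the decision logic of the wrapper and drops the logging; `JsSpider`, `PlainSpider` and the string "response" are stand-ins invented for the illustration, not Scrapy classes.

```python
import functools


def check_spider_middleware(method):
    """Run the middleware step only for spiders that list this middleware class."""
    @functools.wraps(method)
    def wrapper(self, request, spider):
        if self.__class__ in getattr(spider, "middleware", set()):
            return method(self, request, spider)
        # Returning None from process_request lets Scrapy carry on with the normal download.
        return None
    return wrapper


class JsDownload(object):
    @check_spider_middleware
    def process_request(self, request, spider):
        # Stand-in for rendering the page in a headless browser.
        return "rendered:" + request


class JsSpider(object):
    middleware = set([JsDownload])  # opts in to JavaScript rendering


class PlainSpider(object):
    middleware = set([])  # opts out


mw = JsDownload()
rendered = mw.process_request("http://example.com", JsSpider())
skipped = mw.process_request("http://example.com", PlainSpider())
```

Only `JsSpider` gets the rendered result; for `PlainSpider` the middleware steps aside.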

User · Answer

I was using a custom downloader middleware, but wasn't very happy with it, as I didn't manage to make the cache work with it.

A better approach was to implement a custom download handler.

There is a working example here. It looks like this:

    # encoding: utf-8
    from __future__ import unicode_literals

    from scrapy import signals
    from scrapy.signalmanager import SignalManager
    from scrapy.responsetypes import responsetypes
    from scrapy.xlib.pydispatch import dispatcher
    from selenium import webdriver
    from six.moves import queue
    from twisted.internet import defer, threads
    from twisted.python.failure import Failure


    class PhantomJSDownloadHandler(object):

        def __init__(self, settings):
            self.options = settings.get('PHANTOMJS_OPTIONS', {})

            max_run = settings.get('PHANTOMJS_MAXRUN', 10)
            self.sem = defer.DeferredSemaphore(max_run)
            self.queue = queue.LifoQueue(max_run)

            SignalManager(dispatcher.Any).connect(self._close, signal=signals.spider_closed)

        def download_request(self, request, spider):
            """use semaphore to guard a phantomjs pool"""
            return self.sem.run(self._wait_request, request, spider)

        def _wait_request(self, request, spider):
            try:
                driver = self.queue.get_nowait()
            except queue.Empty:
                driver = webdriver.PhantomJS(**self.options)

            driver.get(request.url)
            # ghostdriver won't respond when switching window until the page is loaded
            dfd = threads.deferToThread(lambda: driver.switch_to.window(driver.current_window_handle))
            dfd.addCallback(self._response, driver, spider)
            return dfd

        def _response(self, _, driver, spider):
            body = driver.execute_script("return document.documentElement.innerHTML")
            if body.startswith("<head></head>"):  # cannot access response header in Selenium
                body = driver.execute_script("return document.documentElement.textContent")
            url = driver.current_url
            respcls = responsetypes.from_args(url=url, body=body[:100].encode('utf8'))
            resp = respcls(url=url, body=body, encoding="utf-8")

            response_failed = getattr(spider, "response_failed", None)
            if response_failed and callable(response_failed) and response_failed(resp, driver):
                driver.close()
                return defer.fail(Failure())
            else:
                self.queue.put(driver)
                return defer.succeed(resp)

        def _close(self):
            while not self.queue.empty():
                driver = self.queue.get_nowait()
                driver.close()

Suppose your scraper is called "scraper". If you put the mentioned code inside a file called handlers.py on the root of the "scraper" folder, then you could add to your settings.py:

    DOWNLOAD_HANDLERS = {
        'http': 'scraper.handlers.PhantomJSDownloadHandler',
        'https': 'scraper.handlers.PhantomJSDownloadHandler',
    }

And voilà, the JS parsed DOM, with scrapy cache, retries, etc.
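The handler's core idea is a bounded pool: a semaphore caps how many browsers run at once, and a LIFO queue hands back the most recently used idle one. Here is a simplified, synchronous sketch of that pattern in Python 3, using `threading.Semaphore` in place of Twisted's `DeferredSemaphore`. The `DriverPool` class and `make_driver` factory are hypothetical names, and unlike the handler above it always returns the driver to the pool rather than closing it on failure.

```python
import queue
import threading


class DriverPool(object):
    """Reuse up to max_run expensive objects (e.g. headless browsers)."""

    def __init__(self, factory, max_run=10):
        self._factory = factory
        self._sem = threading.Semaphore(max_run)  # caps concurrent jobs
        self._idle = queue.LifoQueue(max_run)     # most recently used driver first

    def run(self, job):
        with self._sem:
            try:
                driver = self._idle.get_nowait()
            except queue.Empty:
                driver = self._factory()          # create lazily when none is idle
            try:
                return job(driver)
            finally:
                self._idle.put(driver)            # hand the driver back for reuse


created = []


def make_driver():
    """Stand-in factory; a real one would start a browser."""
    driver = object()
    created.append(driver)
    return driver


pool = DriverPool(make_driver, max_run=2)
results = [pool.run(lambda d: id(d)) for _ in range(5)]
```

Run sequentially, the five jobs share a single driver, which is the saving the pool is meant to provide over starting a browser per request.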

User · Answer

"how can scrapy be used to scrape this dynamic data so that I can use it?"

I wonder why no one has posted the solution using Scrapy only.

Check out the blog post from the Scrapy team, SCRAPING INFINITE SCROLLING PAGES. The example scrapes the http://spidyquotes.herokuapp.com/scroll website, which uses infinite scrolling.

The idea is to use the Developer Tools of your browser and notice the AJAX requests, then based on that information create the requests for Scrapy.

    import json
    import scrapy


    class SpidyQuotesSpider(scrapy.Spider):
        name = 'spidyquotes'
        quotes_base_url = 'http://spidyquotes.herokuapp.com/api/quotes?page=%s'
        start_urls = [quotes_base_url % 1]
        download_delay = 1.5

        def parse(self, response):
            # The API answers with JSON, so no HTML parsing is needed
            data = json.loads(response.body)
            for item in data.get('quotes', []):
                yield {
                    'text': item.get('text'),
                    'author': item.get('author', {}).get('name'),
                    'tags': item.get('tags'),
                }
            # Follow the pagination flag exposed by the API
            if data['has_next']:
                next_page = data['page'] + 1
                yield scrapy.Request(self.quotes_base_url % next_page)
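The pagination logic of this spider can be checked offline. The sketch below (Python 3) follows the same `has_next` and `page` fields against a fake in-memory API; the `crawl_pages` helper and the `FAKE_API` payloads are invented for the illustration and only assume the JSON shape used in the spider above.

```python
import json


def crawl_pages(fetch, first_page=1):
    """Yield quotes page by page until the API reports there is no next page."""
    page = first_page
    while True:
        data = json.loads(fetch(page))
        for item in data.get("quotes", []):
            yield {
                "text": item.get("text"),
                "author": item.get("author", {}).get("name"),
            }
        if not data.get("has_next"):
            break
        page = data["page"] + 1


# Fake responses keyed by page number, standing in for the JSON endpoint.
FAKE_API = {
    1: '{"page": 1, "has_next": true, "quotes": [{"text": "A", "author": {"name": "X"}}]}',
    2: '{"page": 2, "has_next": false, "quotes": [{"text": "B", "author": {"name": "Y"}}]}',
}

quotes = list(crawl_pages(FAKE_API.__getitem__))
```

Swapping `FAKE_API.__getitem__` for a function that downloads the real URL gives the same traversal the spider performs through `scrapy.Request`.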
