Web-scraping JavaScript page with Python

Question

I'm trying to develop a simple web scraper. I want to extract text without the HTML code. I have achieved this, but on some pages where JavaScript is loaded I don't get good results.

For example, if some JavaScript code adds some text, I can't see it, because when I call

    response = urllib2.urlopen(request)

I get the original text without the added one (because JavaScript is executed in the client).

So, I'm looking for some ideas to solve this problem.

User · Answer

EDIT 30/Dec/2017: This answer appears in top results of Google searches, so I decided to update it. The old answer is still at the end.

dryscrape isn't maintained anymore and the library the dryscrape developers recommend is Python 2 only. I have found using Selenium's Python library with PhantomJS as a web driver fast enough and easy to get the work done.

Once you have installed PhantomJS, make sure the phantomjs binary is available in the current path:

    phantomjs --version
    # result:
    2.1.1

Example

To give an example, I created a sample page with the following HTML code:

    <!DOCTYPE html>
    <html>
    <head>
      <meta charset="utf-8">
      <title>Javascript scraping test</title>
    </head>
    <body>
      <p id='intro-text'>No javascript support</p>
      <script>
         document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
      </script>
    </body>
    </html>

Without JavaScript it says "No javascript support" and with JavaScript "Yay! Supports javascript".

Scraping without JS support:

    import requests
    from bs4 import BeautifulSoup

    response = requests.get(my_url)
    soup = BeautifulSoup(response.text, "html.parser")
    soup.find(id="intro-text")
    # Result:
    # <p id="intro-text">No javascript support</p>

Scraping with JS support:

    from selenium import webdriver

    driver = webdriver.PhantomJS()
    driver.get(my_url)
    p_element = driver.find_element_by_id(id_='intro-text')
    print(p_element.text)
    # result:
    # 'Yay! Supports javascript'

You can also use the Python library dryscrape to scrape JavaScript-driven websites.

Scraping with JS support:

    import dryscrape
    from bs4 import BeautifulSoup

    session = dryscrape.Session()
    session.visit(my_url)
    response = session.body()
    soup = BeautifulSoup(response, "html.parser")
    soup.find(id="intro-text")
    # Result:
    # <p id="intro-text">Yay! Supports javascript</p>
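PhantomJS has since stopped being developed and recent Selenium releases no longer ship a PhantomJS driver or the find_element_by_* helpers, so the snippet above may not run on a current install. What follows is a minimal sketch of the same comparison, not the original author's code. It assumes Selenium 4 with Firefox and geckodriver available. The function names static_text and rendered_text are made up for illustration. The static half uses only the standard library and runs offline against the sample page from this answer.

```python
from html.parser import HTMLParser

# The sample page from this answer, inlined so the static half runs offline.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Javascript scraping test</title>
</head>
<body>
  <p id="intro-text">No javascript support</p>
  <script>
    document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
  </script>
</body>
</html>"""


class _TextById(HTMLParser):
    """Collects the text inside the first element that has the given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0        # greater than 0 while inside the target element
        self.chunks = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Nested tag inside the target (void tags are not special-cased).
            self.depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def static_text(html, element_id):
    """Text of the element as served, i.e. before any script has run."""
    parser = _TextById(element_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


def rendered_text(url, element_id):
    """Text of the element after headless Firefox has executed the page's scripts."""
    # Imported lazily so static_text works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        return driver.find_element(By.ID, element_id).text
    finally:
        driver.quit()


if __name__ == "__main__":
    print(static_text(SAMPLE_HTML, "intro-text"))
```

Calling rendered_text(my_url, "intro-text") against the hosted sample page should return the script-modified text instead, provided the browser and driver are installed.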

User · Answer

We are not getting the correct results because any JavaScript-generated content needs to be rendered on the DOM. When we fetch an HTML page, we fetch the initial DOM, unmodified by JavaScript.

Therefore we need to render the JavaScript content before we crawl the page.

As Selenium is already mentioned many times in this thread (and how slow it sometimes gets was mentioned also), I will list two other possible solutions.

Solution 1: This is a very nice tutorial on how to use Scrapy to crawl JavaScript-generated content and we are going to follow just that.

What we will need:

Docker installed on our machine. This is a plus over other solutions until this point, as it utilizes an OS-independent platform.

Install Splash following the instructions listed for our corresponding OS. Quoting from the Splash documentation:

    Splash is a javascript rendering service. It's a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and QT5.

Essentially we are going to use Splash to render JavaScript-generated content.

Run the Splash server:

    sudo docker run -p 8050:8050 scrapinghub/splash

Install the scrapy-splash plugin:

    pip install scrapy-splash

Assuming that we already have a Scrapy project created (if not, let's make one), we will follow the guide and update settings.py:

Then go to your Scrapy project's settings.py and set these middlewares:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }

The URL of the Splash server (if you're using Windows or OS X this should be the URL of the Docker machine; see "How to get a Docker container's IP address from the host?"):

    SPLASH_URL = 'http://localhost:8050'

And finally you need to set these values too:

    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Finally, we can use a SplashRequest:

In a normal spider you have Request objects which you can use to open URLs. If the page you want to open contains JS-generated data you have to use SplashRequest (or SplashFormRequest) to render the page. Here's a simple example:

    import scrapy
    from scrapy_splash import SplashRequest

    # QuoteItem is assumed to be defined in the project's items.py.

    class MySpider(scrapy.Spider):
        name = "jsscraper"
        start_urls = ["http://quotes.toscrape.com/js/"]

        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(
                    url=url, callback=self.parse, endpoint='render.html'
                )

        def parse(self, response):
            for q in response.css("div.quote"):
                quote = QuoteItem()
                quote["author"] = q.css(".author::text").extract_first()
                quote["quote"] = q.css(".text::text").extract_first()
                yield quote

SplashRequest renders the URL as HTML and returns the response, which you can use in the callback (parse) method.

Solution 2: Let's call this experimental at the moment (May 2018). This solution is for Python version 3.6 only (at the moment).

Do you know the requests module (well, who doesn't)? Now it has a web-crawling little sibling: requests-HTML.

    This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.

Install requests-html:

    pipenv install requests-html

Make a request to the page's URL:

    from requests_html import HTMLSession

    session = HTMLSession()
    r = session.get(a_page_url)

Render the response to get the JavaScript-generated bits:

    r.html.render()

Finally, the module seems to offer scraping capabilities. Alternatively, we can try the well-documented way of using BeautifulSoup with the r.html object we just rendered.
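For Solution 1, the Splash server started with Docker can also be queried directly over its HTTP API, which is a quick way to check that rendering works before wiring up Scrapy. The sketch below assumes Splash is listening on http://localhost:8050 as in the settings above. It uses the render.html endpoint with its url and wait arguments. The helper names are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def splash_render_url(page_url, splash_url="http://localhost:8050", wait=0.5):
    """Build the render.html URL that asks Splash to load and render page_url."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_url.rstrip('/')}/render.html?{query}"


def fetch_rendered_html(page_url, **kwargs):
    """Return the HTML of page_url after Splash has executed its JavaScript."""
    with urlopen(splash_render_url(page_url, **kwargs), timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Only builds the URL; calling fetch_rendered_html needs a running Splash server.
    print(splash_render_url("http://quotes.toscrape.com/js/"))
```

The wait value gives scripts time to run before Splash takes the snapshot. Half a second is an arbitrary starting point, not a recommended setting.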

User · Answer

I've been trying to find an answer to this question for two days. Many answers direct you to different issues. But serpentr's answer (the one that executes JavaScript through the webdriver) is really to the point. It is the shortest, simplest solution. Just a reminder: the last word of the script must be the name of the variable you declared (here text), not the literal word var, so it should be used as:

    result = driver.execute_script('var text = document.title; return text')

User · Answer

Maybe Selenium can do it.

    from selenium import webdriver
    import time

    driver = webdriver.Firefox()
    driver.get(url)
    time.sleep(5)
    htmlSource = driver.page_source
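A fixed time.sleep(5) either wastes time or is too short on a slow page. Selenium ships WebDriverWait for this. The sketch below only illustrates the underlying idea with a generic polling helper. The name wait_until and its defaults are made up for illustration and are not part of Selenium.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.25):
    """Call predicate repeatedly until it returns a truthy value, then return it.

    Raises TimeoutError if the deadline passes without a truthy result.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


if __name__ == "__main__":
    # Stand-in for a page that becomes ready on the third check.
    checks = iter([None, None, "ready"])
    print(wait_until(lambda: next(checks), timeout=2, interval=0.01))
```

With a driver in hand, something like wait_until(lambda: driver.find_elements(By.ID, "some-id")) would return as soon as the element appears. Here "some-id" is a placeholder and By comes from selenium.webdriver.common.by.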

User · Answer

You'll want to use urllib, requests, BeautifulSoup and the Selenium web driver in your script for different parts of the page (to name a few). Sometimes you'll get what you need with just one of these modules. Sometimes you'll need two, three, or all of these modules. Sometimes you'll need to switch off the JS in your browser. Sometimes you'll need header info in your script.

No two websites can be scraped the same way, and no website can be scraped in the same way forever without having to modify your crawler, usually after a few months. But they can all be scraped! Where there's a will there's a way, for sure.

If you need scraped data continuously into the future, just scrape everything you need and store it in .dat files with pickle. Just keep searching for how to try what with these modules, and keep copying and pasting your errors into Google.
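The "store it in .dat files with pickle" advice can be sketched as follows. The function names, file name and record layout are made up for illustration. Note that pickle files should only be loaded if you wrote them yourself, since unpickling untrusted data can execute arbitrary code.

```python
import os
import pickle
import tempfile


def save_scrape(records, path):
    """Write scraped records to path, replacing any earlier snapshot."""
    with open(path, "wb") as fh:
        pickle.dump(records, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_scrape(path):
    """Read records back. Only unpickle files you created yourself."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


if __name__ == "__main__":
    # Hypothetical records; in practice these come from your scraper.
    records = [{"url": "https://www.example.com", "text": "Yay! Supports javascript"}]
    path = os.path.join(tempfile.mkdtemp(), "scrape.dat")
    save_scrape(records, path)
    print(load_scrape(path) == records)
```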

User · Answer

Selenium is the best for scraping JS and Ajax content.

Check this article for extracting data from the web using Python.

    pip install selenium

Then download the Chrome webdriver.

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    browser = webdriver.Chrome()
    browser.get("https://www.python.org/")
    # find_element(By.ID, ...) replaces the older find_element_by_id helper
    nav = browser.find_element(By.ID, "mainnav")
    print(nav.text)

Easy, right?

User · Answer

This seems to be a good solution also, taken from a great blog post:

    import sys
    from PyQt4.QtGui import *
    from PyQt4.QtCore import *
    from PyQt4.QtWebKit import *
    from lxml import html

    # Take this class for granted. Just use the result of rendering.
    class Render(QWebPage):
        def __init__(self, url):
            self.app = QApplication(sys.argv)
            QWebPage.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            self.mainFrame().load(QUrl(url))
            self.app.exec_()

        def _loadFinished(self, result):
            self.frame = self.mainFrame()
            self.app.quit()

    url = 'http://pycoders.com/archive/'
    r = Render(url)
    result = r.frame.toHtml()
    # This step is important: converting QString to ASCII for lxml to process.

    # The following returns an lxml element tree
    archive_links = html.fromstring(str(result.toAscii()))
    print(archive_links)

    # The following returns an array containing the URLs
    raw_links = archive_links.xpath('//div[@class="campaign"]/a/@href')
    print(raw_links)

User · Answer

A mix of BeautifulSoup and Selenium works very well for me.

    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from bs4 import BeautifulSoup as bs

    driver = webdriver.Firefox()
    driver.get("http://somedomain/url_that_delays_loading")
    try:
        # Waits up to 10 seconds until the element is located. Other wait
        # conditions exist, such as visibility_of_element_located or
        # text_to_be_present_in_element.
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "myDynamicElement")))

        html = driver.page_source
        soup = bs(html, "lxml")
        # or other attributes, optional
        dynamic_text = soup.find_all("p", {"class": "class_name"})
    except TimeoutException:
        # until() raises TimeoutException when the element never appears
        print("Couldn't locate element")

P.S. You can find more wait conditions here.

User · Answer

As mentioned, Selenium is a good choice for rendering the results of the JavaScript:

    from selenium.webdriver import Firefox
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.headless = True
    browser = Firefox(executable_path="/usr/local/bin/geckodriver", options=options)

    url = "https://www.example.com"
    browser.get(url)

And gazpacho is a really easy library to parse over the rendered HTML:

    from gazpacho import Soup

    soup = Soup(browser.page_source)
    soup.find("a").attrs['href']

User · Answer

I personally prefer using Scrapy and Selenium and dockerizing both in separate containers. This way you can install both with minimal hassle and crawl modern websites, almost all of which contain JavaScript in one form or another. Here's an example:

Use scrapy startproject to create your scraper and write your spider. The skeleton can be as simple as this:

    import scrapy


    class MySpider(scrapy.Spider):
        name = 'my_spider'
        start_urls = ['https://somewhere.com']

        def start_requests(self):
            yield scrapy.Request(url=self.start_urls[0])

        def parse(self, response):
            # do stuff with results, scrape items etc.
            # now we're just checking everything worked
            print(response.body)

The real magic happens in middlewares.py. Overwrite two methods in the downloader middleware, __init__ and process_request, in the following way:

    # import some additional modules that we need
    import os
    from copy import deepcopy
    from time import sleep

    from scrapy import signals
    from scrapy.http import HtmlResponse
    from selenium import webdriver

    class SampleProjectDownloaderMiddleware(object):

        def __init__(self):
            SELENIUM_LOCATION = os.environ.get('SELENIUM_LOCATION', 'NOT_HERE')
            SELENIUM_URL = f'http://{SELENIUM_LOCATION}:4444/wd/hub'
            chrome_options = webdriver.ChromeOptions()

            # chrome_options.add_experimental_option("mobileEmulation", mobile_emulation)
            self.driver = webdriver.Remote(command_executor=SELENIUM_URL,
                                           desired_capabilities=chrome_options.to_capabilities())

        def process_request(self, request, spider):

            self.driver.get(request.url)

            # sleep a bit so the page has time to load
            # or monitor items on page to continue as soon as page ready
            sleep(4)

            # if you need to manipulate the page content like clicking and scrolling, you do it here
            # self.driver.find_element_by_css_selector('.my-class').click()

            # you only need the now properly and completely rendered html from your page to get results
            body = deepcopy(self.driver.page_source)

            # copy the current url in case of redirects
            url = deepcopy(self.driver.current_url)

            return HtmlResponse(url, body=body, encoding='utf-8', request=request)

Don't forget to enable this middleware by uncommenting the next lines in the settings.py file:

    DOWNLOADER_MIDDLEWARES = {
        'sample_project.middlewares.SampleProjectDownloaderMiddleware': 543,
    }

Next, for dockerization. Create your Dockerfile from a lightweight image (I'm using Python Alpine here), copy your project directory to it, and install the requirements:

    # Use an official Python runtime as a parent image
    FROM python:3.6-alpine

    # install some packages necessary to scrapy and then curl because it's handy for debugging
    RUN apk --update add linux-headers libffi-dev openssl-dev build-base libxslt-dev libxml2-dev curl python-dev

    WORKDIR /my_scraper

    ADD requirements.txt /my_scraper/

    RUN pip install -r requirements.txt

    ADD . /scrapers

And finally bring it all together in docker-compose.yaml:

    version: '2'
    services:
      selenium:
        image: selenium/standalone-chrome
        ports:
          - "4444:4444"
        shm_size: 1G

      my_scraper:
        build: .
        depends_on:
          - "selenium"
        environment:
          - SELENIUM_LOCATION=samplecrawler_selenium_1
        volumes:
          - .:/my_scraper
        # use this command to keep the container running
        command: tail -f /dev/null

Run docker-compose up -d. If you're doing this for the first time, it will take a while to fetch the latest selenium/standalone-chrome image and to build your scraper image as well.

Once it's done, you can check that your containers are running with docker ps, and also check that the name of the selenium container matches the environment variable that we passed to our scraper container (here, it was SELENIUM_LOCATION=samplecrawler_selenium_1).

Enter your scraper container with docker exec -ti YOUR_CONTAINER_NAME sh (the command for me was docker exec -ti samplecrawler_my_scraper_1 sh), cd into the right directory and run your scraper with scrapy crawl my_spider.

The entire thing is on my GitHub page and you can get it from here.

User · Answer

You can also execute JavaScript using the webdriver.

    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get(url)
    # the script needs an explicit return for a value to come back to Python
    driver.execute_script('return document.title')

or store the value in a variable:

    result = driver.execute_script('var text = document.title; return text')

User · Answer

I recently used the requests_html library to solve this problem.

Their expanded documentation at readthedocs.io is pretty good (skip the annotated version at pypi.org). If your use case is basic, you are likely to have some success.

    from requests_html import HTMLSession

    session = HTMLSession()
    # requests needs the scheme (https://) in the URL
    response = session.request(method="get", url="https://www.google.com/")
    response.html.render()

If you are having trouble rendering the data you need with response.html.render(), you can pass some JavaScript to the render function to render the particular JS object you need. This is copied from their docs, but it might be just what you need:

    If script is specified, it will execute the provided JavaScript at runtime. Example:

    script = """
        () => {
            return {
                width: document.documentElement.clientWidth,
                height: document.documentElement.clientHeight,
                deviceScaleFactor: window.devicePixelRatio,
            }
        }
    """

    Returns the return value of the executed script, if any is provided:

    >>> response.html.render(script=script)
    {'width': 800, 'height': 600, 'deviceScaleFactor': 1}

In my case, the data I wanted were the arrays that populated a JavaScript plot, but the data wasn't getting rendered as text anywhere in the HTML. Sometimes it's not clear at all what the object names are for the data you want if the data is populated dynamically. If you can't track down the JS objects directly from view source or inspect, you can type "window" followed by ENTER in the debugger console in the browser (Chrome) to pull up a full list of objects rendered by the browser. If you make a few educated guesses about where the data is stored, you might have some luck finding it there. My graph data was under window.view.data in the console, so in the "script" variable passed to the .render() method quoted above, I used:

    return {
        data: window.view.data
    }

User · Answer

Using PyQt5:

    from PyQt5.QtWidgets import QApplication
    from PyQt5.QtCore import QUrl
    from PyQt5.QtWebEngineWidgets import QWebEnginePage
    import sys
    import bs4 as bs
    import urllib.request


    class Client(QWebEnginePage):
        def __init__(self, url):
            global app
            self.app = QApplication(sys.argv)
            QWebEnginePage.__init__(self)
            self.html = ""
            self.loadFinished.connect(self.on_load_finished)
            self.load(QUrl(url))
            self.app.exec_()

        def on_load_finished(self):
            # toHtml is asynchronous: the HTML is delivered to self.Callable
            self.html = self.toHtml(self.Callable)
            print("Load Finished")

        def Callable(self, data):
            self.html = data
            self.app.quit()

    url = ""
    client_response = Client(url)
    print(client_response.html)

User · Answer

If you have ever used the Requests module for Python before, I recently found out that the developer created a new module called Requests-HTML, which now also has the ability to render JavaScript.

You can also visit https://html.python-requests.org/ to learn more about this module, or if you're only interested in rendering JavaScript you can visit https://html.python-requests.org/#javascript-support to learn directly how to use the module to render JavaScript using Python.

Essentially, once you correctly install the Requests-HTML module, the following example, which is shown on the above link, shows how you can use this module to scrape a website and render JavaScript contained within the website:

    from requests_html import HTMLSession
    session = HTMLSession()

    r = session.get('http://python-requests.org/')

    r.html.render()

    r.html.search('Python 2 will retire in only {months} months!')['months']

    '<time>25</time>'  # This is the result.

I recently learnt about this from a YouTube video, which demonstrates how the module works.

User · Answer

It sounds like the data you're really looking for can be accessed via a secondary URL called by some JavaScript on the primary page.

While you could try running JavaScript on the server to handle this, a simpler approach might be to load up the page using Firefox and use a tool like Charles or Firebug to identify exactly what that secondary URL is. Then you can just query that URL directly for the data you are interested in.
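Once the secondary URL has been identified (the Network tab of the browser's developer tools shows the same information as Charles or Firebug), querying it directly can be as small as the sketch below. It assumes the endpoint returns JSON, which is common but not guaranteed. The name fetch_json is made up for illustration, and the data: URL in the demo only stands in for a real endpoint so the example runs offline.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """Fetch url and decode the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # A data: URL standing in for the real secondary URL found in the browser.
    demo_url = "data:application/json,%7B%22points%22%3A%5B1%2C2%5D%7D"
    print(fetch_json(demo_url))
```

Some endpoints also expect the headers or cookies the page sends, in which case they need to be copied from the request shown in the browser.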
