[api] What's the best way of scraping data from a website?

I need to extract contents from a website, but the application doesn’t provide any application programming interface or another mechanism to access that data programmatically.

I found a useful third-party tool called Import.io that provides click and go functionality for scraping web pages and building data sets, the only thing is I want to keep my data locally and I don't want to subscribe to any subscription plans.

What kind of technique does this company use for scraping the web pages and building their datasets? I found some web scraping frameworks pjscrape & Scrapy could they provide such a feature

This question is related to api web-scraping screen-scraping

The answer is


Yes you can do it yourself. It is just a matter of grabbing the sources of the page and parsing them the way you want.

There are various possibilities. A good combo is using python-requests (built on top of urllib2, it is urllib.request in Python3) and BeautifulSoup4, which has its methods to select elements and also permits CSS selectors:

import requests
from BeautifulSoup4 import BeautifulSoup as bs
request = requests.get("http://foo.bar")
soup = bs(request.text) 
some_elements = soup.find_all("div", class_="myCssClass")

Some will prefer xpath parsing or jquery-like pyquery, lxml or something else.

When the data you want is produced by some JavaScript, the above won't work. You either need python-ghost or Selenium. I prefer the latter combined with PhantomJS, much lighter and simpler to install, and easy to use:

from selenium import webdriver
client = webdriver.PhantomJS()
client.get("http://foo")
soup = bs(client.page_source)

I would advice to start your own solution. You'll understand Scrapy's benefits doing so.

ps: take a look at scrapely: https://github.com/scrapy/scrapely

pps: take a look at Portia, to start extracting information visually, without programming knowledge: https://github.com/scrapinghub/portia


Examples related to api

I am receiving warning in Facebook Application using PHP SDK Couldn't process file resx due to its being in the Internet or Restricted zone or having the mark of the web on the file Failed to load resource: the server responded with a status of 404 (Not Found) css Call another rest api from my server in Spring-Boot How to send custom headers with requests in Swagger UI? This page didn't load Google Maps correctly. See the JavaScript console for technical details How can I send a Firebase Cloud Messaging notification without use the Firebase Console? Allow Access-Control-Allow-Origin header using HTML5 fetch API How to send an HTTP request with a header parameter? Laravel 5.1 API Enable Cors

Examples related to web-scraping

Scraping: SSL: CERTIFICATE_VERIFY_FAILED error for http://en.wikipedia.org How to print an exception in Python 3? What should I use to open a url instead of urlopen in urllib3 Use Excel VBA to click on a button in Internet Explorer, when the button has no "name" associated How to use Python requests to fake a browser visit a.k.a and generate User Agent? Scraping data from website using vba Using BeautifulSoup to extract text without tags Is it ok to scrape data from Google results? What's the best way of scraping data from a website? Use getElementById on HTMLElement instead of HTMLDocument

Examples related to screen-scraping

What's the best way of scraping data from a website? Executing Javascript from Python Can scrapy be used to scrape dynamic content from websites that are using AJAX? How I can get web page's content and save it into the string variable How do I prevent site scraping? Web scraping with Python