What's the best way of scraping data from a website?

Question

I need to extract contents from a website, but the application doesn't provide any application programming interface or another mechanism to access that data programmatically.

I found a useful third-party tool called Import.io that provides click-and-go functionality for scraping web pages and building data sets. The only thing is that I want to keep my data locally and I don't want to subscribe to any subscription plans.

What kind of technique does this company use for scraping the web pages and building their datasets? I found some web scraping frameworks, pjscrape & Scrapy. Could they provide such a feature?

User · Answer

Yes, you can do it yourself. It is just a matter of grabbing the sources of the page and parsing them the way you want.

There are various possibilities. A good combo is using python-requests (a friendlier alternative to urllib2, which became urllib.request in Python 3; requests itself is built on top of urllib3) and BeautifulSoup4, which has its own methods to select elements and also permits CSS selectors:

    import requests
    from bs4 import BeautifulSoup as bs

    # Download the page and parse its HTML.
    request = requests.get("http://foo.bar")
    soup = bs(request.text, "html.parser")
    # Select every <div> carrying the CSS class "myCssClass".
    some_elements = soup.find_all("div", class_="myCssClass")

Some will prefer XPath parsing, jQuery-like pyquery, lxml or something else.

When the data you want is produced by some JavaScript, the above won't work. You need either python-ghost or Selenium. I prefer the latter combined with PhantomJS, which is much lighter, simpler to install, and easy to use:

    from selenium import webdriver
    from bs4 import BeautifulSoup as bs

    # The PhantomJS driver ships with Selenium releases before 4.0.
    client = webdriver.PhantomJS()
    client.get("http://foo")
    # page_source holds the HTML after the page's JavaScript has run.
    soup = bs(client.page_source, "html.parser")

I would advise starting with your own solution. You'll understand Scrapy's benefits by doing so.

PS: take a look at scrapely: https://github.com/scrapy/scrapely

PPS: take a look at Portia to start extracting information visually, without programming knowledge: https://github.com/scrapinghub/portia
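If installing BeautifulSoup is not an option, the "grab the source and pick elements by class" step described above can be sketched with the standard library alone. This is only an illustration, not what BeautifulSoup does internally: the class name `DivTextByClass`, the helper `extract_divs` and the sample HTML are all made up for the example, and the page is an inline string rather than a real download.

```python
from html.parser import HTMLParser


class DivTextByClass(HTMLParser):
    """Collect the text of every <div> whose class list contains css_class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # > 0 while we are inside a matching <div>
        self.buffer = []    # text fragments of the current match
        self.results = []   # text of each finished match

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1          # nested <div> inside a match
        elif self.css_class in classes:
            self.depth = 1           # a new match starts here

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:      # the matching <div> just closed
                self.results.append("".join(self.buffer).strip())
                self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def extract_divs(html, css_class):
    """Return the text content of each <div class="... css_class ...">."""
    parser = DivTextByClass(css_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical page source standing in for a downloaded document.
SAMPLE_HTML = """
<html><body>
  <div class="myCssClass">first <b>item</b></div>
  <div class="other">skip me</div>
  <div class="myCssClass extra">second <div>nested</div> item</div>
</body></html>
"""

print(extract_divs(SAMPLE_HTML, "myCssClass"))
# prints ['first item', 'second nested item']
```

Writing even this much by hand shows why a real parser with CSS selectors or XPath is the more comfortable choice once pages get messy.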

User · Answer

You will definitely want to start with a good web scraping framework. Later on you may decide that they are too limiting and you can put together your own stack of libraries, but without a lot of scraping experience your design will be much worse than pjscrape or scrapy.

Note: I use the terms crawling and scraping basically interchangeably here. This is a copy of my answer to your Quora question; it's pretty long.

Tools:

Get very familiar with either Firebug or Chrome dev tools, depending on your preferred browser. This will be absolutely necessary as you browse the site you are pulling data from and map out which urls contain the data you are looking for and what data formats make up the responses.

You will need a good working knowledge of HTTP as well as HTML and will probably want to find a decent piece of man-in-the-middle proxy software. You will need to be able to inspect HTTP requests and responses and understand how the cookies, session information and query parameters are being passed around. Fiddler (http://www.telerik.com/fiddler) and Charles Proxy (http://www.charlesproxy.com/) are popular tools. I use mitmproxy (http://mitmproxy.org/) a lot as I'm more of a keyboard guy than a mouse guy.

Some kind of console/shell/REPL type environment where you can try out various pieces of code with instant feedback will be invaluable. Reverse engineering tasks like this are a lot of trial and error, so you will want a workflow that makes this easy.

Language:

PHP is basically out; it's not well suited for this task and the library/framework support is poor in this area. Python (Scrapy is a great starting point) and Clojure/Clojurescript (incredibly powerful and productive but a big learning curve) are great languages for this problem. Since you would rather not learn a new language and you already know Javascript, I would definitely suggest sticking with JS. I have not used pjscrape but it looks quite good from a quick read of their docs. It's well suited and
implements an excellent solution to the problem I describe below.

A note on Regular expressions:

DO NOT USE REGULAR EXPRESSIONS TO PARSE HTML. A lot of beginners do this because they are already familiar with regexes. It's a huge mistake: use xpath or css selectors to navigate html, and only use regular expressions to extract data from actual text inside an html node. This might already be obvious to you (it becomes obvious quickly if you try it), but a lot of people waste a lot of time going down this road for some reason. Don't be scared of xpath or css selectors; they are WAY easier to learn than regexes and they were designed to solve this exact problem.

Javascript-heavy sites:

In the old days you just had to make an http request and parse the HTML response. Now you will almost certainly have to deal with sites that are a mix of standard HTML HTTP request/responses and asynchronous HTTP calls made by the javascript portion of the target site. This is where your proxy software and the network tab of firebug/devtools come in very handy. The responses to these might be html or they might be json; in rare cases they will be xml or something else.

There are two approaches to this problem:

The low level approach:

You can figure out what ajax urls the site javascript is calling and what those responses look like, and make those same requests yourself. So you might pull the html from http://example.com/foobar and extract one piece of data, and then have to pull the json response from http://example.com/api/baz?foo=b... to get the other piece of data. You'll need to be aware of passing the correct cookies or session parameters. It's very rare, but occasionally some required parameters for an ajax call will be the result of some crazy calculation done in the site's javascript; reverse engineering this can be annoying.

The embedded browser approach:

Why do you need to work out what data is in html and what data comes in from an ajax call? Managing all that session and
cookie data? You don't have to when you browse a site; the browser and the site javascript do that. That's the whole point.

If you just load the page into a headless browser engine like phantomjs, it will load the page, run the javascript and tell you when all the ajax calls have completed. You can inject your own javascript if necessary to trigger the appropriate clicks or whatever is necessary to trigger the site javascript to load the appropriate data.

You now have two options: get it to spit out the finished html and parse it, or inject some javascript into the page that does your parsing and data formatting and spits the data out (probably in json format). You can freely mix these two options as well.

Which approach is best?

That depends. You will need to be familiar and comfortable with the low level approach for sure. The embedded browser approach works for anything; it will be much easier to implement and will make some of the trickiest problems in scraping disappear. It's also quite a complex piece of machinery that you will need to understand. It's not just HTTP requests and responses: it's requests, embedded browser rendering, site javascript, injected javascript, your own code and 2-way interaction with the embedded browser process.

The embedded browser is also much slower at scale because of the rendering overhead, but that will almost certainly not matter unless you are scraping a lot of different domains. Your need to rate limit your requests will make the rendering time completely negligible in the case of a single domain.

Rate Limiting/Bot behaviour:

You need to be very aware of this. You need to make requests to your target domains at a reasonable rate. You need to write a well behaved bot when crawling websites, and that means respecting robots.txt and not hammering the server with requests. Mistakes or negligence here are very unethical, since this can be considered a denial of service attack. The acceptable rate varies depending on who you ask,
1 req/s is the max that the Google crawler runs at, but you are not Google and you probably aren't as welcome as Google. Keep it as slow as reasonable. I would suggest 2-5 seconds between each page request.

Identify your requests with a user agent string that identifies your bot, and have a webpage for your bot explaining its purpose. This url goes in the agent string.

You will be easy to block if the site wants to block you. A smart engineer on their end can easily identify bots, and a few minutes of work on their end can cause weeks of work changing your scraping code on your end, or just make it impossible. If the relationship is antagonistic, then a smart engineer at the target site can completely stymie a genius engineer writing a crawler. Scraping code is inherently fragile and this is easily exploited. Something that would provoke this response is almost certainly unethical anyway, so write a well behaved bot and don't worry about this.

Testing:

Not a unit/integration test person? Too bad. You will now have to become one. Sites change frequently and you will be changing your code frequently. This is a large part of the challenge.

There are a lot of moving parts involved in scraping a modern website, so good test practices will help a lot. Many of the bugs you will encounter while writing this type of code will be the type that just return corrupted data silently. Without good tests to check for regressions, you will find out that you've been saving useless corrupted data to your database for a while without noticing. This project will make you very familiar with data validation (find some good libraries to use) and testing. There are not many other problems that combine requiring comprehensive tests and being very difficult to test.

The second part of your tests involves caching and change detection. While writing your code you don't want to be hammering the server for the same page over and over again for no reason. While running your unit tests you want to
know if your tests are failing because you broke your code or because the website has been redesigned. Run your unit tests against a cached copy of the urls involved. A caching proxy is very useful here but tricky to configure and use properly.

You also do want to know if the site has changed. If they redesigned the site and your crawler is broken, your unit tests will still pass because they are running against a cached copy! You will need either another, smaller set of integration tests that are run infrequently against the live site, or good logging and error detection in your crawling code that logs the exact issues, alerts you to the problem and stops crawling. Now you can update your cache, run your unit tests and see what you need to change.

Legal Issues:

The law here can be slightly dangerous if you do stupid things. If the law gets involved, you are dealing with people who regularly refer to wget and curl as "hacking tools". You don't want this.

The ethical reality of the situation is that there is no difference between using browser software to request a url and look at some data and using your own software to request a url and look at some data. Google is the largest scraping company in the world and they are loved for it. Identifying your bot's name in the user agent and being open about the goals and intentions of your web crawler will help here, as the law understands what Google is. If you are doing anything shady, like creating fake user accounts or accessing areas of the site that you shouldn't (either "blocked" by robots.txt or because of some kind of authorization exploit), then be aware that you are doing something unethical and the law's ignorance of technology will be extraordinarily dangerous here. It's a ridiculous situation but it's a real one.

It's literally possible to try and build a new search engine on the up and up as an upstanding citizen, make a mistake or have a bug in your software and be seen as a hacker. Not something you want
considering the current political reality.

Who am I to write this giant wall of text anyway?

I've written a lot of web crawling related code in my life. I've been doing web related software development for more than a decade as a consultant, employee and startup founder. The early days were spent writing perl crawlers/scrapers and php websites, back when we were embedding hidden iframes loading csv data into webpages to do ajax before Jesse James Garrett named it ajax, before XMLHTTPRequest was an idea. Before jQuery, before json. I'm in my mid-30s; that's apparently considered ancient for this business.

I've written large scale crawling/scraping systems twice, once for a large team at a media company (in Perl) and recently for a small team as the CTO of a search engine startup (in Python/Javascript). I currently work as a consultant, mostly coding in Clojure/Clojurescript (a wonderful expert language in general, with libraries that make crawler/scraper problems a delight).

I've written successful anti-crawling software systems as well. It's remarkably easy to write nigh-unscrapable sites if you want to, or to identify and sabotage bots you don't like.

I like writing crawlers, scrapers and parsers more than any other type of software. It's challenging, fun and can be used to create amazing things.
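To make the rate limiting and robots.txt advice from this answer a bit more concrete, here is a minimal standard-library sketch of a well behaved request loop. Everything named here is an assumption for illustration: the `RateLimiter` class, the `build_robots` and `allowed_urls` helpers, the bot name and its info URL are hypothetical, and the 3 second interval is just one value inside the 2-5 second range suggested above. The robots.txt rules are passed in as text so the sketch runs without touching the network.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical bot identity; the URL should point at a page describing the bot.
USER_AGENT = "ExampleBot/0.1 (+http://example.com/bot.html)"


class RateLimiter:
    """Enforce a minimum pause between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock          # injectable so tests need not really wait
        self.sleep = sleep
        self.last_request = None    # time of the previous request, if any

    def wait(self):
        now = self.clock()
        if self.last_request is not None:
            remaining = self.min_interval - (now - self.last_request)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_request = now


def build_robots(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(urls, robots, limiter, user_agent=USER_AGENT):
    """Yield only the URLs robots.txt permits, pausing before each one."""
    for url in urls:
        if not robots.can_fetch(user_agent, url):
            continue                # respect Disallow rules
        limiter.wait()
        yield url                   # the caller performs the HTTP request here


if __name__ == "__main__":
    robots = build_robots("User-agent: *\nDisallow: /private/\n")
    limiter = RateLimiter(3.0)
    pages = ["http://example.com/a", "http://example.com/private/x"]
    for page in allowed_urls(pages, robots, limiter):
        print("would fetch", page, "as", USER_AGENT)
```

Keeping the clock and sleep functions injectable also fits the testing advice above: the pacing logic can be unit tested instantly, without waiting out real delays or hitting a live site.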
