I'd like to fetch results from Google using curl to detect potential duplicate content. Is there a high risk of being banned by Google?
This question is related to
web-scraping
Google disallows automated access in their TOS, so if you accept their terms you would break them.
That said, I know of no lawsuit from Google against a scraper. Even Microsoft scraped Google, they powered their search engine Bing with it. They got caught in 2011 red handed :)
There are two options to scrape Google results:
1) Use their API
UPDATE 2020: Google has reprecated previous APIs (again) and has new prices and new limits. Now (https://developers.google.com/custom-search/v1/overview) you can query up to 10k results per day at 1,500 USD per month, more than that is not permitted and the results are not what they display in normal searches.
You can issue around 40 requests per hour You are limited to what they give you, it's not really useful if you want to track ranking positions or what a real user would see. That's something you are not allowed to gather.
If you want a higher amount of API requests you need to pay.
60 requests per hour cost 2000 USD per year, more queries require a custom deal.
2) Scrape the normal result pages
3) Alternatively use a scraping service (updated)
Google thrives on scraping websites of the world...so if it was "so illegal" then even Google won't survive ..of course other answers mention ways of mitigating IP blocks by Google. One more way to explore avoiding captcha could be scraping at random times (dint try) ..Moreover, I have a feeling, that if we provide novelty or some significant processing of data then it sounds fine at least to me...if we are simply copying a website.. or hampering its business/brand in some way...then it is bad and should be avoided..on top of it all...if you are a startup then no one will fight you as there is no benefit.. but if your entire premise is on scraping even when you are funded then you should think of more sophisticated ways...alternative APIs..eventually..Also Google keeps releasing (or depricating) fields for its API so what you want to scrap now may be in roadmap of new Google API releases..
Source: Stackoverflow.com