[python] How to use Python requests to fake a browser visit a.k.a and generate User Agent?

I want to get the content from this website.

If I use a browser like Firefox or Chrome I could get the real website page I want, but if I use the Python requests package (or wget command) to get it, it returns a totally different HTML page.

I thought the developer of the website had made some blocks for this.

Question

How do I fake a browser visit by using python requests or command wget?

This question is related to python web-scraping python-requests wget user-agent

The answer is


Provide a User-Agent header:

import requests

url = 'http://www.ichangtou.com/#company:data_000008.html'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = requests.get(url, headers=headers)
print(response.content)

FYI, here is a list of User-Agent strings for different browsers:


As a side note, there is a pretty useful third-party package called fake-useragent that provides a nice abstraction layer over user agents:

fake-useragent

Up to date simple useragent faker with real world database

Demo:

>>> from fake_useragent import UserAgent
>>> ua = UserAgent()
>>> ua.chrome
u'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36'
>>> ua.random
u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36'

This is how, I have been using a random user agent from a list of nearlly 1000 fake user agents

from random_user_agent.user_agent import UserAgent
from random_user_agent.params import SoftwareName, OperatingSystem
software_names = [SoftwareName.ANDROID.value]
operating_systems = [OperatingSystem.WINDOWS.value, OperatingSystem.LINUX.value, OperatingSystem.MAC.value]   

user_agent_rotator = UserAgent(software_names=software_names, operating_systems=operating_systems, limit=1000)

# Get list of user agents.
user_agents = user_agent_rotator.get_user_agents()

user_agent_random = user_agent_rotator.get_random_user_agent()

Example

print(user_agent_random)

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36

For more details visit this link


The root of the answer is that the person asking the question needs to have a JavaScript interpreter to get what they are after. What I have found is I am able to get all of the information I wanted on a website in json before it was interpreted by JavaScript. This has saved me a ton of time in what would be parsing html hoping each webpage is in the same format.

So when you get a response from a website using requests really look at the html/text because you might find the javascripts JSON in the footer ready to be parsed.


Answer

You need to create a header with a proper formatted User agent String, it server to communicate client-server.

You can check your own user agent Here.

Example

Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0
Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0

Third party Package user_agent 0.1.9

I found this module very simple to use, in one line of code it randomly generates a User agent string.

from user_agent import generate_user_agent, generate_navigator
from pprint import pprint

print(generate_user_agent())
# 'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.3; Win64; x64)'

print(generate_user_agent(os=('mac', 'linux')))
# 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:36.0) Gecko/20100101 Firefox/36.0'

pprint(generate_navigator())

# {'app_code_name': 'Mozilla',
#  'app_name': 'Netscape',
#  'appversion': '5.0',
#  'name': 'firefox',
#  'os': 'linux',
#  'oscpu': 'Linux i686 on x86_64',
#  'platform': 'Linux i686 on x86_64',
#  'user_agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686 on x86_64; rv:41.0) Gecko/20100101 Firefox/41.0',
#  'version': '41.0'}

pprint(generate_navigator_js())

# {'appCodeName': 'Mozilla',
#  'appName': 'Netscape',
#  'appVersion': '38.0',
#  'platform': 'MacIntel',
#  'userAgent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:38.0) Gecko/20100101 Firefox/38.0'}

Try doing this, using firefox as fake user agent (moreover, it's a good startup script for web scraping with the use of cookies):

#!/usr/bin/env python2
# -*- coding: utf8 -*-
# vim:ts=4:sw=4


import cookielib, urllib2, sys

def doIt(uri):
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    page = opener.open(uri)
    page.addheaders = [('User-agent', 'Mozilla/5.0')]
    print page.read()

for i in sys.argv[1:]:
    doIt(i)

USAGE:

python script.py "http://www.ichangtou.com/#company:data_000008.html"

I used fake UserAgent.

How to use:

from fake_useragent import UserAgent
import requests
   

ua = UserAgent()
print(ua.chrome)
header = {'User-Agent':str(ua.chrome)}
print(header)
url = "https://www.hybrid-analysis.com/recent-submissions?filter=file&sort=^timestamp"
htmlContent = requests.get(url, headers=header)
print(htmlContent)

Output:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17
{'User-Agent': 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}
<Response [200]>

I had a similar issue but I was unable to use the UserAgent class inside the fake_useragent module. I was running the code inside a docker container

import requests
import ujson
import random

response = requests.get('https://fake-useragent.herokuapp.com/browsers/0.1.11')
agents_dictionary = ujson.loads(response.text)
random_browser_number = str(random.randint(0, len(agents_dictionary['randomize'])))
random_browser = agents_dictionary['randomize'][random_browser_number]
user_agents_list = agents_dictionary['browsers'][random_browser]
user_agent = user_agents_list[random.randint(0, len(user_agents_list)-1)]

I targeted the endpoint used in the module. This solution still gave me a random user agent however there is the possibility that the data structure at the endpoint could change.


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to web-scraping

Scraping: SSL: CERTIFICATE_VERIFY_FAILED error for http://en.wikipedia.org How to print an exception in Python 3? What should I use to open a url instead of urlopen in urllib3 Use Excel VBA to click on a button in Internet Explorer, when the button has no "name" associated How to use Python requests to fake a browser visit a.k.a and generate User Agent? Scraping data from website using vba Using BeautifulSoup to extract text without tags Is it ok to scrape data from Google results? What's the best way of scraping data from a website? Use getElementById on HTMLElement instead of HTMLDocument

Examples related to python-requests

Requests (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available.") Error in PyCharm requesting website Use python requests to download CSV Download and save PDF file with Python requests module ImportError: No module named 'Queue' How to use cookies in Python Requests Saving response from Requests to file How to get Python requests to trust a self signed SSL certificate? How to install requests module in Python 3.4, instead of 2.7 Upload Image using POST form data in Python-requests SSL InsecurePlatform error when using Requests package

Examples related to wget

How to `wget` a list of URLs in a text file? How to install wget in macOS? wget ssl alert handshake failure How to run wget inside Ubuntu Docker image? Unable to establish SSL connection upon wget on Ubuntu 14.04 LTS How to use Python requests to fake a browser visit a.k.a and generate User Agent? wget/curl large file from google drive wget: unable to resolve host address `http' Python equivalent of a given wget command How to download HTTP directory with all files and sub-directories as they appear on the online files/folders list?

Examples related to user-agent

Change user-agent for Selenium web-driver How to use curl to get a GET request exactly same as using Chrome? How to use Python requests to fake a browser visit a.k.a and generate User Agent? Get operating system info Sites not accepting wget user agent header How does HTTP_USER_AGENT work? What is the iOS 6 user agent string? Detect IE version (prior to v9) in JavaScript How to get user agent in PHP How to find the operating system version using JavaScript?