I am trying to download a PDF file from a website and save it to disk. My attempts either fail with encoding errors or result in blank PDFs.
In [1]: import requests
In [2]: url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
In [3]: response = requests.get(url)
In [4]: with open('/tmp/metadata.pdf', 'wb') as f:
...: f.write(response.text)
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-4-4be915a4f032> in <module>()
1 with open('/tmp/metadata.pdf', 'wb') as f:
----> 2 f.write(response.text)
3
UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-14: ordinal not in range(128)
In [5]: import codecs
In [6]: with codecs.open('/tmp/metadata.pdf', 'wb', encoding='utf8') as f:
...: f.write(response.text)
...:
I know it is a codec problem of some kind but I can't seem to get it to work.
This question is related to
python
python-2.7
python-requests
regarding Kevin answer to write in a folder tmp
, it should be like this:
with open('./tmp/metadata.pdf', 'wb') as f:
f.write(response.content)
he forgot .
before the address and of-course your folder tmp
should have been created already
Please note I'm a beginner. If My solution is wrong, please feel free to correct and/or let me know. I may learn something new too.
My solution:
Change the downloadPath accordingly to where you want your file to be saved. Feel free to use the absolute path too for your usage.
Save the below as downloadFile.py.
Usage: python downloadFile.py url-of-the-file-to-download new-file-name.extension
Remember to add an extension!
Example usage: python downloadFile.py http://www.google.co.uk google.html
import requests
import sys
import os
def downloadFile(url, fileName):
with open(fileName, "wb") as file:
response = requests.get(url)
file.write(response.content)
scriptPath = sys.path[0]
downloadPath = os.path.join(scriptPath, '../Downloads/')
url = sys.argv[1]
fileName = sys.argv[2]
print('path of the script: ' + scriptPath)
print('downloading file to: ' + downloadPath)
downloadFile(url, downloadPath + fileName)
print('file downloaded...')
print('exiting program...')
Generally, this should work in Python3:
import urllib.request
..
urllib.request.get(url)
Remember that urllib and urllib2 don't work properly after Python2.
If in some mysterious cases requests don't work (happened with me), you can also try using
wget.download(url)
Related:
Here's a decent explanation/solution to find and download all pdf files on a webpage:
In Python 3, I find pathlib is the easiest way to do this. Request's response.content marries up nicely with pathlib's write_bytes.
from pathlib import Path
import requests
filename = Path('metadata.pdf')
url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
response = requests.get(url)
filename.write_bytes(response.content)
You can use urllib:
import urllib.request
urllib.request.urlretrieve(url, "filename.pdf")
Source: Stackoverflow.com