As of May 2020, this works with pdfminer.six on Python 3.
$ pip install pdfminer.six
from pdfminer.high_level import extract_text
text = extract_text('report.pdf')
Alternatively, pass an open file object:
with open('report.pdf', 'rb') as f:
    text = extract_text(f)
If the PDF is already in memory, for example after retrieving it from the web with the requests library, it can be converted to a stream using the io library:
import io
import requests
response = requests.get(url)
text = extract_text(io.BytesIO(response.content))
pdfminer.six works more reliably than PyPDF2, which fails with certain types of PDFs, in particular PDF version 1.7.
However, text extraction with pdfminer.six is significantly slower than with PyPDF2, by roughly a factor of 6.
I timed text extraction with timeit on a 15" MacBook Pro (2018), timing only the extraction function (no file opening etc.) with a 10-page PDF, and got the following results:
PDFminer.six: 2.88 sec
PyPDF2: 0.45 sec
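A minimal sketch of how such a timing could be set up with timeit, so that file reading stays outside the measured call. The helper names time_extraction and time_pdfminer are made up for illustration, and this is not necessarily the script behind the numbers above:

```python
import io
import timeit


def time_extraction(extract, pdf_bytes, repeats=3):
    """Return the best wall-clock time in seconds of extract() over `repeats` runs.

    The PDF bytes are loaded before timing starts, so only the
    extraction call itself is measured.
    """
    def run():
        # Build a fresh stream for every run so each call reads from the start.
        extract(io.BytesIO(pdf_bytes))

    # number=1: one extraction per measurement; min() keeps the least noisy run.
    return min(timeit.repeat(run, number=1, repeat=repeats))


def time_pdfminer(path, repeats=3):
    """Hypothetical convenience wrapper: time pdfminer.six on one file."""
    # Imported here so time_extraction stays usable without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    with open(path, "rb") as f:
        pdf_bytes = f.read()
    return time_extraction(extract_text, pdf_bytes, repeats)
```

Calling time_pdfminer('report.pdf') returns the fastest of three runs in seconds. Any other extraction function that accepts a binary stream can be passed to time_extraction in the same way to compare libraries on the same file.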
pdfminer.six also has a huge footprint: it requires pycryptodome, which needs GCC and other build tools installed. This pushed a minimal Docker image on Alpine Linux from 80 MB to 350 MB. PyPDF2 has no noticeable storage impact.