[python] Extracting text from a PDF file using PDFMiner in python?

This works in May 2020 using PDFminer six in Python3.

Installing the package

$ pip install pdfminer.six

Importing the package

from pdfminer.high_level import extract_text

Using a PDF saved on disk

text = extract_text('report.pdf')

Or alternatively:

with open('report.pdf','rb') as f:
    text = extract_text(f)

Using PDF already in memory

If the PDF is already in memory, for example if retrieved from the web with the requests library, it can be converted to a stream using the io library:

import io

response = requests.get(url)
text = extract_text(io.BytesIO(response.content))

Performance and Reliability compared with PyPDF2

PDFminer.six works more reliably than PyPDF2 (which fails with certain types of PDFs), in particular PDF version 1.7

However, text extraction with PDFminer.six is significantly slower than PyPDF2 by a factor of 6.

I timed text extraction with timeit on a 15" MBP (2018), timing only the extraction function (no file opening etc.) with a 10 page PDF and got the following results:

PDFminer.six: 2.88 sec
PyPDF2:       0.45 sec

pdfminer.six also has a huge footprint, requiring pycryptodome which needs GCC and other things installed pushing a minimal install docker image on Alpine Linux from 80 MB to 350 MB. PyPDF2 has no noticeable storage impact.

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to python-3.x

Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation Replace specific text with a redacted version using Python Upgrade to python 3.8 using conda "Permission Denied" trying to run Python on Windows 10 Python: 'ModuleNotFoundError' when trying to import module from imported package What is the meaning of "Failed building wheel for X" in pip install? How to downgrade python from 3.7 to 3.6 I can't install pyaudio on Windows? How to solve "error: Microsoft Visual C++ 14.0 is required."? Iterating over arrays in Python 3 How to upgrade Python version to 3.7?

Examples related to python-2.7

Numpy, multiply array with scalar Not able to install Python packages [SSL: TLSV1_ALERT_PROTOCOL_VERSION] How to create a new text file using Python Could not find a version that satisfies the requirement tensorflow Python: Pandas pd.read_excel giving ImportError: Install xlrd >= 0.9.0 for Excel support Display/Print one column from a DataFrame of Series in Pandas How to calculate 1st and 3rd quartiles? How can I read pdf in python? How to completely uninstall python 2.7.13 on Ubuntu 16.04 Check key exist in python dict

Examples related to text-extraction

How can I read pdf in python? Extracting text from a PDF file using PDFMiner in python? Getting URL parameter in java and extract a specific text from that URL Extract a single (unsigned) integer from a string How to extract string following a pattern with grep, regex or perl How to extract a substring using regex How to extract text from a PDF? PDF Parsing Using Python - extracting formatted and plain texts Python module for converting PDF to text

Examples related to pdfminer

Extracting text from a PDF file using PDFMiner in python?