Extracting text from a PDF file using PDFMiner in Python

Question

I am looking for documentation or examples on how to extract text from a PDF file using PDFMiner with Python. It looks like PDFMiner updated their API, and all the relevant examples I have found contain outdated code (classes and methods have changed). The libraries I have found that make the task of extracting text from a PDF file easier use the old PDFMiner syntax, so I'm not sure how to do this. As it is, I'm just looking at the source code to see if I can figure it out.

User · Answer

This works as of May 2020 using pdfminer.six with Python 3.

Installing the package

$ pip install pdfminer.six

Importing the package

from pdfminer.high_level import extract_text

Using a PDF saved on disk

text = extract_text('report.pdf')

Or alternatively:

with open('report.pdf','rb') as f:
    text = extract_text(f)

Using PDF already in memory

If the PDF is already in memory, for example if retrieved from the web with the requests library, it can be converted to a stream using the io library:

import io

import requests

# url is the address of the PDF to download
response = requests.get(url)
text = extract_text(io.BytesIO(response.content))
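
When the bytes come from the network, the response may not be a PDF at all (an HTML error page, for instance). Below is a minimal sketch of a guard that wraps the bytes in a stream only if they start with the usual %PDF- header. The helper name as_pdf_stream is hypothetical and not part of pdfminer.six, and the strict header check is an assumption that some slightly malformed files may not satisfy.

```python
import io


def as_pdf_stream(data: bytes) -> io.BytesIO:
    """Wrap raw bytes in a seekable binary stream after a basic sanity check.

    Hypothetical helper: assumes a well-formed PDF begins with b"%PDF-".
    """
    if not data.startswith(b"%PDF-"):
        raise ValueError("data does not look like a PDF (missing %PDF- header)")
    # BytesIO gives a file-like object positioned at offset 0.
    return io.BytesIO(data)
```

The returned stream could then be passed on in place of the bare io.BytesIO call, for example extract_text(as_pdf_stream(response.content)).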

Performance and Reliability compared with PyPDF2

PDFminer.six works more reliably than PyPDF2, which fails with certain types of PDFs, in particular PDF version 1.7.

However, text extraction with PDFminer.six is significantly slower than with PyPDF2, by a factor of about 6.

I timed text extraction with timeit on a 15" MBP (2018), timing only the extraction function (no file opening etc.) with a 10-page PDF, and got the following results:

PDFminer.six: 2.88 sec
PyPDF2:       0.45 sec
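
A comparison like this could be reproduced with a small timeit harness along the following lines. This is a sketch, not the code behind the figures quoted here: time_extraction and its parameters are hypothetical names, and the extraction call is passed in as a function so that the harness does not depend on either library.

```python
import timeit
from typing import Callable


def time_extraction(extract: Callable[[], object], repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` single runs."""
    # timeit.repeat returns one total per repetition; with number=1 each
    # total is the duration of a single call. The minimum is the least
    # disturbed by other activity on the machine.
    return min(timeit.repeat(extract, number=1, repeat=repeats))
```

With the file contents already read into a bytes object named data, a call might look like time_extraction(lambda: extract_text(io.BytesIO(data))), which keeps file opening out of the measurement.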

pdfminer.six also has a large footprint: it requires pycryptodome, which needs GCC and other build tools to be installed. This pushed a minimal Docker image on Alpine Linux from 80 MB to 350 MB. PyPDF2 has no noticeable storage impact.

User · Answer

Terrific answer from DuckPuncher. For Python 3, make sure you install pdfminer2 and do:

import io

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    fp.close()
    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text

User · Answer

This code is tested with pdfminer for Python 3 (pdfminer-20191125):

from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.layout import LTTextBoxHorizontal


def parsedocument(document):
    # convert all horizontal text into a lines list (one entry per line)
    # document is a file stream
    lines = []
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(document):
        interpreter.process_page(page)
        layout = device.get_result()
        for element in layout:
            if isinstance(element, LTTextBoxHorizontal):
                lines.extend(element.get_text().splitlines())
    return lines

User · Answer

Full disclosure: I am one of the maintainers of pdfminer.six.

Nowadays, there are multiple APIs to extract text from a PDF, depending on your needs. Behind the scenes, all of these APIs use the same logic for parsing and analyzing the layout.

(All the examples assume your PDF file is called example.pdf.)

Commandline

If you want to extract text just once, you can use the commandline tool pdf2txt.py:

$ pdf2txt.py example.pdf

High-level API

If you want to extract text with Python, you can use the high-level API. This approach is the go-to solution if you want to extract text programmatically from many PDFs.

from pdfminer.high_level import extract_text

text = extract_text('example.pdf')

Composable API

There is also a composable API that gives a lot of flexibility in handling the resulting objects. For example, you can implement your own layout algorithm using that. This method is suggested in the other answers, but I would only recommend it when you need to customize the way pdfminer.six behaves.

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('example.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())
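
For the case of extracting text from many PDFs with the high-level API, a batch loop might look like the sketch below. The function extract_many and its extractor parameter are hypothetical names introduced here for illustration. By default it falls back to extract_text from pdfminer.high_level, imported lazily, and the choice to record an empty string for a file that fails is an assumption, not pdfminer.six behaviour.

```python
from typing import Callable, Dict, Iterable, Optional


def extract_many(paths: Iterable[str],
                 extractor: Optional[Callable[[str], str]] = None) -> Dict[str, str]:
    """Map each path to its extracted text; files that fail map to ''."""
    if extractor is None:
        # Imported lazily so the loop can be tried with a stand-in extractor
        # even where pdfminer.six is not installed.
        from pdfminer.high_level import extract_text as extractor
    results: Dict[str, str] = {}
    for path in paths:
        try:
            results[str(path)] = extractor(str(path))
        except Exception as exc:
            # Broad on purpose: one unreadable PDF should not stop the batch.
            print(f"skipping {path}: {exc}")
            results[str(path)] = ""
    return results
```

A call such as extract_many(['a.pdf', 'b.pdf']) would then return a dictionary keyed by file name.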

User · Answer

Here is a working example of extracting text from a PDF file using the current version of PDFMiner (September 2016):

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

PDFMiner's structure changed recently, so this should work for extracting text from PDF files.

Edit: Still working as of June 7th, 2018. Verified with Python 3.x.

Edit: The solution works with Python 3.7 as of October 3, 2019. I used the Python library pdfminer.six, released in November 2018.
