How to extract text from a PDF file

Question

I'm trying to extract the text included in this PDF file using Python.

I'm using the PyPDF2 module, and have the following script:

    import PyPDF2
    pdf_file = open('sample.pdf')
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    page = read_pdf.getPage(0)
    page_content = page.extractText()
    print(page_content)

When I run the code, I get the following output, which is different from the text in the PDF document:

    & & - 01 23 4 5 1 26 3 7 8 & 26 8 3 3 313 9 &

How can I extract the text as it appears in the PDF document?

User · Answer

In some cases PyPDF2 ignores whitespace and makes the resulting text a mess. I use PyMuPDF instead and I'm really satisfied with it. See the PyMuPDF project page for more info.

User · Answer

Use pdfminer.six. Here is the documentation: https://pdfminersix.readthedocs.io/en/latest/index.html

To convert a PDF to text:

    def pdf_to_text():
        from pdfminer.high_level import extract_text
        text = extract_text('test.pdf')
        print(text)

User · Answer

I found a solution here: PDFLayoutTextStripper.

It's good because it can keep the layout of the original PDF.

It's written in Java, but I have added a gateway to support Python.

Sample code:

    from py4j.java_gateway import JavaGateway

    gw = JavaGateway()
    result = gw.entry_point.strip('samples/bus.pdf')

    # result is a dict of {
    #   'success': 'true' or 'false',
    #   'payload': pdf file content if 'success' is 'true'
    #   'error': error message if 'success' is 'false'
    # }

    print(result['payload'])

You can see more details here: Stripper with Python.

User · Answer

I've got a better workaround than OCR, and it maintains the page alignment while extracting the text from a PDF. It should be of help:

    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    from io import StringIO

    def convert_pdf_to_txt(path):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        fp = open(path, 'rb')
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0
        caching = True
        pagenos = set()

        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                      password=password, caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)

        text = retstr.getvalue()

        fp.close()
        device.close()
        retstr.close()
        return text

    text = convert_pdf_to_txt('test.pdf')
    print(text)

User · Answer

Use textract.

http://textract.readthedocs.io/en/latest/
https://github.com/deanmalmgren/textract

It supports many types of files, including PDFs.

    import textract
    text = textract.process("path/to/file.extension")

User · Answer

I've tried many Python PDF converters, and I'd like to update this review. Tika is one of the best. PyMuPDF, suggested by user @ehsaneha, is good news too.

I wrote some code to compare them: https://github.com/erfelipe/PDFtextExtraction. I hope it helps you.

"Tika-Python is a Python binding to the Apache Tika REST services, allowing Tika to be called natively in the Python community."

    from tika import parser

    raw = parser.from_file("/Users/Documents/Textos/Texto1.pdf")
    raw = str(raw)

    safe_text = raw.encode('utf-8', errors='ignore')

    safe_text = str(safe_text).replace("\n", "").replace("\\", "")
    print('--- safe text ---')
    print(safe_text)

User · Answer

How to extract text from a PDF file?

The first thing to understand is the PDF format. It has a public specification written in English: see ISO 32000-2:2017 and read the more than 700 pages of the PDF 1.7 specification. At the very least you need to read the Wikipedia page about PDF.

Once you have understood the details of the PDF format, extracting text is more or less easy (but what about text appearing in figures or images, such as its figure 1?). Don't expect to write a perfect text extractor alone in a few weeks.

On Linux, you might also use pdftotext, which you could popen from your Python code.

In general, extracting text from a PDF file is an ill-defined problem. For a human reader, some text could be made (as a figure) from different dots, or from a photo, etc.

The Google search engine is capable of extracting text from PDF, but is rumored to need more than half a billion lines of source code. Do you have the necessary resources (in manpower, in budget) to develop a competitor?

One possibility is to print the PDF to some virtual printer (e.g. using GhostScript or Firefox), then use OCR techniques to extract the text.

I would recommend instead working on the data representation which generated that PDF file, for example the original LaTeX code (or Lout code) or the OOXML code.

In all cases, you need to budget at least several person-years of software development.
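
To make the "ill-defined problem" point concrete, here is a minimal sketch of how text sits inside a page's content stream and what naive extraction does with it. It assumes an already-decompressed stream that uses only literal strings with the Tj and TJ text-showing operators; the sample stream and the helper name naive_text_from_stream are made up for illustration. Real PDFs add compression, font encodings, and positioning that this sketch ignores.

```python
import re

# A hand-written, already-decompressed content stream (illustrative only).
SAMPLE_STREAM = rb"""
BT
/F1 12 Tf
72 720 Td
(Hello) Tj
[(Wor) -20 (ld)] TJ
ET
"""

# Literal string: parentheses with backslash escapes, nesting not handled.
_LITERAL = rb"\((?:\\.|[^\\()])*\)"
_SHOW_OP = re.compile(rb"(" + _LITERAL + rb")\s*Tj|\[((?:.|\n)*?)\]\s*TJ")


def _unescape(raw):
    """Strip the parentheses and resolve the simplest backslash escapes."""
    body = re.sub(rb"\\([()\\])", rb"\1", raw[1:-1])
    return body.decode("latin-1")


def naive_text_from_stream(stream):
    """Return the strings shown by Tj/TJ, in content-stream order."""
    chunks = []
    for match in _SHOW_OP.finditer(stream):
        if match.group(1) is not None:        # (string) Tj
            chunks.append(_unescape(match.group(1)))
        else:                                 # [(a) -20 (b)] TJ
            parts = re.findall(_LITERAL, match.group(2))
            chunks.append("".join(_unescape(p) for p in parts))
    return chunks


print(naive_text_from_stream(SAMPLE_STREAM))  # ['Hello', 'World']
```

Even in this toy case the extractor has to decide that "Wor" and "ld" belong to one word. Nothing in the stream says so explicitly, which is one reason different tools give different results.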

User · Answer

I recommend using pymupdf or pdfminer.six.

These packages are not maintained:

- PyPDF2, PyPDF3, PyPDF4
- pdfminer (without .six)

How to read pure text with pymupdf

There are different options which will give different results, but the most basic one is:

    import fitz  # this is pymupdf

    with fitz.open("my.pdf") as doc:
        text = ""
        for page in doc:
            text += page.get_text()  # named getText() in older PyMuPDF releases

    print(text)

User · Answer

pdftotext is the best and simplest one. It also preserves the structure of the document.

I tried PyPDF2, PDFMiner and a few others, but none of them gave a satisfactory result.
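
Since this answer gives no code, here is a minimal sketch of calling the pdftotext command-line tool from Python. It assumes the pdftotext binary (from Poppler or Xpdf) is installed and on the PATH; the helper names and the file name sample.pdf are illustrative. The -layout flag asks the tool to keep the physical layout, and a trailing - sends the text to standard output.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=True, binary="pdftotext"):
    """Build the argument list for pdftotext, writing text to stdout."""
    args = [binary, "-enc", "UTF-8"]
    if keep_layout:
        args.append("-layout")   # keep the physical layout of the page
    args += [pdf_path, "-"]      # "-" means: write to standard output
    return args


def pdf_to_text(pdf_path, keep_layout=True):
    """Run pdftotext and return the extracted text as a str."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext binary not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8")


# Only the command is shown here; call pdf_to_text("sample.pdf") on a real file.
print(build_pdftotext_command("sample.pdf"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.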

User · Answer

Look at this code:

    import PyPDF2
    pdf_file = open('sample.pdf', 'rb')
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    page = read_pdf.getPage(0)
    page_content = page.extractText()
    print(page_content.encode('utf-8'))

The output is:

    & & - 01 23 4 5 1 26 3 7 8 & 26 8 3 3 313 9 &

Using the same code to read a PDF from 201308FCR.pdf, the output is normal.

Its documentation explains why:

    def extractText(self):
        """
        Locate all text drawing commands, in the order they are provided in the
        content stream, and extract the text.  This works well for some PDF
        files, but poorly for others, depending on the generator used.  This will
        be refined in the future.  Do not rely on the order of text coming out of
        this function, as it will change if this function is made more
        sophisticated.

        :return: a unicode string object.
        """
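
The docstring's warning about ordering can be illustrated with a small sketch. It assumes the text fragments have already been pulled out of the content stream together with their page coordinates. The tuples below are made-up data, and reading_order is a hypothetical helper, not part of PyPDF2. PDF coordinates grow upward, so grouping by descending y and then sorting by x recovers a top-to-bottom, left-to-right order for simple single-column pages.

```python
# Each fragment is (x, y, text) in PDF user space, listed in the order
# a generator happened to write them into the content stream.
fragments_in_stream_order = [
    (72, 700, "second line"),
    (150, 720, "line"),
    (72, 720, "first"),
]


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by y, then sort each line by x."""
    lines = []  # list of (y, [(x, text), ...]) from top to bottom
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]


print([text for _, _, text in fragments_in_stream_order])
print(reading_order(fragments_in_stream_order))
```

An extractor that simply follows stream order returns the first list. Layout-aware tools do something closer to the second, with far more elaborate heuristics for columns and tables.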

User · Answer

You can download tika-app-xxx.jar (latest) from here.

Then put this .jar file in the same folder as your Python script file, and insert the following code in the script:

    import os
    import os.path

    tika_dir = os.path.join(os.path.dirname(__file__), '<tika-app-xxx>.jar')

    def extract_pdf(source_pdf: str, target_txt: str):
        os.system('java -jar ' + tika_dir + ' -t {} > {}'.format(source_pdf, target_txt))

The advantages of this method:

- Fewer dependencies. A single .jar file is easier to manage than a Python package.
- Multi-format support. The source_pdf argument can be the path of any kind of document (.doc, .html, .odt, etc.).
- Up to date. tika-app.jar is always released earlier than the corresponding version of the tika Python package.
- Stable. It is far more stable and better maintained (powered by Apache) than PyPDF.

The disadvantage: a jre-headless is necessary.

User · Answer

PyPDF2 does work, but results may vary. I am seeing quite inconsistent results from its text extraction.

    reader = PyPDF2.pdf.PdfFileReader(self._path)
    eachPageText = []
    for i in range(0, reader.getNumPages()):
        pageText = reader.getPage(i).extractText()
        print(pageText)
        eachPageText.append(pageText)

User · Answer

I am adding code to accomplish this. It is working fine for me:

    # This works in python 3
    # required python packages
    # tabula-py==1.0.0
    # PyPDF2==1.26.0
    # Pillow==4.0.0
    # pdfminer.six==20170720

    import os
    import shutil
    import struct
    import warnings
    from io import StringIO

    import requests
    import tabula
    from PIL import Image
    from PyPDF2 import PdfFileWriter, PdfFileReader
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage

    warnings.filterwarnings("ignore")


    def download_file(url):
        local_filename = url.split('/')[-1]
        local_filename = local_filename.replace("%20", "_")

        r = requests.get(url, stream=True)
        print(r)
        with open(local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)

        return local_filename


    class PDFExtractor():
        def __init__(self, url):
            self.url = url

        # Downloading File in local
        def break_pdf(self, filename, start_page=-1, end_page=-1):
            pdf_reader = PdfFileReader(open(filename, "rb"))
            # Reading each pdf one by one
            total_pages = pdf_reader.numPages
            if start_page == -1:
                start_page = 0
            elif start_page < 1 or start_page > total_pages:
                return "Start Page Selection Is Wrong"
            else:
                start_page = start_page - 1

            if end_page == -1:
                end_page = total_pages
            elif end_page < 1 or end_page > total_pages - 1:
                return "End Page Selection Is Wrong"
            else:
                end_page = end_page

            for i in range(start_page, end_page):
                output = PdfFileWriter()
                output.addPage(pdf_reader.getPage(i))
                with open(str(i + 1) + "_" + filename, "wb") as outputStream:
                    output.write(outputStream)

        def extract_text_algo_1(self, file):
            pdf_reader = PdfFileReader(open(file, 'rb'))
            # creating a page object
            pageObj = pdf_reader.getPage(0)

            # extracting text from page
            text = pageObj.extractText()
            text = text.replace("\n", "").replace("\t", "")
            return text

        def extract_text_algo_2(self, file):
            pdfResourceManager = PDFResourceManager()
            retstr = StringIO()
            la_params = LAParams()
            device = TextConverter(pdfResourceManager, retstr, codec='utf-8', laparams=la_params)
            fp = open(file, 'rb')
            interpreter = PDFPageInterpreter(pdfResourceManager, device)
            password = ""
            max_pages = 0
            caching = True
            page_num = set()

            for page in PDFPage.get_pages(fp, page_num, maxpages=max_pages, password=password, caching=caching,
                                          check_extractable=True):
                interpreter.process_page(page)

            text = retstr.getvalue()
            text = text.replace("\t", "").replace("\n", "")

            fp.close()
            device.close()
            retstr.close()
            return text

        def extract_text(self, file):
            # Keep whichever of the two extractors returned more text
            text1 = self.extract_text_algo_1(file)
            text2 = self.extract_text_algo_2(file)

            if len(text2) > len(str(text1)):
                return text2
            else:
                return text1

        def extract_table(self, file):
            # Read pdf into DataFrame
            try:
                df = tabula.read_pdf(file, output_format="csv")
            except:
                print("Error Reading Table")
                return

            print("\nPrinting Table Content: \n", df)
            print("\nDone Printing Table Content\n")

        def tiff_header_for_CCITT(self, width, height, img_size, CCITT_group=4):
            tiff_header_struct = '<' + '2s' + 'h' + 'l' + 'h' + 'hhll' * 8 + 'h'
            return struct.pack(tiff_header_struct,
                               b'II',  # Byte order indication: Little endian
                               42,  # Version number (always 42)
                               8,  # Offset to first IFD
                               8,  # Number of tags in IFD
                               256, 4, 1, width,  # ImageWidth, LONG, 1, width
                               257, 4, 1, height,  # ImageLength, LONG, 1, length
                               258, 3, 1, 1,  # BitsPerSample, SHORT, 1, 1
                               259, 3, 1, CCITT_group,  # Compression, SHORT, 1, 4 = CCITT Group 4 fax encoding
                               262, 3, 1, 0,  # PhotometricInterpretation, SHORT, 1, 0 = WhiteIsZero
                               273, 4, 1, struct.calcsize(tiff_header_struct),  # StripOffsets, LONG, 1, len of header
                               278, 4, 1, height,  # RowsPerStrip, LONG, 1, length
                               279, 4, 1, img_size,  # StripByteCounts, LONG, 1, size of extracted image
                               0  # last IFD
                               )

        def extract_image(self, filename):
            number = 1
            pdf_reader = PdfFileReader(open(filename, 'rb'))

            for i in range(0, pdf_reader.numPages):
                page = pdf_reader.getPage(i)

                try:
                    xObject = page['/Resources']['/XObject'].getObject()
                except:
                    print("No XObject Found")
                    return

                for obj in xObject:
                    try:
                        if xObject[obj]['/Subtype'] == '/Image':
                            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
                            data = xObject[obj]._data
                            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                                mode = "RGB"
                            else:
                                mode = "P"

                            image_name = filename.split(".")[0] + str(number)

                            print(xObject[obj]['/Filter'])

                            if xObject[obj]['/Filter'] == '/FlateDecode':
                                data = xObject[obj].getData()
                                img = Image.frombytes(mode, size, data)
                                img.save(image_name + "_Flate.png")
                                # save_to_s3(imagename + "_Flate.png")
                                print("Image_Saved")

                                number += 1
                            elif xObject[obj]['/Filter'] == '/DCTDecode':
                                img = open(image_name + "_DCT.jpg", "wb")
                                img.write(data)
                                # save_to_s3(imagename + "_DCT.jpg")
                                img.close()
                                number += 1
                            elif xObject[obj]['/Filter'] == '/JPXDecode':
                                img = open(image_name + "_JPX.jp2", "wb")
                                img.write(data)
                                # save_to_s3(imagename + "_JPX.jp2")
                                img.close()
                                number += 1
                            elif xObject[obj]['/Filter'] == '/CCITTFaxDecode':
                                if xObject[obj]['/DecodeParms']['/K'] == -1:
                                    CCITT_group = 4
                                else:
                                    CCITT_group = 3
                                width = xObject[obj]['/Width']
                                height = xObject[obj]['/Height']
                                data = xObject[obj]._data  # sorry, getData() does not work for CCITTFaxDecode
                                img_size = len(data)
                                tiff_header = self.tiff_header_for_CCITT(width, height, img_size, CCITT_group)
                                img_name = image_name + '_CCITT.tiff'
                                with open(img_name, 'wb') as img_file:
                                    img_file.write(tiff_header + data)

                                # save_to_s3(img_name)
                                number += 1
                    except:
                        continue

            return number

        def read_pages(self, start_page=-1, end_page=-1):

            # Downloading file locally
            downloaded_file = download_file(self.url)
            print(downloaded_file)

            # breaking PDF into number of pages in diff pdf files
            self.break_pdf(downloaded_file, start_page, end_page)

            # creating a pdf reader object
            pdf_reader = PdfFileReader(open(downloaded_file, 'rb'))

            # Reading each pdf one by one
            total_pages = pdf_reader.numPages

            if start_page == -1:
                start_page = 0
            elif start_page < 1 or start_page > total_pages:
                return "Start Page Selection Is Wrong"
            else:
                start_page = start_page - 1

            if end_page == -1:
                end_page = total_pages
            elif end_page < 1 or end_page > total_pages - 1:
                return "End Page Selection Is Wrong"
            else:
                end_page = end_page

            for i in range(start_page, end_page):
                # creating a page based filename
                file = str(i + 1) + "_" + downloaded_file

                print("\nStarting to Read Page: ", i + 1, "\n -----------===-------------")

                file_text = self.extract_text(file)
                print(file_text)
                self.extract_image(file)

                self.extract_table(file)
                os.remove(file)
                print("Stopped Reading Page: ", i + 1, "\n -----------===-------------")

            os.remove(downloaded_file)


    # I have tested on these 3 pdf files
    # url = "http://s3.amazonaws.com/NLP_Project/Original_Documents/Healthcare-January-2017.pdf"
    url = "http://s3.amazonaws.com/NLP_Project/Original_Documents/Sample_Test.pdf"
    # url = "http://s3.amazonaws.com/NLP_Project/Original_Documents/Sazerac_FS_2017_06_30%20Annual.pdf"
    # creating the instance of class
    pdf_extractor = PDFExtractor(url)

    # Getting desired data out
    pdf_extractor.read_pages(15, 23)

User · Answer

If you try it in Anaconda on Windows, PyPDF2 might not handle some PDFs with a non-standard structure or Unicode characters. I recommend the following code if you need to open and read a lot of PDF files. The text of all PDF files in the folder with the relative path ./pdfs/ will be stored in the list pdf_text_list.

    from tika import parser
    import glob

    def read_pdf(filename):
        text = parser.from_file(filename)
        return text

    all_files = glob.glob("./pdfs/*.pdf")
    pdf_text_list = []
    for i, file in enumerate(all_files):
        text = read_pdf(file)
        pdf_text_list.append(text['content'])

    print(pdf_text_list)

User · Answer

After trying textract (which seemed to have too many dependencies), pypdf2 (which could not extract text from the PDFs I tested with) and tika (which was too slow), I ended up using pdftotext from xpdf (as already suggested in another answer) and just called the binary from Python directly (you may need to adapt the path to pdftotext):

    import os, subprocess
    SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
    args = ["/usr/local/bin/pdftotext",
            '-enc',
            'UTF-8',
            "{}/my-pdf.pdf".format(SCRIPT_DIR),
            '-']
    res = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    output = res.stdout.decode('utf-8')

There is also a pdftotext package which does basically the same, but it assumes pdftotext is in /usr/local/bin, whereas I am using this in AWS Lambda and wanted to use it from the current directory.

Btw: to use this on Lambda you need to put the binary and its dependency libstdc++.so into your Lambda function. I personally needed to compile xpdf. As instructions for this would blow up this answer, I put them on my personal blog.

User · Answer

To extract text from a PDF, use the code below:

    import PyPDF2
    pdfFileObj = open('mypdf.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    print(pdfReader.numPages)
    pageObj = pdfReader.getPage(0)
    a = pageObj.extractText()
    print(a)

User · Answer

If you want to extract text from a table, I've found tabula to be easily implemented, accurate, and fast.

To get a pandas DataFrame:

    import tabula

    df = tabula.read_pdf('your.pdf')

    df

By default, it ignores page content outside of the table. So far, I've only tested on a single-page, single-table file, but there are kwargs to accommodate multiple pages and/or multiple tables.

Install via:

    pip install tabula-py
    # or
    conda install -c conda-forge tabula-py

In terms of straight-up text extraction, see: https://stackoverflow.com/a/63190886/9249533

User · Answer

In 2020 the solutions above were not working for the particular PDF I was working with. Below is what did the trick. I am on Windows 10 and Python 3.8.

Test PDF file: https://drive.google.com/file/d/1aUfQAlvq5hA9kz2c9CyJADiY3KpY3-Vn/view?usp=sharing

    # pip install pdfminer.six
    import io

    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage


    def convert_pdf_to_txt(path):
        '''Convert pdf content from a file path to text

        :path the file path
        '''
        rsrcmgr = PDFResourceManager()
        codec = 'utf-8'
        laparams = LAParams()

        with io.StringIO() as retstr:
            with TextConverter(rsrcmgr, retstr, codec=codec,
                               laparams=laparams) as device:
                with open(path, 'rb') as fp:
                    interpreter = PDFPageInterpreter(rsrcmgr, device)
                    password = ""
                    maxpages = 0
                    caching = True
                    pagenos = set()

                    for page in PDFPage.get_pages(fp,
                                                  pagenos,
                                                  maxpages=maxpages,
                                                  password=password,
                                                  caching=caching,
                                                  check_extractable=True):
                        interpreter.process_page(page)

                    return retstr.getvalue()


    if __name__ == "__main__":
        print(convert_pdf_to_txt('C:\\Path\\To\\Test_PDF.pdf'))

User · Answer

You may want to use the time-proven xPDF and its derived tools to extract text instead, as PyPDF2 still seems to have various issues with text extraction.

The long answer is that there are a lot of variations in how text is encoded inside a PDF. Extraction may require decoding the PDF string itself, then mapping it through a CMap, then analyzing the distance between words and letters, and so on.

In case the PDF is damaged (i.e. it displays the correct text, but copying it gives garbage) and you really need to extract the text, you may want to consider converting the PDF into an image (using ImageMagick) and then using Tesseract to get the text from the image via OCR.
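
Skipping the CMap step described above is a common way to end up with garbage output. The following sketch assumes a font whose strings are 2-byte glyph codes and whose ToUnicode CMap contains only simple bfchar entries. The CMap text, the hex string, and the function names are made up for illustration; real CMaps also use bfrange entries and variable code lengths.

```python
import re

# A tiny, hand-written ToUnicode CMap fragment (illustrative only).
SAMPLE_CMAP = """
3 beginbfchar
<0001> <0048>
<0002> <0069>
<0003> <0021>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Map glyph codes to Unicode strings from simple bfchar entries."""
    mapping = {}
    pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", cmap_text)
    for code, uni in pairs:
        mapping[int(code, 16)] = bytes.fromhex(uni).decode("utf-16-be")
    return mapping


def decode_hex_string(hex_string, mapping, code_bytes=2):
    """Decode a PDF hex string such as <000100020003> using the mapping."""
    raw = bytes.fromhex(hex_string.strip("<>"))
    codes = [int.from_bytes(raw[i:i + code_bytes], "big")
             for i in range(0, len(raw), code_bytes)]
    # Unknown codes become U+FFFD instead of being guessed.
    return "".join(mapping.get(code, "\ufffd") for code in codes)


cmap = parse_bfchar(SAMPLE_CMAP)
print(decode_hex_string("<000100020003>", cmap))  # text recovered via the CMap
print(bytes.fromhex("000100020003"))              # the raw glyph codes alone
```

When a PDF lacks a usable ToUnicode map, no amount of parsing recovers the characters, which is when the OCR route becomes the practical option.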

User · Answer

You can use pdftotext: https://github.com/jalan/pdftotext

pdftotext keeps the indentation and format of the text, and it doesn't matter if you have tables.
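
A minimal usage sketch for that package, assuming it is installed (pip install pdftotext, which needs the Poppler libraries). The function names and the file name are illustrative. The package is imported inside the function so that the small joining helper can be used and tested without it.

```python
def join_pages(pages, separator="\n\n"):
    """Join per-page strings into one document string."""
    return separator.join(pages)


def read_pdf_text(path):
    """Extract all pages of a PDF with the pdftotext package."""
    import pdftotext  # imported lazily; requires the package to be installed

    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f)    # behaves like a sequence of page strings
        return join_pages(pdf)


# Example call on a real file: print(read_pdf_text("sample.pdf"))
print(join_pages(["page one", "page two"]))
```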

User · Answer

Camelot seems a fairly powerful solution for extracting tables from PDFs in Python.

At first sight it seems to achieve almost as accurate extraction as the tabula-py package suggested by CreekGeek (which is already way above any other posted solution as of today in terms of reliability), but it is supposedly much more configurable. Furthermore, it has its own accuracy indicator (results.parsing_report) and great debugging features.

Both Camelot and Tabula provide the results as pandas DataFrames, so it is easy to adjust tables afterwards.

    pip install camelot-py

(Not to be confused with the camelot package.)

    import camelot

    df_list = []
    results = camelot.read_pdf("file.pdf")  # add any further options you need
    for table in results:
        print(table.parsing_report)
        df_list.append(table.df)

It can also output results as CSV, JSON, HTML or Excel.

Camelot comes at the expense of a number of dependencies.

NB: Since my input is pretty complex, with many different tables, I ended up using both Camelot and Tabula, depending on the table, to achieve the best results.

User · Answer

A more robust way, supposing there are multiple PDFs or just one:

    import os
    from PyPDF2 import PdfFileReader

    mydir = "path/to/pdfs"  # specify the path to the directory where your PDF or PDFs are

    for arch in os.listdir(mydir):
        archpath = os.path.join(mydir, arch)
        with open(archpath, 'rb') as pdfFileObj:
            pdfReader = PdfFileReader(pdfFileObj)
            pageObj = pdfReader.getPage(0)
            ley = pageObj.extractText()
        # append, so that text from earlier PDFs is not overwritten
        with open("myfile.txt", "a") as file1:
            file1.write(ley)

User · Answer

The code below is a solution to the question in Python 3. Before running it, make sure you have installed the PyPDF2 library in your environment. If it is not installed, open the command prompt and run the following command:

    pip3 install PyPDF2

Solution code:

    import PyPDF2
    pdfFileObject = open('sample.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
    count = pdfReader.numPages
    for i in range(count):
        page = pdfReader.getPage(i)
        print(page.extractText())

User · Answer

Here is the simplest code for extracting text:

    # importing required modules
    import PyPDF2

    # creating a pdf file object
    pdfFileObj = open('filename.pdf', 'rb')

    # creating a pdf reader object
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # printing number of pages in pdf file
    print(pdfReader.numPages)

    # creating a page object
    pageObj = pdfReader.getPage(5)

    # extracting text from page
    print(pageObj.extractText())

    # closing the pdf file object
    pdfFileObj.close()

User · Answer

A multi-page PDF can be extracted as text in one go, instead of passing an individual page number as an argument, using the code below:

    import PyPDF2
    pdf_file = open('samples.pdf', 'rb')
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    for i in range(number_of_pages):
        page = read_pdf.getPage(i)
        page_content = page.extractText()
        print(page_content.encode('utf-8'))

User · Answer

I was looking for a simple solution to use with Python 3.x on Windows. There doesn't seem to be support from textract, which is unfortunate, but if you are looking for a simple solution for Windows and Python 3, check out the tika package. It is really straightforward for reading PDFs.

"Tika-Python is a Python binding to the Apache Tika REST services, allowing Tika to be called natively in the Python community."

    from tika import parser  # pip install tika

    raw = parser.from_file('sample.pdf')
    print(raw['content'])

Note that Tika is written in Java, so you will need a Java runtime installed.
