Python module for converting PDF to text

Question

Is there any Python module to convert PDF files into text? I tried one piece of code found on ActiveState which uses pyPdf, but the generated text had no spaces between words and was of no use.

User · Answer

Found this solution today. Works great for me, even for rendering PDF pages to PNG images: http://www.swftools.org/gfx_tutorial.html

User · Answer

The PDFMiner package has changed since codeape posted.

EDIT (again):

PDFMiner has been updated again in version 20100213. You can check the version you have installed with the following:

    >>> import pdfminer
    >>> pdfminer.__version__
    '20100213'

Here's the updated version (with comments on what I changed/added):

    def pdf_to_csv(filename):
        from cStringIO import StringIO  # <-- added so you can copy/paste this to try it
        from pdfminer.converter import LTTextItem, TextConverter
        from pdfminer.pdfparser import PDFDocument, PDFParser
        from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter

        class CsvConverter(TextConverter):
            def __init__(self, *args, **kwargs):
                TextConverter.__init__(self, *args, **kwargs)

            def end_page(self, i):
                from collections import defaultdict
                lines = defaultdict(lambda: {})
                for child in self.cur_item.objs:
                    if isinstance(child, LTTextItem):
                        (_, _, x, y) = child.bbox                # <-- changed
                        line = lines[int(-y)]
                        line[x] = child.text.encode(self.codec)  # <-- changed

                for y in sorted(lines.keys()):
                    line = lines[y]
                    self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                    self.outfp.write("\n")

        # ... the following part of the code is a remix of the
        # convert() function in the pdfminer/tools/pdf2text module
        rsrc = PDFResourceManager()
        outfp = StringIO()
        device = CsvConverter(rsrc, outfp, codec="utf-8")  # <-- changed
            # because my test documents are utf-8 (note: utf-8 is the default codec)

        doc = PDFDocument()
        fp = open(filename, 'rb')
        parser = PDFParser(fp)       # <-- changed
        parser.set_document(doc)     # <-- added
        doc.set_parser(parser)       # <-- added
        doc.initialize('')

        interpreter = PDFPageInterpreter(rsrc, device)

        for i, page in enumerate(doc.get_pages()):
            outfp.write("START PAGE %d\n" % i)
            interpreter.process_page(page)
            outfp.write("END PAGE %d\n" % i)

        device.close()
        fp.close()

        return outfp.getvalue()

Edit (yet again):

Here is an update for the latest version in PyPI, 20100619p1. In short, I replaced LTTextItem with LTChar and passed an instance of LAParams to the CsvConverter constructor.

    def pdf_to_csv(filename):
        from cStringIO import StringIO
        from pdfminer.converter import LTChar, TextConverter    # <-- changed
        from pdfminer.layout import LAParams
        from pdfminer.pdfparser import PDFDocument, PDFParser
        from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter

        class CsvConverter(TextConverter):
            def __init__(self, *args, **kwargs):
                TextConverter.__init__(self, *args, **kwargs)

            def end_page(self, i):
                from collections import defaultdict
                lines = defaultdict(lambda: {})
                for child in self.cur_item.objs:
                    if isinstance(child, LTChar):               # <-- changed
                        (_, _, x, y) = child.bbox
                        line = lines[int(-y)]
                        line[x] = child.text.encode(self.codec)

                for y in sorted(lines.keys()):
                    line = lines[y]
                    self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                    self.outfp.write("\n")

        # ... the following part of the code is a remix of the
        # convert() function in the pdfminer/tools/pdf2text module
        rsrc = PDFResourceManager()
        outfp = StringIO()
        device = CsvConverter(rsrc, outfp, codec="utf-8", laparams=LAParams())  # <-- changed
            # because my test documents are utf-8 (note: utf-8 is the default codec)

        doc = PDFDocument()
        fp = open(filename, 'rb')
        parser = PDFParser(fp)
        parser.set_document(doc)
        doc.set_parser(parser)
        doc.initialize('')

        interpreter = PDFPageInterpreter(rsrc, device)

        for i, page in enumerate(doc.get_pages()):
            outfp.write("START PAGE %d\n" % i)
            if page is not None:
                interpreter.process_page(page)
            outfp.write("END PAGE %d\n" % i)

        device.close()
        fp.close()

        return outfp.getvalue()

EDIT (one more time):

Updated for version 20110515 (thanks to Oeufcoque Penteano!):

    def pdf_to_csv(filename):
        from cStringIO import StringIO
        from pdfminer.converter import LTChar, TextConverter
        from pdfminer.layout import LAParams
        from pdfminer.pdfparser import PDFDocument, PDFParser
        from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter

        class CsvConverter(TextConverter):
            def __init__(self, *args, **kwargs):
                TextConverter.__init__(self, *args, **kwargs)

            def end_page(self, i):
                from collections import defaultdict
                lines = defaultdict(lambda: {})
                for child in self.cur_item._objs:                 # <-- changed
                    if isinstance(child, LTChar):
                        (_, _, x, y) = child.bbox
                        line = lines[int(-y)]
                        line[x] = child._text.encode(self.codec)  # <-- changed

                for y in sorted(lines.keys()):
                    line = lines[y]
                    self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                    self.outfp.write("\n")

        # ... the following part of the code is a remix of the
        # convert() function in the pdfminer/tools/pdf2text module
        rsrc = PDFResourceManager()
        outfp = StringIO()
        device = CsvConverter(rsrc, outfp, codec="utf-8", laparams=LAParams())
            # because my test documents are utf-8 (note: utf-8 is the default codec)

        doc = PDFDocument()
        fp = open(filename, 'rb')
        parser = PDFParser(fp)
        parser.set_document(doc)
        doc.set_parser(parser)
        doc.initialize('')

        interpreter = PDFPageInterpreter(rsrc, device)

        for i, page in enumerate(doc.get_pages()):
            outfp.write("START PAGE %d\n" % i)
            if page is not None:
                interpreter.process_page(page)
            outfp.write("END PAGE %d\n" % i)

        device.close()
        fp.close()

        return outfp.getvalue()

User · Answer

pyPdf works fine (assuming that you're working with well-formed PDFs). If all you want is the text (with spaces), you can just do:

    import pyPdf
    pdf = pyPdf.PdfFileReader(open(filename, "rb"))
    for page in pdf.pages:
        print page.extractText()

You can also easily get access to the metadata, image data, and so forth.

A comment in the extractText code notes:

    Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.

Whether or not this is a problem depends on what you're doing with the text (e.g. if the order doesn't matter, it's fine, or if the generator adds text to the stream in the order it will be displayed, it's fine). I have pyPdf extraction code in daily use, without any problems.

User · Answer

Repurposing the pdf2txt.py code that comes with pdfminer, you can make a function that takes a path to the PDF and, optionally, an outtype (txt|html|xml|tag) and opts like the command-line pdf2txt, e.g. {'-o': '/path/to/outfile.txt', ...}. By default, you can call:

    convert_pdf(path)

A text file will be created, a sibling on the filesystem to the original PDF.

    def convert_pdf(path, outtype='txt', opts={}):
        import sys
        from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter, process_pdf
        from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter, TagExtractor
        from pdfminer.layout import LAParams
        from pdfminer.pdfparser import PDFDocument, PDFParser
        from pdfminer.pdfdevice import PDFDevice
        from pdfminer.cmapdb import CMapDB

        # default output file: same path with the extension swapped for outtype
        outfile = path[:-3] + outtype
        outdir = '/'.join(path.split('/')[:-1])

        debug = 0
        # input option
        password = ''
        pagenos = set()
        maxpages = 0
        # output option
        codec = 'utf-8'
        pageno = 1
        scale = 1
        showpageno = True
        laparams = LAParams()
        # opts is a dict, so iterate over its (flag, value) pairs
        for (k, v) in opts.items():
            if k == '-d': debug += 1
            elif k == '-p': pagenos.update(int(x) - 1 for x in v.split(','))
            elif k == '-m': maxpages = int(v)
            elif k == '-P': password = v
            elif k == '-o': outfile = v
            elif k == '-n': laparams = None
            elif k == '-A': laparams.all_texts = True
            elif k == '-D': laparams.writing_mode = v
            elif k == '-M': laparams.char_margin = float(v)
            elif k == '-L': laparams.line_margin = float(v)
            elif k == '-W': laparams.word_margin = float(v)
            elif k == '-O': outdir = v
            elif k == '-t': outtype = v
            elif k == '-c': codec = v
            elif k == '-s': scale = float(v)
        #
        CMapDB.debug = debug
        PDFResourceManager.debug = debug
        PDFDocument.debug = debug
        PDFParser.debug = debug
        PDFPageInterpreter.debug = debug
        PDFDevice.debug = debug
        #
        rsrcmgr = PDFResourceManager()
        if not outtype:
            outtype = 'txt'
            if outfile:
                if outfile.endswith('.htm') or outfile.endswith('.html'):
                    outtype = 'html'
                elif outfile.endswith('.xml'):
                    outtype = 'xml'
                elif outfile.endswith('.tag'):
                    outtype = 'tag'
        if outfile:
            outfp = file(outfile, 'w')
        else:
            outfp = sys.stdout
        if outtype == 'txt':
            device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams)
        elif outtype == 'xml':
            device = XMLConverter(rsrcmgr, outfp, codec=codec, laparams=laparams, outdir=outdir)
        elif outtype == 'html':
            device = HTMLConverter(rsrcmgr, outfp, codec=codec, scale=scale, laparams=laparams, outdir=outdir)
        elif outtype == 'tag':
            device = TagExtractor(rsrcmgr, outfp, codec=codec)
        else:
            # unsupported output type
            raise ValueError('unknown outtype: %r' % outtype)

        fp = file(path, 'rb')
        process_pdf(rsrcmgr, device, fp, pagenos, maxpages=maxpages, password=password)
        fp.close()
        device.close()

        outfp.close()
        return

User · Answer

You can also quite easily use pdfminer as a library. You have access to the PDF's content model, and can create your own text extraction. I did this to convert PDF contents to semicolon-separated text, using the code below.

The function simply sorts the TextItem content objects according to their y and x coordinates, and outputs items with the same y coordinate as one text line, separating the objects on the same line with ';' characters.

Using this approach, I was able to extract text from a PDF that no other tool could turn into content suitable for further parsing. Other tools I tried include pdftotext, ps2ascii and the online tool pdftextonline.com.

pdfminer is an invaluable tool for pdf-scraping.

    def pdf_to_csv(filename):
        from cStringIO import StringIO  # needed for the output buffer below
        from pdflib.page import TextItem, TextConverter
        from pdflib.pdfparser import PDFDocument, PDFParser
        from pdflib.pdfinterp import PDFResourceManager, PDFPageInterpreter

        class CsvConverter(TextConverter):
            def __init__(self, *args, **kwargs):
                TextConverter.__init__(self, *args, **kwargs)

            def end_page(self, i):
                from collections import defaultdict
                lines = defaultdict(lambda: {})
                for child in self.cur_item.objs:
                    if isinstance(child, TextItem):
                        (_, _, x, y) = child.bbox
                        line = lines[int(-y)]
                        line[x] = child.text

                for y in sorted(lines.keys()):
                    line = lines[y]
                    self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                    self.outfp.write("\n")

        # ... the following part of the code is a remix of the
        # convert() function in the pdfminer/tools/pdf2text module
        rsrc = PDFResourceManager()
        outfp = StringIO()
        device = CsvConverter(rsrc, outfp, "ascii")

        doc = PDFDocument()
        fp = open(filename, 'rb')
        parser = PDFParser(doc, fp)
        doc.initialize('')

        interpreter = PDFPageInterpreter(rsrc, device)

        for i, page in enumerate(doc.get_pages()):
            outfp.write("START PAGE %d\n" % i)
            interpreter.process_page(page)
            outfp.write("END PAGE %d\n" % i)

        device.close()
        fp.close()

        return outfp.getvalue()

UPDATE: The code above is written against an old version of the API, see my comment below.
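The grouping step described above (same y coordinate means same line, then order by x) can be tried without pdfminer at all. This is only a sketch of that idea in plain Python: the function name items_to_rows and the (x, y, text) tuples are made up for illustration and are not part of any pdfminer API.

```python
from collections import defaultdict


def items_to_rows(items, separator=";"):
    """Group (x, y, text) items into text rows.

    Items sharing the same integer y coordinate form one row. PDF y
    coordinates grow upward, so -y is used as the key to emit rows from
    the top of the page down; items within a row go left to right.
    """
    lines = defaultdict(dict)
    for x, y, text in items:
        lines[int(-y)][x] = text
    return [
        separator.join(line[x] for x in sorted(line))
        for _, line in sorted(lines.items())
    ]


# Hypothetical coordinates, deliberately out of order.
sample = [
    (200.0, 700.2, "Price"),
    (50.0, 700.9, "Item"),
    (50.0, 680.0, "Widget"),
    (200.0, 680.0, "9.99"),
]
rows = items_to_rows(sample)
print("\n".join(rows))
```

Truncating y to an integer is what lets slightly misaligned items land on the same row; text whose baselines differ by more than that will still end up on separate rows.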

User · Answer

I have used pdftohtml with the -xml argument and read the result with subprocess.Popen(). That will give you the x coord, y coord, width, height, and font of every snippet of text in the PDF. I think this is what 'evince' probably uses too, because the same error messages spew out.

If you need to process columnar data, it gets slightly more complicated, as you have to invent an algorithm that suits your PDF file. The problem is that the programs that make PDF files don't necessarily lay out the text in any logical order. You can try simple sorting algorithms and they work sometimes, but there can be little 'stragglers' and 'strays': pieces of text that don't get put in the order you thought they would. So you have to get creative.

It took me about 5 hours to figure out one for the PDFs I was working on. But it works pretty well now. Good luck.
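As a rough starting point for the "simple sorting" approach, here is a sketch that parses pdftohtml's XML and orders the snippets page by page, top to bottom, left to right. It assumes the output has page elements containing text elements with integer top and left attributes, and that your pdftohtml build accepts -i (ignore images) and -stdout; check your version. The function names are my own.

```python
import subprocess
import xml.etree.ElementTree as ET


def run_pdftohtml_xml(pdf_path):
    """Run pdftohtml and return its XML output as bytes (needs pdftohtml installed)."""
    result = subprocess.run(
        ["pdftohtml", "-xml", "-i", "-stdout", pdf_path],
        capture_output=True,
        check=True,
    )
    return result.stdout


def snippets_in_reading_order(xml_data):
    """Return (page, top, left, text) tuples sorted by page, then top, then left."""
    root = ET.fromstring(xml_data)
    snippets = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for node in page.iter("text"):
            # itertext() also picks up text inside nested <b>/<i> tags
            text = "".join(node.itertext()).strip()
            if text:
                snippets.append(
                    (page_no, int(node.get("top")), int(node.get("left")), text)
                )
    return sorted(snippets)


# Hand-written sample in the assumed format, not real tool output.
sample_xml = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="120" left="300" width="40" height="12" font="0">World</text>
<text top="120" left="100" width="40" height="12" font="0"><b>Hello</b></text>
<text top="90" left="100" width="60" height="12" font="0">Title</text>
</page>
</pdf2xml>"""
ordered = snippets_in_reading_order(sample_xml)
```

This plain sort is exactly the kind that produces stragglers on multi-column pages, so treat it as the baseline to improve on, not the finished algorithm.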

User · Answer

slate is a project that makes it very simple to use PDFMiner from a library:

    >>> with open('example.pdf') as f:
    ...     doc = slate.PDF(f)
    ...
    >>> doc
    [..., ..., ...]
    >>> doc[1]
    'Text from page 2...'

User · Answer

Pdftotext: an open source program (part of Xpdf) which you could call from Python (not what you asked for, but it might be useful). I've used it with no problems. I think Google uses it in Google Desktop.
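Calling it from Python could look roughly like the sketch below. It assumes a pdftotext binary on your PATH that supports the -layout flag and treats "-" as "write to standard output"; the helper names are my own.

```python
import subprocess


def pdftotext_command(pdf_path, layout=True):
    """Build the argument list; '-' as the output file means standard output."""
    command = ["pdftotext"]
    if layout:
        command.append("-layout")  # try to keep the physical page layout
    command += [pdf_path, "-"]
    return command


def pdf_to_text(pdf_path, layout=True):
    """Return the text of pdf_path as a string (raises if pdftotext fails)."""
    result = subprocess.run(
        pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Usage would then be something like text = pdf_to_text("example.pdf"), where example.pdf is a placeholder for your own file.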

User · Answer

Additionally, there is PDFTextStream, a commercial Java library that can also be used from Python.

User · Answer

Since none of these solutions support the latest version of PDFMiner, I wrote a simple solution that returns the text of a PDF using PDFMiner. This will work for those who are getting import errors with process_pdf.

    import sys
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage
    from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
    from pdfminer.layout import LAParams
    from cStringIO import StringIO

    def pdfparser(data):

        fp = file(data, 'rb')
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        # Create a PDF interpreter object.
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Process each page contained in the document.

        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)

        # Collect the accumulated text once all pages are processed.
        data = retstr.getvalue()

        print data

    if __name__ == '__main__':
        pdfparser(sys.argv[1])

See the code below, which works for Python 3:

    import sys
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage
    from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
    from pdfminer.layout import LAParams
    import io

    def pdfparser(data):

        fp = open(data, 'rb')
        rsrcmgr = PDFResourceManager()
        retstr = io.StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        # Create a PDF interpreter object.
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Process each page contained in the document.

        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)

        # Collect the accumulated text once all pages are processed.
        data = retstr.getvalue()

        print(data)

    if __name__ == '__main__':
        pdfparser(sys.argv[1])

User · Answer

PDFMiner gave me perhaps one line [page 1 of 7...] on every page of a PDF file I tried with it.

The best answer I have so far is pdftoipe, or the C++ code it's based on, Xpdf.

See my question for what the output of pdftoipe looks like.

User · Answer

Try PDFMiner. It can extract text from PDF files as HTML, SGML or "Tagged PDF" format.

The Tagged PDF format seems to be the cleanest, and stripping out the XML tags leaves just the bare text.

A Python 3 version is available under: https://github.com/pdfminer/pdfminer.six
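For the tag-stripping step, something like the following sketch may be enough. It assumes the extracted output is well-formed XML (real output may need cleaning first), and the sample document below is invented for illustration rather than actual PDFMiner output.

```python
import xml.etree.ElementTree as ET


def strip_xml_tags(xml_text):
    """Return only the text content of an XML document, whitespace-normalised."""
    root = ET.fromstring(xml_text)
    # Join the text pieces with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(root.itertext()).split())


# Invented sample; element names are placeholders.
sample = "<document><P>First paragraph.</P><P>Second paragraph.</P></document>"
bare_text = strip_xml_tags(sample)
print(bare_text)
```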

User · Answer

I needed to convert a specific PDF to plain text within a Python module. I used PDFMiner 20110515; after reading through their pdf2txt.py tool, I wrote this simple snippet:

    from cStringIO import StringIO
    from pdfminer.pdfinterp import PDFResourceManager, process_pdf
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams

    def to_txt(pdf_path):
        input_ = file(pdf_path, 'rb')
        output = StringIO()

        manager = PDFResourceManager()
        converter = TextConverter(manager, output, laparams=LAParams())
        process_pdf(manager, converter, input_)

        return output.getvalue()
