How to extract text from a PDF

Question

Can anyone recommend a library/API for extracting the text and images from a PDF? We need to be able to get at text that is contained in pre-known regions of the document, so the API will need to give us positional information for each element on the page.

We would like that data to be output in XML or JSON format. We're currently looking at PdfTextStream, which seems pretty good, but we would like to hear other people's experiences and suggestions.

Are there alternatives (commercial or free) for extracting text from a PDF programmatically?

User · Answer

Here is my suggestion. If you want to extract text from a PDF, you could import the PDF file into Google Docs, then export it to a more friendly format such as .html, .odf, .rtf, .txt, etc. All of this can be done using the Drive API. It is free and robust. Take a look at:

https://developers.google.com/drive/v2/reference/files/insert
https://developers.google.com/drive/v2/reference/files/get

Because it is a REST API, it is compatible with all programming languages. The links above have working examples for many languages, including Java, .NET, Python, PHP, Ruby, and others.

I hope it helps.

User · Answer

An efficient command-line tool, open source, free of any fee, available on both Linux and Windows: simply named pdftotext. This tool is part of the Xpdf library.

http://en.wikipedia.org/wiki/Pdftotext
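
For the "pre-known regions" part of the question, here is a minimal sketch of driving pdftotext from Python. It assumes the Poppler build of pdftotext is on the PATH (the `-f`, `-l`, `-x`, `-y`, `-W`, `-H` and `-layout` options used here belong to that build; other builds may differ). The function names and the coordinates are made up for illustration.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, page, x, y, width, height):
    """Build a pdftotext command that extracts one rectangle of one page.

    Assumes Poppler's pdftotext. x, y, width and height are in pixels at the
    tool's default resolution of 72 DPI, measured from the top-left corner.
    """
    return [
        "pdftotext",
        "-f", str(page), "-l", str(page),     # first and last page to convert
        "-x", str(x), "-y", str(y),           # top-left corner of the crop area
        "-W", str(width), "-H", str(height),  # size of the crop area
        "-layout",                            # keep the physical layout
        pdf_path,
        "-",                                  # write the text to stdout
    ]


def extract_region(pdf_path, page, x, y, width, height):
    """Run pdftotext and return the text found inside the region."""
    cmd = build_pdftotext_cmd(pdf_path, page, x, y, width, height)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

A call such as `extract_region("input.pdf", page=1, x=50, y=100, width=300, height=40)` would then return only the text inside that box, provided the tool is installed.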

User · Answer

Since today I know it: the best thing for text extraction from PDFs is TET, the Text Extraction Toolkit. TET is part of the PDFlib.com family of products.

PDFlib.com is Thomas Merz's company. In case you don't recognize his name: Thomas Merz is the author of the "PostScript and PDF Bible".

TET's first incarnation is a library. That one can probably do everything Budda006 wanted, including positional information about every element on the page. Oh, and it can also extract images. It recombines images which are fragmented into pieces.

pdflib.com also offers another incarnation of this technology, the TET plugin for Acrobat. And the third incarnation is the PDFlib TET iFilter. This is a standalone tool for user desktops. Both these are free (as in beer) to use for private, non-commercial purposes.

And it's really powerful. Way better than Adobe's own text extraction. It extracted text for me where other tools (including Adobe's) spit out only garbage.

I just tested the desktop standalone tool, and what they say on their webpage is true. It has a very good command line. It handled some of my "problematic" PDF test files to my full satisfaction.

From now on, this will be my recommendation for every sophisticated and challenging PDF text extraction requirement.

TET is simply awesome. It detects tables. Inside tables, it identifies cells spanning multiple columns. It identifies table rows and the contents of each table cell separately. It deals very well with hyphenation: it removes hyphens and restores complete words. It supports non-ASCII languages (including CJK, Arabic and Hebrew). When encountering ligatures, it restores the original characters...

Give it a try.

User · Answer

The best thing I can currently think of (within the list of "simple" tools) is Ghostscript (current version is v8.71) and the PostScript utility program ps2ascii.ps. Ghostscript ships it in its lib subdirectory. Try this (on Windows):

    gswin32c.exe ^
      -q ^
      -sFONTPATH=c:/windows/fonts ^
      -dNODISPLAY ^
      -dSAFER ^
      -dDELAYBIND ^
      -dWRITESYSTEMDICT ^
      -dCOMPLEX ^
      -f ps2ascii.ps ^
      -dFirstPage=3 ^
      -dLastPage=7 ^
      input.pdf ^
      -dQUIET ^
      -c quit

This command processes pages 3-7 of input.pdf. Read the comments in the ps2ascii.ps file itself to see what the "weird" numbers and additional info mean (they indicate strings, positions, widths, colors, pictures, rectangles, fonts and page breaks).

To get a "simple" text output, replace the -dCOMPLEX part with -dSIMPLE.

User · Answer

PdfTextStream (which you said you have been looking at) is now free for single-threaded applications. In my opinion its quality is much better than other libraries (especially for things like funky embedded fonts, etc.).

Alternatively, you should have a look at Apache PDFBox, which is open source.

User · Answer

One of the comments here used gs on Windows. I had some success with that on Linux/OS X too, with the following syntax:

    gs \
      -q \
      -dNODISPLAY \
      -dSAFER \
      -dDELAYBIND \
      -dWRITESYSTEMDICT \
      -dSIMPLE \
      -f ps2ascii.ps \
      "$input" \
      -dQUIET \
      -c quit

I used -dSIMPLE instead of -dCOMPLEX because the latter outputs one character per line.

User · Answer

The Docotic.Pdf library may be used to extract text from PDF files as plain text or as a collection of text chunks with coordinates for each chunk. Docotic.Pdf can be used to extract images from PDFs, too.

Disclaimer: I work for Bit Miracle.

User · Answer

QuickPDF seems to be a reasonable library that should do what you want for a reasonable price: http://www.quickpdflibrary.com/ (they have a 30-day trial).

User · Answer

On my Macintosh systems, I find that Adobe Reader does a reasonably good job. I created an alias on my Desktop that points to "Adobe Reader.app", and all I do is drop a PDF file on the alias, which makes it the active document in Adobe Reader. Then, from the File menu, I choose "Save as Text...", give it a name and a location, click "Save", and I'm done.

User · Answer

Apache PDFBox has this feature. The text part is described in:

http://pdfbox.apache.org/apidocs/org/apache/pdfbox/util/PDFTextStripper.html

For an example implementation, see https://github.com/WolfgangFahl/pdfindexer

The test case TestPdfIndexer.testExtracting shows how it works.

User · Answer

As the question is specifically about alternative tools to get data from a PDF as XML, you may be interested in the commercial tool "ByteScout PDF Extractor SDK", which is capable of doing exactly this: it extracts text from a PDF as XML along with the positioning data (x, y) and font information.

Text in the source PDF:

    Products | Units | Price

Output XML:

    <row>
      <column>
        <text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="212" y="126" width="47" height="11">Products</text>
      </column>
      <column>
        <text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="428" y="126" width="27" height="11">Units</text>
      </column>
      <column>
        <text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="503" y="126" width="26" height="11">Price</text>
      </column>
    </row>

P.S.: additionally, it also breaks the text into a table-based structure.

Disclosure: I work for ByteScout.
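
Once positional XML of this shape is available, picking out the text in a pre-known region is a small filtering step. The following is only a sketch: it reuses the element and attribute names from the sample output above, and the function name and region coordinates are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Sample copied from the XML output shown in this answer.
SAMPLE = """<row>
  <column>
    <text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="212" y="126" width="47" height="11">Products</text>
  </column>
  <column>
    <text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="428" y="126" width="27" height="11">Units</text>
  </column>
  <column>
    <text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="503" y="126" width="26" height="11">Price</text>
  </column>
</row>"""


def texts_in_region(xml_string, left, top, right, bottom):
    """Return the text of every <text> element whose box lies fully inside the region."""
    root = ET.fromstring(xml_string)
    found = []
    for node in root.iter("text"):
        x = float(node.get("x"))
        y = float(node.get("y"))
        w = float(node.get("width"))
        h = float(node.get("height"))
        if left <= x and x + w <= right and top <= y and y + h <= bottom:
            found.append(node.text)
    return found


# Hypothetical region covering the right-hand part of the header row.
print(texts_in_region(SAMPLE, 400, 100, 560, 150))  # prints ['Units', 'Price']
```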

User · Answer

I was given a 400-page PDF file with a table of data that I had to import (luckily no images). Ghostscript worked for me:

    gswin64c -sDEVICE=txtwrite -o output.txt input.pdf

The output file was split into pages with headers, etc., but it was then easy to write an app to strip out blank lines, etc., and suck in all 30,000 records. -dSIMPLE and -dCOMPLEX made no difference in this case.
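
The clean-up step described here could look roughly like the sketch below. It is not the app from this answer: the function name, the sample text and the header pattern are all hypothetical, and the real header text depends on the document.

```python
import re


def clean_txtwrite_output(text, header_pattern=None):
    """Drop blank lines and, optionally, lines matching a repeated page header.

    header_pattern is a regular expression supplied by the caller (an
    assumption for this sketch; adapt it to the headers in your own file).
    """
    header = re.compile(header_pattern) if header_pattern else None
    kept = []
    # str.splitlines() also splits on form feeds, so page breaks marked
    # that way do not end up glued to the following line.
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank line
        if header and header.search(stripped):
            continue  # repeated page header
        kept.append(stripped)
    return kept


# Made-up sample standing in for the contents of output.txt.
raw = "Report 2010  Page 1\n\nalice  10\n\nbob    20\n\fReport 2010  Page 2\n\ncarol  30\n"
records = clean_txtwrite_output(raw, r"^Report \d{4}\s+Page \d+$")
print(records)  # prints ['alice  10', 'bob    20', 'carol  30']
```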

User · Answer

I know that this topic is quite old, but this need is still alive. I read many documents, forums and scripts and built a new, advanced one which supports compressed and uncompressed PDFs:

https://gist.github.com/smalot/6183152

In some cases, the command line is forbidden for security reasons, so a native PHP class can fit many needs.

Hope it helps everyone.

User · Answer

For Python, there is PDFMiner and pyPDF2. For more information on these, see "Python module for converting PDF to text".
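
As a rough sketch of how PDFMiner can produce the positional JSON the question asks for: the code below assumes the pdfminer.six package (the maintained fork of PDFMiner) is installed, and the function names and record layout are made up for illustration. pdfminer reports coordinates in points with the origin at the bottom-left corner of the page.

```python
import json


def to_record(page_number, bbox, text):
    """Turn one text box into a JSON-friendly dict with its position."""
    x0, y0, x1, y1 = bbox
    return {
        "page": page_number,
        "x0": round(x0, 2), "y0": round(y0, 2),
        "x1": round(x1, 2), "y1": round(y1, 2),
        "text": text.strip(),
    }


def pdf_to_json(pdf_path):
    """Extract every text box of a PDF with its coordinates, as a JSON string."""
    # Imported here so to_record() stays usable without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    records = []
    for page_number, layout in enumerate(extract_pages(pdf_path), start=1):
        for element in layout:
            # Only text boxes carry text; figures and lines are skipped.
            if isinstance(element, LTTextContainer):
                records.append(
                    to_record(page_number, element.bbox, element.get_text())
                )
    return json.dumps(records, indent=2)
```

Calling `pdf_to_json("input.pdf")` would then return one JSON object per text box, which can be filtered by coordinates to match the known regions.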

User · Answer

For image extraction, pdfimages is a free command-line tool for Linux or Windows (win32):

pdfimages: Extract and Save Images From A Portable Document Format (PDF) File
