Parsing PDF files (especially with tables) with PDFBox

Question

I need to parse a PDF file which contains tabular data. I'm using PDFBox to extract the file text and parse the result (a String) later. The problem is that the text extraction doesn't work as I expected for tabular data. For example, I have a file which contains a table like this (7 columns: the first two always have data, only one Complexity column has data, only one Financing column has data):

    ------------------------------------------------------------------
    | AIH | Value | Complexity                     | Financing       |
    |     |       | Medium | High | Not applicable | MAC/Other | FAE |
    ------------------------------------------------------------------
    | xyz | 12.43 | 12.34  |      |                | 12.34     |     |
    ------------------------------------------------------------------
    | abc | 1.56  |        | 1.56 |                |           | 1.56|
    ------------------------------------------------------------------

Then I use PDFBox:

    PDDocument document = PDDocument.load(pathToFile);
    PDFTextStripper s = new PDFTextStripper();
    String content = s.getText(document);

Those two lines of data would be extracted like this:

    xyz 12.43 12.4312.43
    abc 1.56 1.561.56

There are no white spaces between the last two numbers, but this is not the biggest problem. The problem is that I don't know what the last two numbers mean: Medium, High, Not applicable? MAC/Other, FAE? I don't have the relation between the numbers and their columns.

It is not required for me to use the PDFBox library, so a solution that uses another library is fine. What I want is to be able to parse the file and know what each parsed number means.

User · Answer

http://swftools.org/: these guys have a pdf2swf component. They are also able to show tables. They also provide the source, so you could possibly check it out.

User · Answer

You can use PDFBox's PDFTextStripperByArea class to extract text from a specific region of a document. You can build on this by identifying the region of each cell of the table. This isn't provided out of the box, but the example DrawPrintTextLocations class demonstrates how you can parse the bounding boxes of individual characters in a document (it would be great to parse bounding boxes of strings or paragraphs, but I haven't seen support in PDFBox for this; see this question).

You can use this approach to group up all touching bounding boxes to identify distinct cells of a table. One way to do this is to maintain a set boxes of Rectangle2D regions and then, for each parsed character, find the character's bounding box as in DrawPrintTextLocations.writeString(String string, List<TextPosition> textPositions) and merge it with the existing contents:

    Rectangle2D bounds = s.getBounds2D();
    // Pad sides to detect almost touching boxes
    Rectangle2D hitbox = bounds.getBounds2D();
    final double dx = 1.0;   // This value works for me, feel free to tweak (or add setter)
    final double dy = 0.000; // Rows of text tend to overlap, so no need to extend
    hitbox.add(bounds.getMinX() - dx, bounds.getMinY() - dy);
    hitbox.add(bounds.getMaxX() + dx, bounds.getMaxY() + dy);

    // Find all overlapping boxes
    List<Rectangle2D> intersectList = new ArrayList<Rectangle2D>();
    for (Rectangle2D box : boxes) {
        if (box.intersects(hitbox)) {
            intersectList.add(box);
        }
    }

    // Combine all touching boxes and update
    for (Rectangle2D box : intersectList) {
        bounds.add(box);
        boxes.remove(box);
    }
    boxes.add(bounds);

You can then pass these regions to PDFTextStripperByArea.

You can also go one further and separate out the horizontal and vertical components of these regions, and so infer regions of all the table's cells, regardless of whether they hold any content.

I have had cause to perform these steps, and eventually wrote my own PDFTableStripper class using PDFBox. I've shared my code as a gist on GitHub. The main method gives an example of how the class can be used:

    try (PDDocument document = PDDocument.load(new File(args[0]))) {
        final double res = 72; // PDF units are at 72 DPI
        PDFTableStripper stripper = new PDFTableStripper();
        stripper.setSortByPosition(true);

        // Choose a region in which to extract a table
        // (here a 6" wide, 9" high rectangle offset 1" from top left of page)
        stripper.setRegion(new Rectangle(
                (int) Math.round(1.0 * res),
                (int) Math.round(1 * res),
                (int) Math.round(6 * res),
                (int) Math.round(9.0 * res)));

        // Repeat for each page of PDF
        for (int page = 0; page < document.getNumberOfPages(); ++page) {
            System.out.println("Page " + page);
            PDPage pdPage = document.getPage(page);
            stripper.extractTable(pdPage);
            for (int c = 0; c < stripper.getColumns(); ++c) {
                System.out.println("Column " + c);
                for (int r = 0; r < stripper.getRows(); ++r) {
                    System.out.println("Row " + r);
                    System.out.println(stripper.getText(r, c));
                }
            }
        }
    }
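The idea of separating out the horizontal components of the merged regions can be illustrated without PDFBox. The following is only a minimal sketch, not the code from the gist: the class name ColumnBands and the method columnBands are made up for illustration, and it assumes that boxes belonging to the same column overlap when projected onto the x-axis.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper: collapses box extents along the x-axis into column bands. */
class ColumnBands {

    /** Returns merged [minX, maxX] intervals, sorted left to right. */
    static List<double[]> columnBands(List<? extends Rectangle2D> boxes) {
        List<Rectangle2D> sorted = new ArrayList<>(boxes);
        sorted.sort(Comparator.comparingDouble(Rectangle2D::getMinX));

        List<double[]> bands = new ArrayList<>();
        for (Rectangle2D box : sorted) {
            double[] last = bands.isEmpty() ? null : bands.get(bands.size() - 1);
            if (last != null && box.getMinX() <= last[1]) {
                // Overlaps the current band horizontally: widen it if needed.
                last[1] = Math.max(last[1], box.getMaxX());
            } else {
                // Starts to the right of every band seen so far: new column.
                bands.add(new double[] {box.getMinX(), box.getMaxX()});
            }
        }
        return bands;
    }

    public static void main(String[] args) {
        List<Rectangle2D> boxes = List.of(
                new Rectangle2D.Double(10, 10, 30, 8),    // row 1, first column
                new Rectangle2D.Double(12, 30, 20, 8),    // row 2, first column
                new Rectangle2D.Double(60, 10, 25, 8),    // row 1, second column
                new Rectangle2D.Double(100, 30, 15, 8));  // row 2, third column
        for (double[] band : columnBands(boxes)) {
            System.out.println(band[0] + " .. " + band[1]);
        }
    }
}
```

Running the same projection on the y-axis would give row bands, and the cross product of column and row bands gives candidate cell regions, including the empty ones.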

User · Answer

I've had decent success with parsing text files generated by the pdftotext utility (sudo apt-get install poppler-utils).

    File convertPdf() throws Exception {
        File pdf = new File("mypdf.pdf");
        String outfile = "mytxt.txt";
        String proc = "/usr/bin/pdftotext";
        ProcessBuilder pb = new ProcessBuilder(proc, "-layout", pdf.getAbsolutePath(), outfile);
        Process p = pb.start();

        p.waitFor();

        return new File(outfile);
    }
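Once the text keeps its layout, one way to recover empty cells is to slice each row at the character offsets where the header labels start. This is a rough sketch under stated assumptions, not part of pdftotext: the class LayoutColumns and its methods are hypothetical, header labels are assumed to be separated by at least two spaces, and cell values are assumed to be left-aligned under their labels (right-aligned numbers would need the offsets shifted).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for slicing layout-preserved text into table cells. */
class LayoutColumns {

    /** Start offset of each header label; labels must be separated by 2+ spaces. */
    static List<Integer> columnStarts(String headerLine) {
        List<Integer> starts = new ArrayList<>();
        int i = 0;
        while (i < headerLine.length()) {
            if (headerLine.charAt(i) == ' ') {
                i++;
                continue;
            }
            starts.add(i);
            // A label ends at the next run of two or more spaces (or at end of line).
            int gap = headerLine.indexOf("  ", i);
            i = gap < 0 ? headerLine.length() : gap;
        }
        return starts;
    }

    /** Cuts a data row at the column offsets; missing cells come back as "". */
    static List<String> cells(String row, List<Integer> starts) {
        List<String> cells = new ArrayList<>();
        for (int c = 0; c < starts.size(); c++) {
            int from = Math.min(starts.get(c), row.length());
            int to = c + 1 < starts.size()
                    ? Math.min(starts.get(c + 1), row.length())
                    : row.length();
            cells.add(row.substring(from, to).trim());
        }
        return cells;
    }

    public static void main(String[] args) {
        // Fixed-width lines built here only to imitate "-layout" output.
        String fmt = "%-6s%-8s%-9s%-7s%-17s%-12s%s";
        String header = String.format(fmt,
                "AIH", "Value", "Medium", "High", "Not applicable", "MAC/Other", "FAE");
        String row = String.format(fmt, "abc", "1.56", "", "1.56", "", "", "1.56");

        List<Integer> starts = columnStarts(header);
        System.out.println(cells(row, starts));
    }
}
```

With the sample row from the question, the empty Medium, Not applicable and MAC/Other cells stay empty instead of collapsing, so each number keeps its column.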

User · Answer

Consider using the PDFTableStripper class. The class is available as a gist on GitHub: https://gist.github.com/beldaz/8ed6e7473bd228fcee8d4a3e4525be11#file-pdftablestripper-java-L1

User · Answer

How about printing to an image and doing OCR on that? It sounds terribly ineffective, but it's practically the very purpose of PDF to make text inaccessible, so you gotta do what you gotta do.

User · Answer

You will need to devise an algorithm to extract the data in a usable format. Regardless of which PDF library you use, you will need to do this. Characters and graphics are drawn by a series of stateful drawing operations, i.e. move to this position on the screen and draw the glyph for character 'c'.

I suggest that you extend org.apache.pdfbox.pdfviewer.PDFPageDrawer and override the strokePath method. From there you can intercept the drawing operations for horizontal and vertical line segments and use that information to determine the column and row positions for your table. Then it's a simple matter of setting up text regions and determining which numbers/letters/characters are drawn in which region. Since you know the layout of the regions, you'll be able to tell which column the extracted text belongs to.

Also, the reason you may not have spaces between text that is visually separated is that very often a space character is not drawn by the PDF. Instead the text matrix is updated and a drawing command for 'move' is issued to draw the next character a "space width" apart from the last one.

Good luck.
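To make the missing-space point concrete, here is a small sketch of how a separator can be re-inserted from glyph positions alone. It does not use PDFBox: the Glyph class is a made-up stand-in for whatever positioned-character type your library provides, and the threshold of half a space width is an assumption to tune, not a rule taken from any library.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: rebuilds spaces from horizontal gaps between glyphs. */
class GapSpacing {

    /** Minimal stand-in for a positioned glyph (text, left edge, advance width). */
    static final class Glyph {
        final String text;
        final double x;
        final double width;

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /** Joins glyphs of one text line, adding a space wherever the gap is large. */
    static String join(List<Glyph> glyphs, double spaceWidth) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                // Assumed threshold: anything wider than half a space is a word break.
                if (gap > spaceWidth * 0.5) {
                    sb.append(' ');
                }
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<>();
        double[] starts = {100, 140}; // two numbers drawn 22 units apart, no space glyph
        for (double start : starts) {
            line.add(new Glyph("1", start, 5));
            line.add(new Glyph(".", start + 5, 3));
            line.add(new Glyph("5", start + 8, 5));
            line.add(new Glyph("6", start + 13, 5));
        }
        System.out.println(join(line, 5.0)); // prints "1.56 1.56"
    }
}
```

The same gap measurement, compared against column boundaries instead of a space width, is what lets you tell which column a number was drawn in.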

User · Answer

I'm not familiar with PDFBox, but you could try looking at iText. Even though the homepage says PDF generation, you can also do PDF manipulation and extraction. Have a look and see if it fits your use case.

User · Answer

    ObjectExtractor oe = new ObjectExtractor(document);
    SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm(); // Tabula algo
    Page page = oe.extract(1); // extract only the first page

    for (int y = 0; y < sea.extract(page).size(); y++) {
        System.out.println("table: " + y);
        Table table = sea.extract(page).get(y);

        for (int i = 0; i < table.getColCount(); i++) {
            for (int x = 0; x < table.getRowCount(); x++) {
                System.out.println("col:" + i + "/lin:x" + x + " >>" + table.getCell(x, i).getText());
            }
        }
    }

User · Answer

I have used many tools to extract tables from PDF files, but none of them worked for me. So I implemented my own algorithm, named traprange, to parse tabular data in PDF files.

Following are some sample PDF files and results:

Input file: sample-1.pdf, result: sample-1.html
Input file: sample-4.pdf, result: sample-4.html

Visit my project page at traprange.

User · Answer

It may be too late for my answer, but I think this is not that hard. You can extend the PDFTextStripper class and override the writePage() and processTextPosition(...) methods. In your case I assume that the column headers are always the same. That means that you know the x-coordinate of each column heading and you can compare the x-coordinate of the numbers to those of the column headings. If they are close enough (you have to test to decide how close), then you can say that that number belongs to that column.

Another approach would be to intercept the "charactersByArticle" Vector after each page is written:

    @Override
    public void writePage() throws IOException {
        super.writePage();
        final Vector<List<TextPosition>> pageText = getCharactersByArticle();
        // now you have all the characters on that page
        // to do what you want with them
    }

Knowing your columns, you can do your comparison of the x-coordinates to decide what column every number belongs to.

The reason you don't have any spaces between numbers is because you have to set the word separator string.

I hope this is useful to you or to others who might be trying similar things.
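The x-coordinate comparison itself is simple enough to sketch on its own. This is only an illustration with made-up names and numbers: ColumnMatcher, columnFor and the header positions below are hypothetical, and in practice the header x-coordinates and the tolerance would come from inspecting your own PDF.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: maps a value's x-coordinate to the nearest column header. */
class ColumnMatcher {

    /**
     * Returns the name of the header closest to x, or null when no header
     * lies within the given tolerance.
     */
    static String columnFor(double x, Map<String, Double> headerX, double tolerance) {
        String best = null;
        double bestDist = tolerance;
        for (Map.Entry<String, Double> e : headerX.entrySet()) {
            double dist = Math.abs(e.getValue() - x);
            if (dist <= bestDist) {
                best = e.getKey();
                bestDist = dist;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Assumed header positions, in PDF points from the left edge of the page.
        Map<String, Double> headerX = new LinkedHashMap<>();
        headerX.put("Medium", 200.0);
        headerX.put("High", 260.0);
        headerX.put("Not applicable", 320.0);
        headerX.put("MAC/Other", 420.0);
        headerX.put("FAE", 500.0);

        System.out.println(columnFor(262.0, headerX, 15.0)); // prints "High"
        System.out.println(columnFor(380.0, headerX, 15.0)); // prints "null"
    }
}
```

Returning null for values that match no header makes misaligned text visible instead of silently assigning it to the wrong column.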

User · Answer

I had the same problem reading a PDF file in which the data is in tabular format. After a regular parse using PDFBox, each row was extracted with a comma as the separator, losing the columnar position. To resolve this I used PDFTextStripperByArea and, using coordinates, extracted the data column by column for each row. This works provided that you have a fixed-format PDF.

    File file = new File("fileName.pdf");
    PDDocument document = PDDocument.load(file);
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    stripper.setSortByPosition( true );
    Rectangle rect1 = new Rectangle( 50, 140, 60, 20 );
    Rectangle rect2 = new Rectangle( 110, 140, 20, 20 );
    stripper.addRegion( "row1column1", rect1 );
    stripper.addRegion( "row1column2", rect2 );
    List allPages = document.getDocumentCatalog().getAllPages();
    PDPage firstPage = (PDPage) allPages.get( 2 ); // note: index 2 is the third page
    stripper.extractRegions( firstPage );
    System.out.println(stripper.getTextForRegion( "row1column1" ));
    System.out.println(stripper.getTextForRegion( "row1column2" ));

Then row 2 and so on.
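For the "row 2 and so on" part, the rectangles can be generated in a loop rather than typed out by hand. A minimal sketch, assuming every row has the same height and the rows are stacked without gaps; RegionGrid and its regions method are hypothetical names, and only java.awt.Rectangle is used so the geometry can be checked without a PDF.

```java
import java.awt.Rectangle;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: builds named cell rectangles for a fixed-format table. */
class RegionGrid {

    /**
     * colX and colWidth give each column's left edge and width; rows are
     * assumed to be rowHeight tall, starting at firstRowY.
     */
    static Map<String, Rectangle> regions(int[] colX, int[] colWidth,
                                          int firstRowY, int rowHeight, int rows) {
        Map<String, Rectangle> regions = new LinkedHashMap<>();
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < colX.length; c++) {
                String name = "row" + (r + 1) + "column" + (c + 1);
                regions.put(name,
                        new Rectangle(colX[c], firstRowY + r * rowHeight, colWidth[c], rowHeight));
            }
        }
        return regions;
    }

    public static void main(String[] args) {
        // Column geometry taken from the two rectangles in the answer above.
        Map<String, Rectangle> regions =
                regions(new int[] {50, 110}, new int[] {60, 20}, 140, 20, 3);
        regions.forEach((name, rect) -> System.out.println(name + " -> " + rect));
    }
}
```

Each entry can then be registered with stripper.addRegion(name, rect) before calling extractRegions, and read back with getTextForRegion(name).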

User · Answer

There's PDFLayoutTextStripper that was designed to keep the format of the data.

From the README:

    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;

    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    public class Test {
        public static void main(String[] args) {
            String string = null;
            try {
                PDFParser pdfParser = new PDFParser(new FileInputStream("sample.pdf"));
                pdfParser.parse();
                PDDocument pdDocument = new PDDocument(pdfParser.getDocument());
                PDFTextStripper pdfTextStripper = new PDFLayoutTextStripper();
                string = pdfTextStripper.getText(pdDocument);
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
            System.out.println(string);
        }
    }

User · Answer

To read the content of a table from a PDF file, you only have to convert the PDF file into text using any API (I have used PdfTextExtractor.getTextFromPage() of iText) and then read that text in your Java program. After reading it, the major task is done: you have to filter out the data you need. You can do that by repeatedly using the split method of the String class until you find the record of interest. Here is the code with which I extracted part of the records from a PDF file and wrote them into a .CSV file. The URL of the PDF file is: http://www.cea.nic.in/reports/monthly/generation_rep/actual/jan13/opm_02.pdf

Code:

    public static void genrateCsvMonth_Region(String pdfpath, String csvpath) {
        try {
            // Make sure the CSV file exists
            new BufferedWriter(new FileWriter(csvpath, true)).close();

            // Write the header only when the file is still empty
            boolean empty;
            try (BufferedReader br = new BufferedReader(new FileReader(csvpath))) {
                empty = br.readLine() == null;
            }
            BufferedWriter writer = new BufferedWriter(new FileWriter(csvpath, true));
            if (empty) {
                writer.append("REGION,YEAR,MONTH,THERMAL,NUCLEAR,HYDRO,TOTAL\n");
            }

            // Reading the PDF file and extracting page 1 into a String
            PdfReader reader = new PdfReader(pdfpath);
            String page = PdfTextExtractor.getTextFromPage(reader, 1);

            // Extracting month and year from the String
            // (the ":" delimiter is assumed here; adjust it to the actual page text)
            String[] period1 = page.split("PEROID");
            String[] period2 = period1[0].split(":");
            String[] month = period2[1].split("-");
            String[] period3 = month[1].split("ENERGY");
            String[] year = period3[0].split("VIS");

            // Each region's figures sit in a different chunk of the page text
            String northen = page.split("NORTHEN REGION")[0];
            String western = page.split("WESTERN")[1];
            String southern = page.split("SOUTHERN")[1];
            String eastern = page.split("EASTERN")[1];
            String northEastern = page.split("NORTH")[2];

            // Appending filtered data into the CSV file
            appendRegion(writer, "NORTHEN", year[0], month[0], northen, true);
            appendRegion(writer, "WESTERN", year[0], month[0], western, true);
            appendRegion(writer, "SOUTHERN", year[0], month[0], southern, true);
            // The eastern and north-eastern regions have no NUCLEAR figure
            appendRegion(writer, "EASTERN", year[0], month[0], eastern, false);
            appendRegion(writer, "NORTH EASTERN", year[0], month[0], northEastern, false);
            writer.close();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
    }

    // Fifth space-separated token after the label (the " " delimiter is assumed)
    private static String valueAfter(String chunk, String label) {
        return chunk.split(label)[1].split(" ")[4];
    }

    private static void appendRegion(BufferedWriter writer, String region, String year,
            String month, String chunk, boolean hasNuclear) throws IOException {
        writer.append(region + ",");
        writer.append(year + ",");
        writer.append(month + ",");
        writer.append(valueAfter(chunk, "THERMAL") + ",");
        writer.append((hasNuclear ? valueAfter(chunk, "NUCLEAR") : "") + ",");
        writer.append(valueAfter(chunk, "HYDRO") + ",");
        writer.append(valueAfter(chunk, "TOTAL") + "\n");
    }

User · Answer

You can extract text by area in PDFBox. See the ExtractByArea.java example file, in the pdfbox-examples artifact if you're using Maven. A snippet looks like:

    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    stripper.setSortByPosition( true );
    Rectangle rect = new Rectangle( 464, 59, 55, 5 );
    stripper.addRegion( "class1", rect );
    stripper.extractRegions( page );
    String string = stripper.getTextForRegion( "class1" );

The problem is getting the coordinates in the first place. I've had success extending the normal TextStripper, overriding processTextPosition(TextPosition text), printing out the coordinates for each character and figuring out where in the document they are.

But there's a much simpler way, at least if you're on a Mac. Open the PDF in Preview, press ⌘I to show the Inspector, choose the Crop tab and make sure the units are in Points, then from the Tools menu choose Rectangular Selection and select the area of interest. If you select an area, the Inspector will show you the coordinates, which you can round and feed into the Rectangle constructor arguments. You just need to confirm where the origin is, using the first method.

User · Answer

"It is not required for me to use the PDFBox library, so a solution that uses another library is fine."

Camelot and Excalibur

You may want to try Camelot, an open source library for Python. If you are not inclined to write code, you may use the web interface Excalibur, created around Camelot. You "upload" the document to a localhost web server and "download" the result from this localhost server.

Here is an example using this Python code:

    import camelot
    tables = camelot.read_pdf('foo.pdf', flavor="stream")
    tables[0].to_csv('foo.csv')

The input is a PDF containing this table:

(Figure: sample table from the PDF-TREX set)

No help is provided to Camelot; it works on its own by looking at the relative alignment of pieces of text. The result is returned in a CSV file:

(Figure: PDF table extracted from the sample by Camelot)

"Rules" can be added to help Camelot identify where the separator lines are in sophisticated tables:

(Figure: rule added in Excalibur. Source: GitHub)

Camelot: https://github.com/camelot-dev/camelot
Excalibur: https://github.com/camelot-dev/excalibur

The two projects are active. Here is a comparison with other software, with tests based on actual documents: Tabula, pdfplumber, pdftables, pdf-table-extract.

"I want is to be able to parse the file and know what each parsed number means."

You cannot do that automatically, as PDF is not semantically structured.

Book versus document

PDF "documents" are unstructured from a semantic standpoint (it's like a notepad file): the PDF document gives instructions on where to print a text fragment, unrelated to other fragments of the same section. There is no separation between the content (what to print, and whether this is a fragment of a title, a table or a footnote) and the visual representation (font, location, etc.). PDF is derived from PostScript, which describes a "Hello world" page this way:

    %!PS
    /Courier             % font
    20 selectfont        % size
    72 500 moveto        % current location to print at
    (Hello world!) show  % add text fragment
    showpage             % print all on the page

(Wikipedia)

One can imagine what a table looks like with the same instructions.

We could say HTML is not clearer, however there is a big difference: HTML describes the content semantically (title, paragraph, list, table header, table cell, ...) and associates the CSS to produce a visual form, hence the content is fully accessible. In this sense, HTML is a simplified descendant of SGML, which puts constraints in place to allow data processing:

"Markup should describe a document's structure and other attributes rather than specify the processing that needs to be performed, because it is less likely to conflict with future developments."

This is exactly the opposite of PostScript/PDF. SGML is used in publishing. PDF doesn't embed this semantic structure; it carries only the CSS equivalent, associated with plain character strings which may not be complete words or sentences. PDF is used for closed documents and now for so-called workflow management.

After having experienced the uncertainty and difficulty of trying to extract data from PDF, it's clear that PDF is not at all a solution for preserving a document's content for the future (even though Adobe has obtained a PDF standard from its peers).

What is actually preserved well is the printed representation, as PDF was fully dedicated to this aspect when created. PDFs are nearly as dead as printed books.

When reusing the content matters, one must rely again on manual re-entering of data, like from a printed book (possibly trying to do some OCR on it). This is more and more true, as many PDFs even prevent the use of copy-paste, introduce multiple spaces between words, or produce an unordered gibberish of characters when some "optimization" is done for web use.

When the content of the document, not its printed representation, is valuable, then PDF is not the correct format. Even Adobe is unable to perfectly recreate the source of a document from its PDF rendering.

So open data should never be released in PDF format: this limits their use to reading and printing (when allowed), and makes reuse harder or impossible.

User · Answer

For anyone wanting to do the same thing as OP (as I do): after days of research, Amazon Textract is the best option. If your volume is low, the free tier might be enough.

User · Answer

This works fine if the PDF file has only a rectangular table (using pdfbox 2.0.6). It won't work with any other kind of table, only rectangular tables.

    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.pdfbox.text.PDFTextStripperByArea;

    public class PDFTableExtractor {
        public static void main(String[] args) throws IOException {
            // Enter file path, start page, end page, number of columns in the rectangular table
            ArrayList<String[]> objTableList = readParaFromPDF("C:\\sample1.pdf", 1, 1, 6);
        }

        public static ArrayList<String[]> readParaFromPDF(String pdfPath, int pageNoStart,
                int pageNoEnd, int noOfColumnsInTable) {
            ArrayList<String[]> objArrayList = new ArrayList<>();
            // try-with-resources closes the document when done
            try (PDDocument document = PDDocument.load(new File(pdfPath))) {
                if (!document.isEncrypted()) {
                    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                    stripper.setSortByPosition(true);
                    PDFTextStripper tStripper = new PDFTextStripper();
                    tStripper.setStartPage(pageNoStart);
                    tStripper.setEndPage(pageNoEnd);
                    String pdfFileInText = tStripper.getText(document);
                    // split into lines
                    String[] documentLines = pdfFileInText.split("\\r?\\n");
                    for (String line : documentLines) {
                        // split by whitespace and keep only lines with the expected column count
                        String[] lineArr = line.split("\\s+");
                        if (lineArr.length == noOfColumnsInTable) {
                            for (String linedata : lineArr) {
                                System.out.print(linedata + "             ");
                            }
                            System.out.println("");
                            objArrayList.add(lineArr);
                        }
                    }
                }
            } catch (Exception e) {
                System.out.println("Exception " + e);
            }
            return objArrayList;
        }
    }

User · Answer

Try using TabulaPDF (https://github.com/tabulapdf/tabula). This is a very good library for extracting table content from PDF files. It works very much as expected. Good luck.

User · Answer

Extracting data from PDF is bound to be fraught with problems. Are the documents created through some kind of automatic process? If so, you might consider converting the PDFs to uncompressed PostScript (try pdf2ps) and seeing if the PostScript contains some sort of regular pattern which you can exploit.
