[java] Parsing PDF files (especially with tables) with PDFBox

You can use PDFBox's PDFTextStripperByArea class to extract text from a specific region of a document. You can build on this by identifying the region each cell of the table. This isn't provided out of the box, but the example DrawPrintTextLocations class demonstrates how you can parse the bounding boxes of individual characters in a document (it would be great to parse bounding boxes of strings or paragraphs, but I haven't seen support in PDFBox for this - see this question). You can use this approach to group up all touching bounding boxes to identify distinct cells of a table. One way to do this is to maintain a set boxes of Rectangle2D regions and then for each parsed character find the character's bounding box as in DrawPrintTextLocations.writeString(String string, List<TextPosition> textPositions) and merge it with the existing contents.

Rectangle2D bounds = s.getBounds2D();
// Pad sides to detect almost touching boxes
Rectangle2D hitbox = bounds.getBounds2D();
final double dx = 1.0; // This value works for me, feel free to tweak (or add setter)
final double dy = 0.000; // Rows of text tend to overlap, so no need to extend
hitbox.add(bounds.getMinX() - dx , bounds.getMinY() - dy);
hitbox.add(bounds.getMaxX() + dx , bounds.getMaxY() + dy);

// Find all overlapping boxes
List<Rectangle2D> intersectList = new ArrayList<Rectangle2D>();
for(Rectangle2D box: boxes) {
    if(box.intersects(hitbox)) {
        intersectList.add(box);
    }
}

// Combine all touching boxes and update
for(Rectangle2D box: intersectList) {
    bounds.add(box);
    boxes.remove(box);
}
boxes.add(bounds);

You can then pass these regions to PDFTextStripperByArea.

You can also go one further and separate out the horizontal and vertical components of these regions, and so infer regions of all the table's cells, regardless of whether then hold any content.

I have had cause to perform these steps, and eventually wrote my own PDFTableStripper class using PDFBox. I've shared my code as a gist on GitHub. The main method gives an example of how the class can be used:

try (PDDocument document = PDDocument.load(new File(args[0])))
{
    final double res = 72; // PDF units are at 72 DPI
    PDFTableStripper stripper = new PDFTableStripper();
    stripper.setSortByPosition(true);

    // Choose a region in which to extract a table (here a 6"wide, 9" high rectangle offset 1" from top left of page)
    stripper.setRegion(new Rectangle(
        (int) Math.round(1.0*res), 
        (int) Math.round(1*res), 
        (int) Math.round(6*res), 
        (int) Math.round(9.0*res)));

    // Repeat for each page of PDF
    for (int page = 0; page < document.getNumberOfPages(); ++page)
    {
        System.out.println("Page " + page);
        PDPage pdPage = document.getPage(page);
        stripper.extractTable(pdPage);
        for(int c=0; c<stripper.getColumns(); ++c) {
            System.out.println("Column " + c);
            for(int r=0; r<stripper.getRows(); ++r) {
                System.out.println("Row " + r);
                System.out.println(stripper.getText(r, c));
            }
        }
    }
}

Examples related to java

Under what circumstances can I call findViewById with an Options Menu / Action Bar item? How much should a function trust another function How to implement a simple scenario the OO way Two constructors How do I get some variable from another class in Java? this in equals method How to split a string in two and store it in a field How to do perspective fixing? String index out of range: 4 My eclipse won't open, i download the bundle pack it keeps saying error log

Examples related to parsing

Got a NumberFormatException while trying to parse a text file for objects Uncaught SyntaxError: Unexpected end of JSON input at JSON.parse (<anonymous>) Python/Json:Expecting property name enclosed in double quotes Correctly Parsing JSON in Swift 3 How to get response as String using retrofit without using GSON or any other library in android UIButton action in table view cell "Expected BEGIN_OBJECT but was STRING at line 1 column 1" How to convert an XML file to nice pandas dataframe? How to extract multiple JSON objects from one file? How to sum digits of an integer in java?

Examples related to pdf

ImageMagick security policy 'PDF' blocking conversion How to extract table as text from the PDF using Python? Extract a page from a pdf as a jpeg How can I read pdf in python? Generating a PDF file from React Components Extract Data from PDF and Add to Worksheet How to extract text from a PDF file? How to download PDF automatically using js? Download pdf file using jquery ajax Generate PDF from HTML using pdfMake in Angularjs

Examples related to pdfbox

How to get raw text from pdf file using java How to merge two PDF files into one in Java? Parsing PDF files (especially with tables) with PDFBox

Examples related to tabular

Parsing PDF files (especially with tables) with PDFBox LaTeX table too wide. How to make it fit? Output in a table format in Java's System.out How to center cell contents of a LaTeX table whose columns have fixed widths?