Extract / Identify Tables from PDF (Python)

Question

Are there any open source libraries that support table identification & extraction? By this I mean:

1. Identify a table structure exists
2. Classify the table from its contents
3. Extract data from the table in a useful output format, e.g. JSON / CSV etc.

I have looked through similar questions on this topic and found the following:

- PDFMiner, which addresses problem 3, but it seems the user is required to specify to PDFMiner where a table structure exists for each table (correct me if I'm wrong).
- pdf-table-extract, which attempts to address problem 1 but, according to the To-Do list, cannot currently identify tables that are separated by whitespace. This is a problem as all tables in my PDFs are separated by whitespace.

Currently, I am thinking that I would have to spend a lot of time developing a Machine Learning solution to identify table structures from PDFs. Therefore, any alternative approaches would be more than welcome!

User · Answer

I'd just like to add to the very helpful answer from Kurt Pfeifle: there is now a Python wrapper for Tabula, and this seems to work very well so far: https://github.com/chezou/tabula-py

This will convert your PDF table to a pandas DataFrame. You can also set the area in x/y coordinates, which is obviously very handy for irregular data.
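For anyone who wants to try this, here is a minimal sketch of how tabula-py might be called. It assumes tabula-py and a Java runtime are installed; `report.pdf`, `area_in_points` and `read_tables` are hypothetical names made up for illustration. As far as I know, `area` is given as (top, left, bottom, right) in PDF points, so the small helper converts from inches.

```python
POINTS_PER_INCH = 72  # PDF user-space units: 1 inch = 72 points


def area_in_points(top_in, left_in, bottom_in, right_in):
    """Convert a (top, left, bottom, right) box measured in inches to points."""
    box = (top_in, left_in, bottom_in, right_in)
    return [round(v * POINTS_PER_INCH, 2) for v in box]


def read_tables(pdf_path, pages="all", area=None):
    """Return a list of pandas DataFrames, one per table tabula extracts.

    Hypothetical wrapper: it only forwards arguments to tabula.read_pdf.
    """
    # Imported lazily so the helper above works without tabula-py installed.
    import tabula  # third-party: pip install tabula-py (needs Java)

    kwargs = {"pages": pages}
    if area is not None:
        # Restrict extraction to one region of the page (values in points).
        kwargs["area"] = area
    return tabula.read_pdf(pdf_path, **kwargs)


if __name__ == "__main__":
    # Assumed input file; replace with your own PDF.
    region = area_in_points(1, 0.5, 5, 8)
    for i, df in enumerate(read_tables("report.pdf", pages=1, area=region)):
        df.to_csv(f"table_{i}.csv", index=False)
```

If the tables have ruled lines, passing `lattice=True` to `tabula.read_pdf` may work better; for whitespace-separated tables like the ones in the question, `stream=True` is probably the one to try first.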

User · Answer

After many fruitful hours of exploring OCR libraries, bounding boxes and clustering algorithms, I found a solution so simple it makes you want to cry!

I hope you are using Linux:

    pdftotext -layout NAME_OF_PDF.pdf

AMAZING!!

Now you have a nice text file with all the information lined up in nice columns, and it is trivial to format it into a CSV etc.

It is for times like this that I love Linux: these guys came up with AMAZING solutions to everything, and put it there for FREE.
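To make the "format into a CSV" step concrete, here is a rough sketch in Python. It assumes the `pdftotext` binary (from poppler-utils) is on your PATH, and it uses a simple heuristic that is my own assumption, not part of pdftotext: treat any run of two or more spaces as a column separator. The function names are hypothetical.

```python
import csv
import io
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run `pdftotext -layout` and return its output ('-' means stdout)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def layout_text_to_rows(text, min_gap=2):
    """Split each non-empty line into cells on runs of >= min_gap spaces."""
    splitter = re.compile(r" {%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:  # skip blank lines between blocks
            rows.append(splitter.split(stripped))
    return rows


def rows_to_csv(rows):
    """Serialize rows to a CSV string."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # NAME_OF_PDF.pdf is the placeholder from the command above.
    text = pdf_to_layout_text("NAME_OF_PDF.pdf")
    print(rows_to_csv(layout_text_to_rows(text)))
```

Be aware that this heuristic shifts columns when a cell is empty, because the gap simply merges with the neighbouring whitespace. For tables with missing values you would probably need to split on fixed character offsets instead.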

User · Answer

You should definitely have a look at this answer of mine: "Extracting table contents from a collection of PDF files", and also have a look at all the links included therein.

Tabula/TabulaPDF is currently the best table extraction tool that is available for PDF scraping.
