Are there any open source libraries that support table identification & extraction?
By this I mean:
I have looked through similar questions on this topic and found the following:
Currently, I am thinking that I would have to spend a lot of time developing a Machine Learning solution to identify table structures from PDFs. Therefore, any alternative approaches would be more than welcome!
This question is related to
python
pdf
scrape
pdf-parsing
pdf-scraping
I'd just like to add to the very helpful answer from Kurt Pfeifle - there is now a Python wrapper for Tabula, and this seems to work very well so far: https://github.com/chezou/tabula-py
This will convert your PDF table to a Pandas data frame. You can also set the area in x,y co-ordinates which is obviously very handy for irregular data.
After many fruitful hours of exploring OCR libraries, bounding boxes and clustering algorithms - I found a solution so simple it makes you want to cry!
I hope you are using Linux;
pdftotext -layout NAME_OF_PDF.pdf
AMAZING!!
Now you have a nice text file with all the information lined up in nice columns, now it is trivial to format into a csv etc..
It is for times like this that I love Linux, these guys came up with AMAZING solutions to everything, and put it there for FREE!
Source: Stackoverflow.com