I have a PDF which contains Tables, text and some images. I want to extract the table wherever tables are there in the PDF.
Right now am doing manually to find the Table from the page. From there I am capturing that page and saving into another PDF.
import PyPDF2
PDFfilename = "Sammamish.pdf" #filename of your PDF/directory where your PDF is stored
pfr = PyPDF2.PdfFileReader(open(PDFfilename, "rb")) #PdfFileReader object
pg4 = pfr.getPage(126) #extract pg 127
writer = PyPDF2.PdfFileWriter() #create PdfFileWriter object
#add pages
writer.addPage(pg4)
NewPDFfilename = "allTables.pdf" #filename of your PDF/directory where you want your new PDF to be
with open(NewPDFfilename, "wb") as outputStream:
writer.write(outputStream) #write pages to new PDF
My goal is to extract the table from the whole PDF document.
This question is related to
python
pdf
pdf-parsing
If your pdf is text-based and not a scanned document (i.e. if you can click and drag to select text in your table in a PDF viewer), then you can use the module camelot-py
with
import camelot
tables = camelot.read_pdf('foo.pdf')
You then can choose how you want to save the tables (as csv, json, excel, html, sqlite), and whether the output should be compressed in a ZIP archive.
tables.export('foo.csv', f='csv', compress=False)
Edit: tabula-py
appears roughly 6 times faster than camelot-py
so that should be used instead.
import camelot
import cProfile
import pstats
import tabula
cmd_tabula = "tabula.read_pdf('table.pdf', pages='1', lattice=True)"
prof_tabula = cProfile.Profile().run(cmd_tabula)
time_tabula = pstats.Stats(prof_tabula).total_tt
cmd_camelot = "camelot.read_pdf('table.pdf', pages='1', flavor='lattice')"
prof_camelot = cProfile.Profile().run(cmd_camelot)
time_camelot = pstats.Stats(prof_camelot).total_tt
print(time_tabula, time_camelot, time_camelot/time_tabula)
gave
1.8495559890000015 11.057014036000016 5.978199147125147
This is my code for extracting pdf.
import pandas as pd
import tabula
file = "filename.pdf"
path = 'enter your directory path here' + file
df = tabula.read_pdf(path, pages = '1', multiple_tables = True)
print(df)
Please refer to this repo of mine for more details.
Source: Stackoverflow.com