[python] How to extract table as text from the PDF using Python?

If your pdf is text-based and not a scanned document (i.e. if you can click and drag to select text in your table in a PDF viewer), then you can use the module camelot-py with

import camelot
tables = camelot.read_pdf('foo.pdf')

You then can choose how you want to save the tables (as csv, json, excel, html, sqlite), and whether the output should be compressed in a ZIP archive.

tables.export('foo.csv', f='csv', compress=False)

Edit: tabula-py appears roughly 6 times faster than camelot-py so that should be used instead.

import camelot
import cProfile
import pstats
import tabula

cmd_tabula = "tabula.read_pdf('table.pdf', pages='1', lattice=True)"
prof_tabula = cProfile.Profile().run(cmd_tabula)
time_tabula = pstats.Stats(prof_tabula).total_tt

cmd_camelot = "camelot.read_pdf('table.pdf', pages='1', flavor='lattice')"
prof_camelot = cProfile.Profile().run(cmd_camelot)
time_camelot = pstats.Stats(prof_camelot).total_tt

print(time_tabula, time_camelot, time_camelot/time_tabula)

gave

1.8495559890000015 11.057014036000016 5.978199147125147