Parse HTML table to Python list

Question

I'd like to take an HTML table and parse through it to get a list of dictionaries. Each list element would be a dictionary corresponding to a row in the table. If, for example, I had an HTML table with three columns (marked by header tags), "Event", "Start Date", and "End Date", and that table had 5 entries, I would like to parse through that table to get back a list of length 5 where each element is a dictionary with keys "Event", "Start Date", and "End Date". Thanks for the help!

User · Accepted Answer

You should use some HTML parsing library like lxml:

from lxml import etree
s = """<table>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
  <tr><td>g</td><td>h</td><td>i</td></tr>
</table>
"""
table = etree.HTML(s).find("body/table")
rows = iter(table)
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))

prints (on Python 3.7+, where dictionaries keep insertion order)

{'Event': 'a', 'Start Date': 'b', 'End Date': 'c'}
{'Event': 'd', 'Start Date': 'e', 'End Date': 'f'}
{'Event': 'g', 'Start Date': 'h', 'End Date': 'i'}

User · Answer

Sven Marnach's excellent solution is directly translatable into ElementTree, which is part of recent Python distributions:

from xml.etree import ElementTree as ET
s = """<table>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
  <tr><td>g</td><td>h</td><td>i</td></tr>
</table>
"""
table = ET.XML(s)
rows = iter(table)
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))

(same output as Sven Marnach's answer)
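Since the question asks for a list of dictionaries rather than printed rows, here is a minimal sketch that collects the rows into a list using the same ElementTree approach. The function name table_to_dicts is made up for illustration. It assumes the markup is well-formed XML, that the table is the root element, and that the rows are direct children of it (no tbody wrapper).

```python
from xml.etree import ElementTree as ET


def table_to_dicts(markup):
    """Return one dict per data row, keyed by the text of the header cells.

    Hypothetical helper: assumes <table> is the root element and the
    first <tr> holds the headers.
    """
    table = ET.XML(markup)
    rows = iter(table)
    headers = [col.text for col in next(rows)]
    return [dict(zip(headers, (col.text for col in row))) for row in rows]


s = """<table>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
  <tr><td>g</td><td>h</td><td>i</td></tr>
</table>
"""

events = table_to_dicts(s)
print(len(events))              # 3
print(events[0]["Start Date"])  # b
```

If a cell contains nested tags, col.text only returns the text before the first child element, so "".join(col.itertext()) may be a better fit in that case.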

User · Answer

If the HTML is not XML, you can't do it with etree. But even then, you don't have to use an external library for parsing an HTML table. In Python 3 you can reach your goal with HTMLParser from html.parser. I have the code of the simple derived HTMLParser class here in a GitHub repo.

You can use that class (here named HTMLTableParser) the following way:

import urllib.request
from html_table_parser import HTMLTableParser

target = 'http://www.twitter.com/'

# get website content
req = urllib.request.Request(url=target)
f = urllib.request.urlopen(req)
xhtml = f.read().decode('utf-8')

# instantiate the parser and feed it
p = HTMLTableParser()
p.feed(xhtml)
print(p.tables)

The output of this is a list of 2D-lists representing tables. It looks maybe like this:

[[['', 'Anmelden', '']],
 [['Land', 'Code', 'Für Kunden von'],
  ['Vereinigte Staaten', '40404', '(beliebig)'],
  ['Kanada', '21212', '(beliebig)'],
  ...
  ['3424486444', 'Vodafone'],
  ['Zeige SMS-Kurzwahlen für andere Länder']]]
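To give an idea of what such a derived parser can look like, here is a rough sketch built only on html.parser. It is not the class from the repo mentioned above: SimpleTableParser and its attributes are invented for this example, and it assumes flat (non-nested) tables whose tr, td and th tags are explicitly closed.

```python
from html.parser import HTMLParser


class SimpleTableParser(HTMLParser):
    """Hypothetical minimal table collector (not the repo's HTMLTableParser)."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows, each row a list of cell strings
        self._row = None   # cells of the row currently being read, or None
        self._cell = None  # text fragments of the cell currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


# Valid HTML but not XML: unquoted attribute and an unclosed <br>.
page = """<table border=1>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e<br></td><td>f</td></tr>
</table>"""

parser = SimpleTableParser()
parser.feed(page)

# Treat the first row as headers and turn the rest into dictionaries.
header, *body = parser.tables[0]
records = [dict(zip(header, row)) for row in body]
print(records)
```

The last two lines show one way to get from the 2D-list form to the list of dictionaries the question asks for, taking the first row of the table as the header row.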

User · Answer

Hands down the easiest way to parse an HTML table is to use pandas.read_html() - it accepts both URLs and HTML.

import pandas as pd
url = r'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
tables = pd.read_html(url)  # Returns list of all tables on page
sp500_table = tables[0]  # Select table of interest

Only downside is that read_html() doesn't preserve hyperlinks.
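To get from the DataFrame to the list of dictionaries the question asks for, DataFrame.to_dict with orient="records" should do it. The sketch below feeds the small example table from the accepted answer through StringIO instead of a URL. It assumes pandas is installed together with a parser backend that read_html can use (lxml, or BeautifulSoup with html5lib).

```python
from io import StringIO

import pandas as pd

html = """<table>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
  <tr><td>g</td><td>h</td><td>i</td></tr>
</table>"""

# read_html returns a list of DataFrames, one per <table> it finds.
tables = pd.read_html(StringIO(html))

# One dictionary per row, keyed by the column headers.
records = tables[0].to_dict(orient="records")
print(records)
```

Note that pandas infers column types, so cells that look like numbers or contain nothing come back as ints, floats or NaN rather than strings.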
