[python] How to convert an XML file to a nice pandas DataFrame?

Let's assume that I have an XML like this:

<author type="XXX" language="EN" gender="xx" feature="xx" web="foobar.com">
    <documents count="N">
        <document KEY="e95a9a6c790ecb95e46cf15bee517651" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
        <document KEY="bc360cfbafc39970587547215162f0db" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
        <document KEY="19e71144c50a8b9160b3f0955e906fce" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
        <document KEY="21d4af9021a174f61b884606c74d9e42" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
        <document KEY="28a45eb2460899763d709ca00ddbb665" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
        <document KEY="a0c0712a6a351f85d9f5757e9fff8946" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
        <document KEY="626726ba8d34d15d02b6d043c55fe691" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
        <document KEY="2cb473e0f102e2e4a40aa3006e412ae4" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] [...]

I would like to read this XML file and convert it to a pandas DataFrame:

key                                         type     language    feature            web                         data
e95324a9a6c790ecb95e46cf15bE232ee517651      XXX        EN          xx      www.foo_bar_exmaple.com     A large text with lots of strings and punctuations symbols [...]
e95324a9a6c790ecb95e46cf15bE232ee517651      XXX        EN          xx      www.foo_bar_exmaple.com     A large text with lots of strings and punctuations symbols [...]
19e71144c50a8b9160b3cvdf2324f0955e906fce     XXX        EN          xx      www.foo_bar_exmaple.com     A large text with lots of strings and punctuations symbols [...]
21d4af9021a174f61b8erf284606c74d9e42         XXX        EN          xx      www.foo_bar_exmaple.com     A large text with lots of strings and punctuations symbols [...]
28a45eb2460823499763d70vdf9ca00ddbb665       XXX        EN          xx      www.foo_bar_exmaple.com     A large text with lots of strings and punctuations symbols [...]

This is what I have tried so far, but I am getting errors, and there is probably a more efficient way of doing this:

from lxml import objectify
import pandas as pd

path = 'file_path'
xml = objectify.parse(open(path))
root = xml.getroot()
df = pd.DataFrame(columns=('key','type', 'language', 'feature', 'web', 'data'))

for i in range(0,len(xml)):
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['key','type', 'language', 'feature', 'web', 'data'], [obj[0].text, obj[1].text]))
    row_s = pd.Series(row)
    row_s.name = i
    df = df.append(row_s)

Could anybody suggest a better approach to this problem?

This question is related to python xml python-2.7 parsing pandas

The answer is

Chiming in to recommend the xmltodict library. It handled your XML text well, and I have used it to ingest an XML file with almost a million records.
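As a rough illustration of the xmltodict route, here is a minimal sketch. The XML below is a shortened, made-up stand-in for the file in the question (the keys, URLs and texts are placeholders), and it assumes the third-party xmltodict package is installed. The column choice follows the table in the question.

```python
import pandas as pd
import xmltodict

# Shortened, hypothetical stand-in for the XML in the question.
xml_text = """<author type="XXX" language="EN" gender="xx" feature="xx" web="foobar.com">
    <documents count="2">
        <document KEY="k1" web="www.example.com"><![CDATA[First text]]></document>
        <document KEY="k2" web="www.example.com"><![CDATA[Second text]]></document>
    </documents>
</author>"""

# force_list keeps "document" a list even if the file holds a single document.
parsed = xmltodict.parse(xml_text, force_list=("document",))
author = parsed["author"]

# xmltodict prefixes attribute names with "@" and stores element text under "#text".
rows = [
    {
        "key": doc["@KEY"],
        "type": author["@type"],
        "language": author["@language"],
        "feature": author["@feature"],
        "web": doc["@web"],
        "data": doc.get("#text"),
    }
    for doc in author["documents"]["document"]
]

df = pd.DataFrame(rows)
```

Building the list of row dictionaries first and calling pd.DataFrame once avoids growing the frame inside the loop.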

You can also convert by creating a dictionary of elements and then directly converting to a data frame:

import xml.etree.ElementTree as ET
import pandas as pd

# Contents of test.xml:
# <?xml version="1.0" encoding="utf-8"?>
# <tags>
#   <row Id="1" TagName="bayesian" Count="4699" ExcerptPostId="20258" WikiPostId="20257" />
#   <row Id="2" TagName="prior" Count="598" ExcerptPostId="62158" WikiPostId="62157" />
#   <row Id="3" TagName="elicitation" Count="10" />
#   <row Id="5" TagName="open-source" Count="16" />
# </tags>

root = ET.parse('test.xml').getroot()

tags = {"tags":[]}
for elem in root:
    tag = {}
    tag["Id"] = elem.attrib['Id']
    tag["TagName"] = elem.attrib['TagName']
    tag["Count"] = elem.attrib['Count']
    tags["tags"].append(tag)

df_users = pd.DataFrame(tags["tags"])
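The same list-of-dictionaries idea can be applied to the structure in the question, where some columns come from the attributes of the root author element and others from each document element. This is only a sketch: the XML is a shortened, made-up stand-in for the real file, and the column names are taken from the table in the question.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Shortened, hypothetical stand-in for the XML in the question.
xml_text = """<author type="XXX" language="EN" gender="xx" feature="xx" web="foobar.com">
    <documents count="2">
        <document KEY="k1" web="www.example.com"><![CDATA[First text]]></document>
        <document KEY="k2" web="www.example.com"><![CDATA[Second text]]></document>
    </documents>
</author>"""

# For a file on disk, ET.parse(path).getroot() returns the same kind of element.
root = ET.fromstring(xml_text)

# Attributes on <author> are shared by every row.
shared = {name: root.get(name) for name in ("type", "language", "feature")}

rows = []
for doc in root.iter("document"):
    row = {"key": doc.get("KEY")}
    row.update(shared)
    row["web"] = doc.get("web")
    row["data"] = doc.text  # CDATA content is exposed as ordinary element text
    rows.append(row)

columns = ["key", "type", "language", "feature", "web", "data"]
df = pd.DataFrame(rows, columns=columns)
```

Using elem.get(name) rather than elem.attrib[name] yields None instead of raising KeyError when an attribute is missing on some elements.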

Here is another way of converting XML to a pandas DataFrame. In this example I parse the XML from a string, but the same logic works when reading from a file.

import pandas as pd
import xml.etree.ElementTree as ET

xml_str = '<?xml version="1.0" encoding="utf-8"?>\n<response>\n <head>\n  <code>\n   200\n  </code>\n </head>\n <body>\n  <data id="0" name="All Categories" t="2018052600" tg="1" type="category"/>\n  <data id="13" name="RealEstate.com.au [H]" t="2018052600" tg="1" type="publication"/>\n </body>\n</response>'

etree = ET.fromstring(xml_str)
dfcols = ['id', 'name']

# Collect one row per <data> element and build the DataFrame once;
# DataFrame.append was removed in pandas 2.0.
rows = [[i.get('id'), i.get('name')] for i in etree.iter(tag='data')]
df = pd.DataFrame(rows, columns=dfcols)
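
To illustrate the remark that this also works when reading from a file, here is a minimal sketch. It writes a trimmed copy of the sample XML to a temporary file purely so the example is self-contained; the temporary file and the names file_df and path are assumptions for illustration, not part of the original answer.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

import pandas as pd

# Trimmed copy of the sample response used above.
xml_text = (
    '<response><body>'
    '<data id="0" name="All Categories"/>'
    '<data id="13" name="RealEstate.com.au [H]"/>'
    '</body></response>'
)

# Write the sample to a temporary file so there is something to read back.
with tempfile.NamedTemporaryFile(
    "w", suffix=".xml", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(xml_text)
    path = tmp.name

try:
    # ET.parse reads from a path (or file object) instead of a string.
    root = ET.parse(path).getroot()
    rows = [{"id": e.get("id"), "name": e.get("name")} for e in root.iter("data")]
finally:
    os.remove(path)

file_df = pd.DataFrame(rows, columns=["id", "name"])
```

Attribute values come back as strings, so a call such as file_df["id"].astype(int) is needed if numeric columns are wanted.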


