[python] Parsing XML with namespace in Python via 'ElementTree'

I have the following XML which I want to parse using Python's ElementTree:

<rdf:RDF xml:base="http://dbpedia.org/ontology/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns="http://dbpedia.org/ontology/">

    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
        <rdfs:label xml:lang="en">basketball league</rdfs:label>
        <rdfs:comment xml:lang="en">
          a group of sports teams that compete against each other
          in Basketball
        </rdfs:comment>
    </owl:Class>

</rdf:RDF>

I want to find all owl:Class tags and then extract the value of all rdfs:label instances inside them. I am using the following code:

tree = ET.parse("filename")
root = tree.getroot()
root.findall('owl:Class')

Because of the namespace, I am getting the following error.

SyntaxError: prefix 'owl' not found in prefix map

I tried reading the document at http://effbot.org/zone/element-namespaces.htm but I am still not able to get this working since the above XML has multiple nested namespaces.

Kindly let me know how to change the code to find all the owl:Class tags.

This question is related to python xml xml-parsing xml-namespaces elementtree

The answer is


My solution is based on @Martijn Pieters' comment:

register_namespace only influences serialisation, not search.

So the trick here is to use different dictionaries for serialization and for searching.

namespaces = {
    '': 'http://www.example.com/default-schema',
    'spec': 'http://www.example.com/specialized-schema',
}

Now, register all namespaces for parsing and writing:

for name, value in namespaces.iteritems():
    ET.register_namespace(name, value)

For searching (find(), findall(), iterfind()) we need a non-empty prefix. Pass these functions a modified dictionary (here I modify the original dictionary, but this must be made only after the namespaces are registered).

self.namespaces['default'] = self.namespaces['']

Now, the functions from the find() family can be used with the default prefix:

print root.find('default:myelem', namespaces)

but

tree.write(destination)

does not use any prefixes for elements in the default namespace.


Here's how to do this with lxml without having to hard-code the namespaces or scan the text for them (as Martijn Pieters mentions):

from lxml import etree
tree = etree.parse("filename")
root = tree.getroot()
root.findall('owl:Class', root.nsmap)

UPDATE:

5 years later I'm still running into variations of this issue. lxml helps as I showed above, but not in every case. The commenters may have a valid point regarding this technique when it comes merging documents, but I think most people are having difficulty simply searching documents.

Here's another case and how I handled it:

<?xml version="1.0" ?><Tag1 xmlns="http://www.mynamespace.com/prefix">
<Tag2>content</Tag2></Tag1>

xmlns without a prefix means that unprefixed tags get this default namespace. This means when you search for Tag2, you need to include the namespace to find it. However, lxml creates an nsmap entry with None as the key, and I couldn't find a way to search for it. So, I created a new namespace dictionary like this

namespaces = {}
# response uses a default namespace, and tags don't mention it
# create a new ns map using an identifier of our choice
for k,v in root.nsmap.iteritems():
    if not k:
        namespaces['myprefix'] = v
e = root.find('myprefix:Tag2', namespaces)

I've been using similar code to this and have found it's always worth reading the documentation... as usual!

findall() will only find elements which are direct children of the current tag. So, not really ALL.

It might be worth your while trying to get your code working with the following, especially if you're dealing with big and complex xml files so that that sub-sub-elements (etc.) are also included. If you know yourself where elements are in your xml, then I suppose it'll be fine! Just thought this was worth remembering.

root.iter()

ref: https://docs.python.org/3/library/xml.etree.elementtree.html#finding-interesting-elements "Element.findall() finds only elements with a tag which are direct children of the current element. Element.find() finds the first child with a particular tag, and Element.text accesses the element’s text content. Element.get() accesses the element’s attributes:"


To get the namespace in its namespace format, e.g. {myNameSpace}, you can do the following:

root = tree.getroot()
ns = re.match(r'{.*}', root.tag).group(0)

This way, you can use it later on in your code to find nodes, e.g using string interpolation (Python 3).

link = root.find(f"{ns}link")

Note: This is an answer useful for Python's ElementTree standard library without using hardcoded namespaces.

To extract namespace's prefixes and URI from XML data you can use ElementTree.iterparse function, parsing only namespace start events (start-ns):

>>> from io import StringIO
>>> from xml.etree import ElementTree
>>> my_schema = u'''<rdf:RDF xml:base="http://dbpedia.org/ontology/"
...     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
...     xmlns:owl="http://www.w3.org/2002/07/owl#"
...     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
...     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
...     xmlns="http://dbpedia.org/ontology/">
... 
...     <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
...         <rdfs:label xml:lang="en">basketball league</rdfs:label>
...         <rdfs:comment xml:lang="en">
...           a group of sports teams that compete against each other
...           in Basketball
...         </rdfs:comment>
...     </owl:Class>
... 
... </rdf:RDF>'''
>>> my_namespaces = dict([
...     node for _, node in ElementTree.iterparse(
...         StringIO(my_schema), events=['start-ns']
...     )
... ])
>>> from pprint import pprint
>>> pprint(my_namespaces)
{'': 'http://dbpedia.org/ontology/',
 'owl': 'http://www.w3.org/2002/07/owl#',
 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
 'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
 'xsd': 'http://www.w3.org/2001/XMLSchema#'}

Then the dictionary can be passed as argument to the search functions:

root.findall('owl:Class', my_namespaces)

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to xml

strange error in my Animation Drawable How do I POST XML data to a webservice with Postman? PHP XML Extension: Not installed How to add a Hint in spinner in XML Generating Request/Response XML from a WSDL Manifest Merger failed with multiple errors in Android Studio How to set menu to Toolbar in Android How to add colored border on cardview? Android: ScrollView vs NestedScrollView WARNING: Exception encountered during context initialization - cancelling refresh attempt

Examples related to xml-parsing

How to create XML file with specific structure in Java jQuery xml error ' No 'Access-Control-Allow-Origin' header is present on the requested resource.' SyntaxError of Non-ASCII character xml.LoadData - Data at the root level is invalid. Line 1, position 1 XML Error: Extra content at the end of the document How to use sed to extract substring How to fix Invalid byte 1 of 1-byte UTF-8 sequence The best node module for XML parsing Parsing XML with namespace in Python via 'ElementTree' oracle plsql: how to parse XML and insert into table

Examples related to xml-namespaces

Parsing XML with namespace in Python via 'ElementTree' cvc-elt.1: Cannot find the declaration of element 'MyElement' Why this line xmlns:android="http://schemas.android.com/apk/res/android" must be the first in the layout xml file? What does elementFormDefault do in XSD? What does "xmlns" in XML mean? How to serialize an object to XML without getting xmlns="..."?

Examples related to elementtree

Use xml.etree.ElementTree to print nicely formatted xml files Convert Python ElementTree to string Parsing XML with namespace in Python via 'ElementTree' ParseError: not well-formed (invalid token) using cElementTree Parsing XML in Python using ElementTree example