Parsing XML with namespace in Python via ElementTree

Question

I have the following XML which I want to parse using Python's ElementTree:

    <rdf:RDF xml:base="http://dbpedia.org/ontology/"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:owl="http://www.w3.org/2002/07/owl#"
        xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
        xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
        xmlns="http://dbpedia.org/ontology/">

        <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
            <rdfs:label xml:lang="en">basketball league</rdfs:label>
            <rdfs:comment xml:lang="en">
              a group of sports teams that compete against each other
              in Basketball
            </rdfs:comment>
        </owl:Class>

    </rdf:RDF>

I want to find all owl:Class tags and then extract the value of all rdfs:label instances inside them. I am using the following code:

    tree = ET.parse("filename")
    root = tree.getroot()
    root.findall('owl:Class')

Because of the namespace, I am getting the following error:

    SyntaxError: prefix 'owl' not found in prefix map

I tried reading the document at http://effbot.org/zone/element-namespaces.htm but I am still not able to get this working, since the above XML has multiple nested namespaces.

Kindly let me know how to change the code to find all the owl:Class tags.

User · Answer

I've been using similar code and have found it's always worth reading the documentation, as usual! findall() will only find elements which are direct children of the current tag. So, not really ALL.

It might be worth your while getting your code working with the following, especially if you're dealing with big and complex XML files, so that sub-sub-elements (etc.) are also included. If you already know where the elements are in your XML, then I suppose findall() will be fine. Just thought this was worth remembering.

    root.iter()

Ref: https://docs.python.org/3/library/xml.etree.elementtree.html#finding-interesting-elements

"Element.findall() finds only elements with a tag which are direct children of the current element. Element.find() finds the first child with a particular tag, and Element.text accesses the element's text content. Element.get() accesses the element's attributes."
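A minimal sketch of the difference, assuming a trimmed copy of the question's XML with an invented `<wrapper>` element so that owl:Class is no longer a direct child of the root. Tags are spelled in the fully qualified `{uri}name` form, so no prefix map is needed here.

```python
import xml.etree.ElementTree as ET

OWL = "{http://www.w3.org/2002/07/owl#}"
RDFS = "{http://www.w3.org/2000/01/rdf-schema#}"

# Hypothetical document: <wrapper> is invented to push owl:Class one level down.
XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <wrapper>
    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
      <rdfs:label xml:lang="en">basketball league</rdfs:label>
    </owl:Class>
  </wrapper>
</rdf:RDF>"""

root = ET.fromstring(XML)

# findall() with a bare tag only inspects direct children of root.
direct = root.findall(OWL + "Class")

# iter() walks the whole subtree, so nested elements are found too.
nested = list(root.iter(OWL + "Class"))

# Collect the text of every rdfs:label below each class found.
labels = [label.text for cls in nested for label in cls.iter(RDFS + "label")]
```

If you prefer to stay with findall(), a path that starts with `.//` (for example `root.findall(".//" + OWL + "Class")`) also searches all descendants.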

User · Answer

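To get the namespace in its namespace format, e.g. {myNameSpace}, you can do the following:

    import re

    root = tree.getroot()
    ns = re.match(r'\{.*\}', root.tag).group(0)

This way, you can use it later on in your code to find nodes, e.g. using string interpolation (f-strings, Python 3.6+):

    link = root.find(f"{ns}link")

Note that re.match() returns None when the root tag has no namespace, so the .group(0) call will fail on such documents.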
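Here is a small self-contained sketch of that idea. The Atom-like document and the helper name `namespace_of` are made up for illustration; the helper also covers tags that carry no namespace at all.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical feed-like document; the namespace URI is only an example.
XML = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="http://example.com/"/>
</feed>"""

root = ET.fromstring(XML)


def namespace_of(element):
    """Return the '{uri}' part of element.tag, or '' if the tag has no namespace."""
    match = re.match(r"\{.*\}", element.tag)
    return match.group(0) if match else ""


ns = namespace_of(root)

# Reuse the extracted '{uri}' prefix to build a fully qualified tag name.
link = root.find(f"{ns}link")
```

This only helps when the elements you are looking for live in the same namespace as the root element; for documents that mix several namespaces, like the one in the question, you still need one `{uri}` per namespace.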

User · Answer

ElementTree is not too smart about namespaces. You need to give the .find(), findall() and iterfind() methods an explicit namespace dictionary. This is not documented very well:

    namespaces = {'owl': 'http://www.w3.org/2002/07/owl#'}  # add more as needed

    root.findall('owl:Class', namespaces)

Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the owl: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.w3.org/2002/07/owl#}Class instead. You can use the same syntax yourself too, of course:

    root.findall('{http://www.w3.org/2002/07/owl#}Class')

If you can switch to the lxml library things are better; that library supports the same ElementTree API, but collects namespaces for you in a .nsmap attribute on elements.
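Putting this together for the question's goal (all rdfs:label values inside each owl:Class), a sketch could look like the following. The XML is a shortened copy of the document in the question, and the variable name `labels_by_class` is just an illustrative choice.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question.
XML = """<rdf:RDF xml:base="http://dbpedia.org/ontology/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns="http://dbpedia.org/ontology/">
  <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
    <rdfs:label xml:lang="en">basketball league</rdfs:label>
  </owl:Class>
</rdf:RDF>"""

# The prefixes on the left are our own choice; only the URIs must match the document.
namespaces = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

root = ET.fromstring(XML)

labels_by_class = {}
for cls in root.findall("owl:Class", namespaces):
    # get() takes no prefix map, so the attribute name is written as '{uri}name'.
    about = cls.get("{%s}about" % namespaces["rdf"])
    labels_by_class[about] = [
        label.text for label in cls.findall("rdfs:label", namespaces)
    ]
```

The same dictionary is passed to every find-style call, so it is usually easiest to define it once near the top of the module.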

User · Answer

Note: This is an answer useful for Python's ElementTree standard library without using hardcoded namespaces.

To extract namespace prefixes and URIs from XML data you can use the ElementTree.iterparse function, parsing only namespace start events (start-ns):

    >>> from io import StringIO
    >>> from xml.etree import ElementTree
    >>> my_schema = u'''<rdf:RDF xml:base="http://dbpedia.org/ontology/"
    ...     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    ...     xmlns:owl="http://www.w3.org/2002/07/owl#"
    ...     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    ...     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    ...     xmlns="http://dbpedia.org/ontology/">
    ...
    ...     <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
    ...         <rdfs:label xml:lang="en">basketball league</rdfs:label>
    ...         <rdfs:comment xml:lang="en">
    ...           a group of sports teams that compete against each other
    ...           in Basketball
    ...         </rdfs:comment>
    ...     </owl:Class>
    ...
    ... </rdf:RDF>'''
    >>> my_namespaces = dict([
    ...     node for _, node in ElementTree.iterparse(
    ...         StringIO(my_schema), events=['start-ns']
    ...     )
    ... ])
    >>> from pprint import pprint
    >>> pprint(my_namespaces)
    {'': 'http://dbpedia.org/ontology/',
     'owl': 'http://www.w3.org/2002/07/owl#',
     'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
     'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
     'xsd': 'http://www.w3.org/2001/XMLSchema#'}

Then the dictionary can be passed as an argument to the search functions:

    root.findall('owl:Class', my_namespaces)
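If you would rather not read the input twice, one possible variation is to ask iterparse for both start and start-ns events and keep the first element as the root. The function name `parse_with_namespaces` is my own, not part of ElementTree, and this sketch assumes no prefix is redeclared with a different URI deeper in the document (a later declaration would overwrite an earlier one in the dictionary).

```python
from io import StringIO
from xml.etree import ElementTree


def parse_with_namespaces(source):
    """Parse source in one pass and return (root, {prefix: uri})."""
    namespaces = {}
    root = None
    for event, node in ElementTree.iterparse(source, events=("start", "start-ns")):
        if event == "start-ns":
            # For start-ns events, node is a (prefix, uri) tuple.
            prefix, uri = node
            namespaces[prefix] = uri
        elif root is None:
            # The first 'start' event belongs to the root element.
            root = node
    return root, namespaces


# Shortened copy of the XML from the question.
XML = """<rdf:RDF xml:base="http://dbpedia.org/ontology/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns="http://dbpedia.org/ontology/">
  <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
    <rdfs:label xml:lang="en">basketball league</rdfs:label>
  </owl:Class>
</rdf:RDF>"""

root, ns = parse_with_namespaces(StringIO(XML))
classes = root.findall("owl:Class", ns)
```

By the time the loop finishes, the whole tree has been built, so the returned root can be searched like one obtained from ET.parse().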

User · Answer

My solution is based on @Martijn Pieters' comment:

"register_namespace only influences serialisation, not search."

So the trick here is to use different dictionaries for serialization and for searching.

    namespaces = {
        '': 'http://www.example.com/default-schema',
        'spec': 'http://www.example.com/specialized-schema',
    }

Now, register all namespaces for parsing and writing (use iteritems() instead of items() on Python 2):

    for name, value in namespaces.items():
        ET.register_namespace(name, value)

For searching (find(), findall(), iterfind()) we need a non-empty prefix. Pass these functions a modified dictionary (here I modify the original dictionary, but this must be done only after the namespaces are registered):

    namespaces['default'] = namespaces['']

Now, the functions from the find() family can be used with the default prefix:

    print(root.find('default:myelem', namespaces))

but

    tree.write(destination)

does not use any prefixes for elements in the default namespace.
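A runnable sketch of this two-dictionary approach follows. The document, the element names `myelem` and `other`, and the variable `search_ns` are invented for illustration; unlike the snippet above, it copies the dictionary instead of modifying it in place, so the registration mapping stays untouched.

```python
import xml.etree.ElementTree as ET

# Mapping used for serialization: '' stands for the default namespace.
namespaces = {
    "": "http://www.example.com/default-schema",
    "spec": "http://www.example.com/specialized-schema",
}

# register_namespace() only affects how elements are written out.
for prefix, uri in namespaces.items():
    ET.register_namespace(prefix, uri)

# Separate mapping for searching, with a non-empty stand-in prefix.
search_ns = dict(namespaces)
search_ns["default"] = search_ns.pop("")

# Hypothetical document using both namespaces.
XML = """<root xmlns="http://www.example.com/default-schema"
      xmlns:spec="http://www.example.com/specialized-schema">
  <myelem>value</myelem>
  <spec:other>x</spec:other>
</root>"""

root = ET.fromstring(XML)

# Searching uses the stand-in prefix from search_ns.
elem = root.find("default:myelem", search_ns)

# Serializing uses the registered prefixes; default-namespace tags stay unprefixed.
serialized = ET.tostring(root, encoding="unicode")
```

Keep in mind that register_namespace() changes module-wide state, so the registered prefixes apply to every tree serialized afterwards in the same process.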

User · Answer

Here's how to do this with lxml without having to hard-code the namespaces or scan the text for them (as Martijn Pieters mentions):

    from lxml import etree

    tree = etree.parse("filename")
    root = tree.getroot()
    root.findall('owl:Class', root.nsmap)

UPDATE:

5 years later I'm still running into variations of this issue. lxml helps as I showed above, but not in every case. The commenters may have a valid point regarding this technique when it comes to merging documents, but I think most people are having difficulty simply searching documents.

Here's another case and how I handled it:

    <?xml version="1.0" ?><Tag1 xmlns="http://www.mynamespace.com/prefix">
    <Tag2>content</Tag2></Tag1>

xmlns without a prefix means that unprefixed tags get this default namespace. This means when you search for Tag2, you need to include the namespace to find it. However, lxml creates an nsmap entry with None as the key, and I couldn't find a way to search for it. So, I created a new namespace dictionary like this (use iteritems() instead of items() on Python 2):

    namespaces = {}
    # response uses a default namespace, and tags don't mention it
    # create a new ns map using an identifier of our choice
    for k, v in root.nsmap.items():
        if not k:
            namespaces['myprefix'] = v
    e = root.find('myprefix:Tag2', namespaces)
