How to parse XML and count instances of a particular node attribute?

Question

I have many rows in a database that contain XML, and I'm trying to write a Python script to count instances of a particular node attribute.

My tree looks like:

    <foo>
       <bar>
          <type foobar="1"/>
          <type foobar="2"/>
       </bar>
    </foo>

How can I access the attributes "1" and "2" in the XML using Python?

User · Answer

There are many options out there. cElementTree looks excellent if speed and memory usage are an issue. It has very little overhead compared to simply reading in the file using readlines.

The relevant metrics can be found in the table below, copied from the cElementTree website:

    library                         time    space
    xml.dom.minidom (Python 2.1)    6.3 s   80000K
    gnosis.objectify                2.0 s   22000k
    xml.dom.minidom (Python 2.4)    1.4 s   53000k
    ElementTree 1.2                 1.6 s   14500k
    ElementTree 1.2.4/1.3           1.1 s   14500k
    cDomlette (C extension)         0.540 s 20500k
    PyRXPU (C extension)            0.175 s 10850k
    libxml2 (C extension)           0.098 s 16000k
    readlines (read as utf-8)       0.093 s 8850k
    cElementTree (C extension)  --> 0.047 s 4900K <--
    readlines (read as ascii)       0.032 s 5050k

As pointed out by @jfs, cElementTree comes bundled with Python:

- Python 2: from xml.etree import cElementTree as ElementTree
- Python 3: from xml.etree import ElementTree (the accelerated C version is used automatically)
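If memory is the main concern, one option is to stream the document instead of building the whole tree. The following is a minimal sketch using the standard library's iterparse; the helper name count_attribute_streaming and the in-memory sample are assumptions made for illustration, not part of the answer above.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def count_attribute_streaming(source, tag, attribute):
    """Count values of `attribute` on `tag` elements while parsing incrementally."""
    counts = Counter()
    # "end" events fire once an element has been read completely.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag and attribute in elem.attrib:
            counts[elem.get(attribute)] += 1
        # Discard the element's children, text and attributes once handled.
        elem.clear()
    return counts

sample = b"""<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

streaming_counts = count_attribute_streaming(io.BytesIO(sample), "type", "foobar")
print(streaming_counts)
```

The source argument can also be a file name, so the same function should work on large files on disk.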

User · Answer

Python has an interface to the expat XML parser: xml.parsers.expat.

It's a non-validating parser, so it will not check a document against a DTD or schema (it still raises an error on XML that is not well-formed). But if you know your file is correct, then this is pretty good, and you'll probably get the exact info you want and you can discard the rest on the fly.

    from xml.parsers import expat

    stringofxml = """<foo>
        <bar>
            <type arg="value" />
            <type arg="value" />
            <type arg="value" />
        </bar>
        <bar>
            <type arg="value" />
        </bar>
    </foo>"""

    count = 0

    def start(name, attr):
        # Called once for every opening tag
        global count
        if name == 'type':
            count += 1

    p = expat.ParserCreate()
    p.StartElementHandler = start
    p.Parse(stringofxml)

    print(count)  # prints 4
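The handler above counts elements; to count attribute values instead, the attribute dict passed to the start handler can be inspected. This is a sketch under the assumption that the attribute of interest is foobar on type elements, as in the question; count_with_expat is a hypothetical helper name.

```python
from collections import Counter
from xml.parsers import expat

def count_with_expat(xml_text, tag, attribute):
    """Count each distinct value of `attribute` seen on `tag` start tags."""
    counts = Counter()

    def start_element(name, attrs):
        # attrs is a dict mapping attribute names to their values.
        if name == tag and attribute in attrs:
            counts[attrs[attribute]] += 1

    parser = expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(xml_text, True)  # True marks this as the final chunk of input
    return counts

expat_counts = count_with_expat(
    '<foo><bar><type foobar="1"/><type foobar="2"/><type foobar="1"/></bar></foo>',
    "type",
    "foobar",
)
print(expat_counts)
```

Using a closure over a Counter avoids the global variable, which makes the function safe to call once per database row.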

User · Answer

You can use BeautifulSoup:

    from bs4 import BeautifulSoup

    x = """<foo>
       <bar>
          <type foobar="1"/>
          <type foobar="2"/>
       </bar>
    </foo>"""

    y = BeautifulSoup(x)

    >>> y.foo.bar.type["foobar"]
    u'1'

    >>> y.foo.bar.findAll("type")
    [<type foobar="1"></type>, <type foobar="2"></type>]

    >>> y.foo.bar.findAll("type")[0]["foobar"]
    u'1'
    >>> y.foo.bar.findAll("type")[1]["foobar"]
    u'2'

User · Answer

    import xml.etree.ElementTree as ET

    data = '''<foo>
               <bar>
                   <type foobar="1"/>
                   <type foobar="2"/>
               </bar>
           </foo>'''

    tree = ET.fromstring(data)
    lst = tree.findall('bar/type')
    for item in lst:
        print(item.get('foobar'))

This will print the value of the foobar attribute for each type element.
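Since the question asks for counts rather than just the values, the same findall call can feed a collections.Counter. A minimal sketch, with a third type element added to the sample purely for illustration:

```python
import xml.etree.ElementTree as ET
from collections import Counter

data = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
      <type foobar="2"/>
   </bar>
</foo>"""

root = ET.fromstring(data)
# One Counter entry per distinct value of the foobar attribute.
value_counts = Counter(item.get("foobar") for item in root.findall("bar/type"))

print(value_counts["2"])            # occurrences of foobar="2"
print(sum(value_counts.values()))   # total number of <type> elements found
```

Note that item.get returns None for elements lacking the attribute, so such elements would be counted under the key None.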

User · Answer

minidom is the quickest and pretty straightforward.

XML:

    <data>
        <items>
            <item name="item1"></item>
            <item name="item2"></item>
            <item name="item3"></item>
            <item name="item4"></item>
        </items>
    </data>

Python:

    from xml.dom import minidom

    xmldoc = minidom.parse('items.xml')
    itemlist = xmldoc.getElementsByTagName('item')
    print(len(itemlist))
    print(itemlist[0].attributes['name'].value)
    for s in itemlist:
        print(s.attributes['name'].value)

Output:

    4
    item1
    item1
    item2
    item3
    item4
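When the XML comes from a database column rather than a file, minidom.parseString can be used in place of minidom.parse. The sketch below counts attribute values that way; the duplicated item1 in the sample is an assumption added to make the counts visible.

```python
from collections import Counter
from xml.dom import minidom

xml_text = """<data>
    <items>
        <item name="item1"></item>
        <item name="item2"></item>
        <item name="item1"></item>
    </items>
</data>"""

doc = minidom.parseString(xml_text)
# Count each distinct value of the name attribute, skipping items without one.
name_counts = Counter(
    node.getAttribute("name")
    for node in doc.getElementsByTagName("item")
    if node.hasAttribute("name")
)
print(name_counts)
```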

User · Answer

Here is some very simple but effective code using cElementTree:

    import sys

    try:
        import cElementTree as ET
    except ImportError:
        try:
            # Python 2.5 and later ship it inside xml.etree
            import xml.etree.cElementTree as ET
        except ImportError:
            try:
                # Python 3.9+ removed cElementTree; ElementTree is accelerated itself
                import xml.etree.ElementTree as ET
            except ImportError:
                sys.exit("Failed to import cElementTree from any known place")

    def find_in_tree(tree, node):
        found = tree.find(node)
        if found is None:
            print("No %s in file" % node)
            found = []
        return found

    # Parse an XML file (specify the path)
    def_file = "xml_file_name.xml"
    try:
        dom = ET.parse(def_file)
        root = dom.getroot()
    except Exception:
        sys.exit("Unable to open and parse input definition file: " + def_file)

    # Find the child nodes of node 'myNode'
    fwdefs = find_in_tree(root, "myNode")

This is from python xml parse.

User · Answer

xml.etree.ElementTree vs. lxml

These are some pros of the two most used libraries that I would have benefited from knowing before choosing between them.

xml.etree.ElementTree:

- From the standard library: no need to install any module.

lxml:

- Easily write the XML declaration: for instance, do you need to add standalone="no"?
- Pretty printing: you can have nicely indented XML without extra code.
- Objectify functionality: it allows you to use XML as if you were dealing with a normal Python object hierarchy (.node).
- sourceline allows you to easily get the line of the XML element you are using.
- You can also use a built-in XSD schema checker.

User · Answer

A new lib. I fell in love with it after I used it, and I recommend it to you.

    from simplified_scrapy import SimplifiedDoc

    xml = '''
    <foo>
       <bar>
          <type foobar="1"/>
          <type foobar="2"/>
       </bar>
    </foo>
    '''

    doc = SimplifiedDoc(xml)
    types = doc.selects('bar>type')
    print(len(types))  # 2
    print(types.foobar)  # ['1', '2']
    print(doc.selects('bar>type>foobar()'))  # ['1', '2']

Here are more examples. This lib is easy to use.

User · Answer

If you don't want to use any external libraries or 3rd-party tools, please try the code below.

- This will parse XML into a Python dictionary.
- This will parse XML attributes as well.
- This will also parse empty tags like <tag/> and tags with only attributes like <tag var=val/>.

Code:

    import re

    def getdict(content):
        res = re.findall(r"<(?P<var>\S*)(?P<attr>[^/>]*)(?:(?:>(?P<val>.*?)</(?P=var)>)|(?:/>))", content)
        if len(res) >= 1:
            attreg = r"(?P<avr>\S+?)(?:(?:=(?P<quote>['\"])(?P<avl>.*?)(?P=quote))|(?:=(?P<avl1>.*?)(?:\s|$))|(?P<avl2>[\s]+)|$)"

            def node(var, attr, val):
                # Build one {tag: [attributes, values]} entry, recursing into the content
                attrs = [{j[0]: (j[2] or j[3] or j[4])} for j in re.findall(attreg, attr.strip())]
                return {var: [{"@attributes": attrs}, {"$values": getdict(val)}]}

            if len(res) > 1:
                return [node(*i) for i in res]
            else:
                return node(*res[0])
        else:
            return content

    with open("test.xml", "r") as f:
        print(getdict(f.read().replace('\n', '')))

Sample input:

    <details class="4b" count=1 boy>
        <name type="firstname">John</name>
        <age>13</age>
        <hobby>Coin collection</hobby>
        <hobby>Stamp collection</hobby>
        <address>
            <country>USA</country>
            <state>CA</state>
        </address>
    </details>
    <details empty="True"/>
    <details/>
    <details class="4a" count=2 girl>
        <name type="firstname">Samantha</name>
        <age>13</age>
        <hobby>Fishing</hobby>
        <hobby>Chess</hobby>
        <address current="no">
            <country>Australia</country>
            <state>NSW</state>
        </address>
    </details>

Output (beautified):

    [
      {
        "details": [
          {"@attributes": [{"class": "4b"}, {"count": "1"}, {"boy": ""}]},
          {"$values": [
            {"name": [{"@attributes": [{"type": "firstname"}]}, {"$values": "John"}]},
            {"age": [{"@attributes": []}, {"$values": "13"}]},
            {"hobby": [{"@attributes": []}, {"$values": "Coin collection"}]},
            {"hobby": [{"@attributes": []}, {"$values": "Stamp collection"}]},
            {"address": [{"@attributes": []}, {"$values": [
              {"country": [{"@attributes": []}, {"$values": "USA"}]},
              {"state": [{"@attributes": []}, {"$values": "CA"}]}
            ]}]}
          ]}
        ]
      },
      {"details": [{"@attributes": [{"empty": "True"}]}, {"$values": ""}]},
      {"details": [{"@attributes": []}, {"$values": ""}]},
      {
        "details": [
          {"@attributes": [{"class": "4a"}, {"count": "2"}, {"girl": ""}]},
          {"$values": [
            {"name": [{"@attributes": [{"type": "firstname"}]}, {"$values": "Samantha"}]},
            {"age": [{"@attributes": []}, {"$values": "13"}]},
            {"hobby": [{"@attributes": []}, {"$values": "Fishing"}]},
            {"hobby": [{"@attributes": []}, {"$values": "Chess"}]},
            {"address": [{"@attributes": [{"current": "no"}]}, {"$values": [
              {"country": [{"@attributes": []}, {"$values": "Australia"}]},
              {"state": [{"@attributes": []}, {"$values": "NSW"}]}
            ]}]}
          ]}
        ]
      }
    ]
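For comparison, a dictionary conversion can also be sketched without regular expressions and still without third-party packages, using the standard library's ElementTree. The dictionary layout below (tag, attributes, text, children) is a hypothetical one chosen for illustration and differs from the layout above. Unlike the regex approach, a real XML parser rejects unquoted or valueless attributes such as count=1 or boy, so the sample here uses well-formed input.

```python
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Convert an Element into a nested dict (illustrative layout, see lead-in)."""
    return {
        "tag": elem.tag,
        "attributes": dict(elem.attrib),
        "text": (elem.text or "").strip(),
        # Recurse into child elements in document order.
        "children": [element_to_dict(child) for child in elem],
    }

tree_dict = element_to_dict(ET.fromstring(
    '<details class="4b"><name type="firstname">John</name><age>13</age></details>'
))
print(tree_dict["children"][0]["attributes"])
```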

User · Answer

If the XML is in the form of a string, as shown below, then:

    from lxml import etree, objectify

    # sample XML as a byte string with the namespace {http://xmlns.abc.com}
    message = b'<?xml version="1.0" encoding="UTF-8"?>\r\n<pa:Process xmlns:pa="http://xmlns.abc.com">\r\n\t<pa:firsttag>SAMPLE</pa:firsttag></pa:Process>\r\n'

    print('*** message conversion and parsing starts ***')

    message = message.decode('utf-8')
    # replace is used to remove unwanted strings from the message
    message = message.replace('<?xml version="1.0" encoding="UTF-8"?>\r\n', '')
    message = message.replace('pa:Process>\r\n', 'pa:Process>')
    print(message)

    print('*** Parsing starts ***')
    # remove_blank_text drops whitespace-only text nodes; the namespace is kept
    parser = etree.XMLParser(remove_blank_text=True)
    root = etree.fromstring(message, parser)  # parsing of the XML happens here
    print('*** Parsing completed ***')

    result = {}
    for child in root:  # iterate over the parsed XML and store values in a dictionary
        print(child.tag, child.text)
        print('*** Deriving from xml tree ***')
        if child.tag == "{http://xmlns.abc.com}firsttag":
            result["FIRST_TAG"] = child.text
            print(result)

Output:

    *** message conversion and parsing starts ***
    <pa:Process xmlns:pa="http://xmlns.abc.com">
        <pa:firsttag>SAMPLE</pa:firsttag></pa:Process>
    *** Parsing starts ***
    *** Parsing completed ***
    {http://xmlns.abc.com}firsttag SAMPLE
    *** Deriving from xml tree ***
    {'FIRST_TAG': 'SAMPLE'}

User · Answer

I suggest xmltodict for simplicity.

It parses your XML to an OrderedDict:

    >>> e = '''<foo>
    ...          <bar>
    ...              <type foobar="1"/>
    ...              <type foobar="2"/>
    ...          </bar>
    ...        </foo>'''

    >>> import xmltodict
    >>> result = xmltodict.parse(e)
    >>> result

    OrderedDict([(u'foo', OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))]))])

    >>> result['foo']

    OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))])

    >>> result['foo']['bar']

    OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])])

User · Answer

I suggest ElementTree. There are other compatible implementations of the same API, such as lxml, and cElementTree in the Python standard library itself; but, in this context, what they chiefly add is even more speed. The ease-of-programming part depends on the API, which ElementTree defines.

First build an Element instance root from the XML, e.g. with the XML function, or by parsing a file with something like:

    import xml.etree.ElementTree as ET
    root = ET.parse('thefile.xml').getroot()

Or any of the many other ways shown at ElementTree. Then do something like:

    for type_tag in root.findall('bar/type'):
        value = type_tag.get('foobar')
        print(value)

And similar, usually pretty simple, code patterns.
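The question mentions many database rows, so the same pattern can be run per row and the results summed. The sketch below uses an in-memory SQLite database; the table name docs and the column name body are assumptions made up for the example, as is the deliberately malformed third row.

```python
import sqlite3
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical schema: table "docs" with the XML stored in a text column "body".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [
        ('<foo><bar><type foobar="1"/><type foobar="2"/></bar></foo>',),
        ('<foo><bar><type foobar="2"/></bar></foo>',),
        ('<foo><bar><type foobar="1"/></bar><oops></foo>',),  # malformed on purpose
    ],
)

totals = Counter()
bad_rows = []
for row_id, body in conn.execute("SELECT id, body FROM docs ORDER BY id"):
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        bad_rows.append(row_id)  # remember rows that could not be parsed
        continue
    totals.update(t.get("foobar") for t in root.findall("bar/type"))

print(totals, bad_rows)
```

Catching ET.ParseError per row keeps one bad document from aborting the whole count.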

User · Answer

lxml.objectify is really simple.

Taking your sample text:

    from lxml import objectify
    from collections import defaultdict

    count = defaultdict(int)

    root = objectify.fromstring(text)

    for item in root.bar.type:
        count[item.attrib.get("foobar")] += 1

    print(dict(count))

Output:

    {'1': 1, '2': 1}

User · Answer

There's no need to use a lib-specific API if you use python-benedict. Just initialize a new instance from your XML and manage it easily, since it is a dict subclass.

Installation is easy: pip install python-benedict

    from benedict import benedict as bdict

    # data-source can be an url, a filepath or data-string (as in this example)
    data_source = """
    <foo>
       <bar>
          <type foobar="1"/>
          <type foobar="2"/>
       </bar>
    </foo>"""

    data = bdict.from_xml(data_source)
    t_list = data['foo.bar']  # yes, keypath supported
    for t in t_list:
        print(t['@foobar'])

It supports and normalizes I/O operations with many formats: Base64, CSV, JSON, TOML, XML, YAML and query-string.

It is well tested and open-source on GitHub. Disclosure: I am the author.

User · Answer

If the source is an XML file, say like this sample:

    <pa:Process xmlns:pa="http://sssss">
        <pa:firsttag>SAMPLE</pa:firsttag>
    </pa:Process>

you may try the following code:

    from lxml import etree, objectify

    metadata = 'C:\\Users\\PROCS.xml'  # sample XML file; its contents are shown above
    # remove_blank_text drops whitespace-only text nodes
    parser = etree.XMLParser(remove_blank_text=True)
    tree = etree.parse(metadata, parser)  # parse the XML file PROCS.xml
    root = tree.getroot()  # the root of the XML, which is Process

    # Strip the namespace (here http://sssss) from every tag name
    for elem in root.iter():
        if not hasattr(elem.tag, 'find'):
            continue  # skip comments and processing instructions
        i = elem.tag.find('}')
        if i >= 0:
            elem.tag = elem.tag[i+1:]

    result = {}  # a Python dictionary to collect the values
    for elem in tree.iter():  # iterate through the XML tree
        if elem.tag == "firsttag":  # if the tag name matches, store its text in the dictionary
            result["FIRST_TAG"] = str(elem.text)
            print(result)

Output would be:

    {'FIRST_TAG': 'SAMPLE'}
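As an alternative to rewriting tag names, namespaced elements can be looked up directly. This is a sketch with the standard library's ElementTree rather than lxml, reusing the sample document and namespace URI from the answer; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

xml_text = """<pa:Process xmlns:pa="http://sssss">
    <pa:firsttag>SAMPLE</pa:firsttag>
</pa:Process>"""

root = ET.fromstring(xml_text)
print(root.tag)  # tags are stored as {namespace-uri}localname

# Option 1: map a prefix to the URI and use the prefix in the path.
ns = {"pa": "http://sssss"}
first = root.find("pa:firsttag", ns)

# Option 2: spell out the qualified name in {uri}localname form.
same = root.find("{http://sssss}firsttag")

result = {"FIRST_TAG": first.text}
print(result)
```

Either lookup leaves the tree unmodified, which matters if the document is written back out later.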

User · Answer

Just to add another possibility, you can use untangle, as it is a simple xml-to-python-object library. Here you have an example:

Installation:

    pip install untangle

Usage:

Your XML file (a little bit changed):

    <foo>
       <bar name="bar_name">
          <type foobar="1"/>
       </bar>
    </foo>

Accessing the attributes with untangle:

    import untangle

    obj = untangle.parse('/path_to_xml_file/file.xml')

    print(obj.foo.bar['name'])
    print(obj.foo.bar.type['foobar'])

The output will be:

    bar_name
    1

More information about untangle can be found in "untangle".

Also, if you are curious, you can find a list of tools for working with XML and Python in "Python and XML". You will also see that the most common ones were mentioned by previous answers.

User · Answer

XML:

    <foo>
       <bar>
          <type foobar="1"/>
          <type foobar="2"/>
       </bar>
    </foo>

Python code:

    # cElementTree was removed in Python 3.9; ElementTree offers the same API
    import xml.etree.ElementTree as ET

    tree = ET.parse("foo.xml")
    root = tree.getroot()
    root_tag = root.tag
    print(root_tag)

    for form in root.findall("./bar/type"):
        x = form.attrib  # dict of the element's attributes
        z = list(x)      # the attribute names
        for i in z:
            print(x[i])

Output:

    foo
    1
    2
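The path passed to findall can also carry an attribute predicate, which lets ElementTree do the filtering. A small sketch; the extra type element without an attribute is added only to show the difference between the two predicates.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
      <type/>
   </bar>
</foo>""")

# [@foobar] keeps only elements that carry the attribute at all.
with_attribute = root.findall("./bar/type[@foobar]")
# [@foobar='1'] keeps only elements whose attribute has exactly that value.
with_value_one = root.findall("./bar/type[@foobar='1']")

print(len(with_attribute), len(with_value_one))
```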

User · Answer

I might suggest declxml.

Full disclosure: I wrote this library because I was looking for a way to convert between XML and Python data structures without needing to write dozens of lines of imperative parsing/serialization code with ElementTree.

With declxml, you use processors to declaratively define the structure of your XML document and how to map between XML and Python data structures. Processors are used for both serialization and parsing, as well as for a basic level of validation.

Parsing into Python data structures is straightforward:

    import declxml as xml

    xml_string = """
    <foo>
       <bar>
          <type foobar="1"/>
          <type foobar="2"/>
       </bar>
    </foo>
    """

    processor = xml.dictionary('foo', [
        xml.dictionary('bar', [
            xml.array(xml.integer('type', attribute='foobar'))
        ])
    ])

    xml.parse_from_string(processor, xml_string)

Which produces the output:

    {'bar': {'foobar': [1, 2]}}

You can also use the same processor to serialize data to XML:

    data = {'bar': {
        'foobar': [7, 3, 21, 16, 11]
    }}

    xml.serialize_to_string(processor, data, indent='    ')

Which produces the following output:

    <?xml version="1.0" ?>
    <foo>
        <bar>
            <type foobar="7"/>
            <type foobar="3"/>
            <type foobar="21"/>
            <type foobar="16"/>
            <type foobar="11"/>
        </bar>
    </foo>

If you want to work with objects instead of dictionaries, you can define processors to transform data to and from objects as well:

    import declxml as xml

    class Bar:

        def __init__(self):
            self.foobars = []

        def __repr__(self):
            return 'Bar(foobars={})'.format(self.foobars)

    xml_string = """
    <foo>
       <bar>
          <type foobar="1"/>
          <type foobar="2"/>
       </bar>
    </foo>
    """

    processor = xml.dictionary('foo', [
        xml.user_object('bar', Bar, [
            xml.array(xml.integer('type', attribute='foobar'), alias='foobars')
        ])
    ])

    xml.parse_from_string(processor, xml_string)

Which produces the following output:

    {'bar': Bar(foobars=[1, 2])}
