How to parse XML in Bash

Question

Ideally, what I would like to be able to do is:

    cat xhtmlfile.xhtml |
    getElementViaXPath --path='/html/head/title' |
    sed -e 's%(^<title>|</title>$)%%g' > titleOfXHTMLPage.txt

User · Answer

Another command line tool is my new Xidel. It also supports XPath 2 and XQuery, contrary to the already mentioned xpath/xmlstarlet.

The title can be read like:

xidel xhtmlfile.xhtml -e /html/head/title > titleOfXHTMLPage.txt

And it also has a cool feature to export multiple variables to bash. For example

eval $(xidel xhtmlfile.xhtml -e 'title := //title, imgcount := count(//img)' --output-format bash )

sets $title to the title and $imgcount to the number of images in the file, which should be as flexible as parsing it directly in bash.

User · Answer

You can use the xpath utility. It's installed with the Perl XML-XPath package.

Usage:

    /usr/bin/xpath [filename] query

or XMLStarlet. To install it on openSUSE use:

    sudo zypper install xmlstarlet

or try cnf xml on other platforms.

User · Answer

Check out XML2 from http://www.ofb.net/~egnor/xml2/ which converts XML to a line-oriented format.

User · Answer

This is really just an explanation of Yuzem's answer, but I didn't feel like this much editing should be done to someone else, and comments don't allow formatting, so...

    rdom () { local IFS=\> ; read -d \< E C ;}

Let's call that "read_dom" instead of "rdom", space it out a bit and use longer variables:

    read_dom () {
        local IFS=\>
        read -d \< ENTITY CONTENT
    }

Okay, so it defines a function called read_dom. The first line makes IFS (the input field separator) local to this function and changes it to >. That means that when you read data, instead of automatically being split on space, tab or newlines it gets split on '>'. The next line says to read input from stdin, and instead of stopping at a newline, stop when you see a '<' character (the -d is the delimiter flag). What is read is then split using the IFS and assigned to the variables ENTITY and CONTENT. So take the following:

    <tag>value</tag>

The first call to read_dom gets an empty string (since the '<' is the first character). That gets split by IFS into just '', since there isn't a '>' character. Read then assigns an empty string to both variables. The second call gets the string 'tag>value'. That gets split by the IFS into the two fields 'tag' and 'value'. Read then assigns the variables like: ENTITY=tag and CONTENT=value. The third call gets the string '/tag>'. That gets split by the IFS into the two fields '/tag' and ''. Read then assigns the variables like: ENTITY=/tag and CONTENT=. The fourth call will return a non-zero status because we've reached the end of file.

Now his while loop, cleaned up a bit to match the above:

    while read_dom; do
        if [[ $ENTITY = "title" ]]; then
            echo $CONTENT
            exit
        fi
    done < xhtmlfile.xhtml > titleOfXHTMLPage.txt

The first line just says, "while the read_dom function returns a zero status, do the following." The second line checks if the entity we've just seen is "title". The next line echoes the content of the tag. The fourth line exits. If it wasn't the title entity then the loop repeats on the sixth line. We redirect "xhtmlfile.xhtml" into standard input (for the read_dom function) and redirect standard output to "titleOfXHTMLPage.txt" (the echo from earlier in the loop).

Now given the following (similar to what you get from listing a bucket on S3) for input.xml:

    <ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
      <Name>sth-items</Name>
      <IsTruncated>false</IsTruncated>
      <Contents>
        <Key>item-apple-iso@2x.png</Key>
        <LastModified>2011-07-25T22:23:04.000Z</LastModified>
        <ETag>&quot;0032a28286680abee71aed5d059c6a09&quot;</ETag>
        <Size>1785</Size>
        <StorageClass>STANDARD</StorageClass>
      </Contents>
    </ListBucketResult>

and the following loop:

    while read_dom; do
        echo "$ENTITY => $CONTENT"
    done < input.xml

You should get:

     =>
    ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/" =>
    Name => sth-items
    /Name =>
    IsTruncated => false
    /IsTruncated =>
    Contents =>
    Key => item-apple-iso@2x.png
    /Key =>
    LastModified => 2011-07-25T22:23:04.000Z
    /LastModified =>
    ETag => &quot;0032a28286680abee71aed5d059c6a09&quot;
    /ETag =>
    Size => 1785
    /Size =>
    StorageClass => STANDARD
    /StorageClass =>
    /Contents =>

So if we wrote a while loop like Yuzem's:

    while read_dom; do
        if [[ $ENTITY = "Key" ]] ; then
            echo $CONTENT
        fi
    done < input.xml

We'd get a listing of all the files in the S3 bucket.

EDIT: If for some reason local IFS=\> doesn't work for you and you set it globally, you should reset it at the end of the function like:

    read_dom () {
        ORIGINAL_IFS=$IFS
        IFS=\>
        read -d \< ENTITY CONTENT
        IFS=$ORIGINAL_IFS
    }

Otherwise, any line splitting you do later in the script will be messed up.

EDIT 2: To split out attribute name/value pairs you can augment read_dom() like so:

    read_dom () {
        local IFS=\>
        read -d \< ENTITY CONTENT
        local ret=$?
        TAG_NAME=${ENTITY%% *}
        ATTRIBUTES=${ENTITY#* }
        return $ret
    }

Then write your function to parse and get the data you want like this:

    parse_dom () {
        if [[ $TAG_NAME = "foo" ]] ; then
            eval local $ATTRIBUTES
            echo "foo size is: $size"
        elif [[ $TAG_NAME = "bar" ]] ; then
            eval local $ATTRIBUTES
            echo "bar type is: $type"
        fi
    }

Then while you read_dom, call parse_dom:

    while read_dom; do
        parse_dom
    done

Then given the following example markup:

    <example>
      <bar size="bar_size" type="metal">bars content</bar>
      <foo size="1789" type="unknown">foos content</foo>
    </example>

You should get this output:

    $ cat example.xml | ./bash_xml.sh
    bar type is: metal
    foo size is: 1789

EDIT 3: Another user said they were having problems with it in FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom like:

    read_dom () {
        local IFS=\>
        read -d \< ENTITY CONTENT
        local RET=$?
        TAG_NAME=${ENTITY%% *}
        ATTRIBUTES=${ENTITY#* }
        return $RET
    }

I don't see any reason why that shouldn't work.
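The read loop described in this answer can be tried without a real S3 listing. The following is a minimal sketch, assuming bash (read -d is not POSIX); the sample document and the list_keys helper are made up for illustration, and -r is added only to keep backslashes literal.

```bash
#!/usr/bin/env bash
# read_dom as described in the answer: split on '>' and stop reading at each '<'.
read_dom () {
    local IFS=\>
    read -r -d \< ENTITY CONTENT
}

# list_keys is a hypothetical helper: print the text of every <Key> element on stdin.
list_keys () {
    local ENTITY CONTENT
    while read_dom; do
        if [[ $ENTITY = "Key" ]]; then
            echo "$CONTENT"
        fi
    done
}

# Made-up sample input, loosely shaped like a bucket listing.
sample='<ListBucketResult><Contents><Key>a.png</Key></Contents><Contents><Key>b.png</Key></Contents></ListBucketResult>'

keys=$(list_keys <<< "$sample")
echo "$keys"
```

Wrapping the loop in a function keeps ENTITY and CONTENT local, so repeated calls do not leak state into the rest of a script. This is still plain text splitting, not XML parsing: CDATA sections, comments, and a '>' inside an attribute value will confuse it.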

User · Answer

Yuzem's method can be improved by inverting the order of the < and > signs in the rdom function and the variable assignments, so that:

    rdom () { local IFS=\> ; read -d \< E C ;}

becomes:

    rdom () { local IFS=\< ; read -d \> C E ;}

If the parsing is not done like this, the last tag in the XML file is never reached. This can be problematic if you intend to output another XML file at the end of the while loop.
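The difference between the two orderings can be seen by recording which tags each variant hands to the loop body. This is a small sketch, assuming bash; the sample string and the tags_seen helper are invented for the demonstration.

```bash
#!/usr/bin/env bash
# Original order: split on '>', stop at '<'.
rdom_orig () { local IFS=\> ; read -r -d \< E C ; }
# Inverted order: split on '<', stop at '>'. C is the text that precedes tag E.
rdom_inv () { local IFS=\< ; read -r -d \> C E ; }

# tags_seen is a hypothetical helper: run the named reader over stdin and
# print every non-empty tag name it yields inside the loop, space-separated.
tags_seen () {
    local reader=$1 E C out=""
    while "$reader"; do
        [[ -n $E ]] && out+="$E "
    done
    echo "${out% }"
}

sample='<a><b>text</b></a>'   # made-up input

orig_tags=$(tags_seen rdom_orig <<< "$sample")
inv_tags=$(tags_seen rdom_inv <<< "$sample")
echo "original: $orig_tags"
echo "inverted: $inv_tags"
```

With the original ordering the final read hits end of file before finding another '<', so it returns non-zero and the loop body never runs for the closing root tag. Note that with the inverted ordering C holds the text before the tag rather than after it, so code that consumes C has to be adjusted accordingly.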

User · Answer

I am not aware of any pure shell XML parsing tool, so you will most likely need a tool written in another language.

My XML::Twig Perl module comes with such a tool: xml_grep, where you would probably write what you want as

    xml_grep -t '/html/head/title' xhtmlfile.xhtml > titleOfXHTMLPage.txt

(the -t option gives you the result as text instead of XML).

User · Answer

While it seems like "never parse XML, JSON... from bash without a proper tool" is sound advice, I disagree. If this is a side job, it is wasteful to look for the proper tool, then learn it... Awk can do it in minutes. My programs have to work on all the above-mentioned and more kinds of data. Hell, I do not want to test 30 tools to parse the 5-7-10 different formats I need if I can awk the problem in minutes. I do not care about XML, JSON or whatever! I need a single solution for all of them.

As an example: my SmartHome program runs our homes. While doing it, it reads a plethora of data in too many different formats I cannot control. I never use dedicated, proper tools since I do not want to spend more than minutes on reading the data I need. With FS and RS adjustments, this awk solution works perfectly for any textual format. But it may not be the proper answer when your primary task is to work with loads of data in that format!

I faced the problem of parsing XML from bash yesterday. Here is how I do it for any hierarchical data format. As a bonus, I assign data directly to the variables in a bash script.

To make things easier to read, I will present the solution in stages. From the OP's test data, I created a file: test.xml

Parsing said XML in bash and extracting the data in 90 chars:

    awk 'BEGIN { FS="<|>"; RS="\n" }; /host|username|password|dbname/ { print $2, $4 }' test.xml

I normally use a more readable version since it is easier to modify in real life, as I often need to test differently:

    awk 'BEGIN { FS="<|>"; RS="\n" }; { if ($0 ~ /host|username|password|dbname/) print $2,$4}' test.xml

I do not care what the format is called. I seek only the simplest solution. In this particular case, I can see from the data that newline is the record separator (RS) and <> delimit fields (FS). In my original case, I had complicated indexing of 6 values within two records, relating them, finding when the data exists, plus fields (records) that may or may not exist. It took 4 lines of awk to solve the problem perfectly. So, adapt the idea to each need before using it!

The second part simply checks whether the wanted string is in a line (RS) and, if so, prints out the needed fields (FS). The above took me about 30 seconds to copy and adapt from the last command I used this way (4 times longer). And that is it! Done in 90 chars.

But I always need to get the data neatly into variables in my script. I first test the constructs like so:

    awk 'BEGIN { FS="<|>"; RS="\n" }; { if ($0 ~ /host|username|password|dbname/) print $2"=\""$4"\"" }' test.xml

In some cases I use printf instead of print. When I see everything looks well, I simply finish by assigning values to variables. I know many think "eval" is "evil", no need to comment :) The trick has worked perfectly on all four of my networks for years. But keep learning if you do not understand why this may be bad practice! Including bash variable assignments and ample spacing, my solution needs 120 chars to do everything.

    eval $( awk 'BEGIN { FS="<|>"; RS="\n" }; { if ($0 ~ /host|username|password|dbname/) print $2"=\""$4"\"" }' test.xml ); echo "host: $host, username: $username, password: $password dbname: $dbname"
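As a self-contained illustration of the field-splitting idea, here is a sketch that runs on inline data. The sample document is made up and assumes one <name>value</name> element per line; field numbers depend entirely on the input layout, and in this sample the value lands in $3. The assignment step uses a plain read loop instead of eval, which is a substitution for the approach above rather than a copy of it.

```bash
# Made-up sample with one <name>value</name> element per line.
sample='<config>
  <host>db.example.com</host>
  <username>alice</username>
  <port>5432</port>
</config>'

# Splitting each line on "<" or ">" gives: $1 = indentation, $2 = tag name,
# $3 = text content, $4 = closing tag. Print name=value for the wanted tags only.
pairs=$(printf '%s\n' "$sample" |
    awk 'BEGIN { FS = "<|>" } $2 ~ /^(host|username)$/ { print $2 "=" $3 }')
echo "$pairs"

# Assign without eval: read each name=value pair and keep the two we expect.
while IFS='=' read -r name value; do
    case $name in
        host)     host=$value ;;
        username) username=$value ;;
    esac
done <<EOF
$pairs
EOF
echo "host: $host, username: $username"
```

Matching on $2 rather than on the whole line avoids false hits when a wanted word appears inside another element's text, and the case statement means that unexpected names in the input are never turned into shell variables.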

User · Answer

This is sufficient:

    xpath xhtmlfile.xhtml '/html/head/title/text()' > titleOfXHTMLPage.txt

User · Answer

This works if you want XML attributes:

    $ cat alfa.xml
    <video server="asdf.com" stream="H264_400.mp4" cdn="limelight"/>

    $ sed 's.[^ ]*..;s./>..' alfa.xml > alfa.sh

    $ . ./alfa.sh

    $ echo "$stream"
    H264_400.mp4

User · Answer

While there are quite a few ready-made console utilities that might do what you want, it will probably take less time to write a couple of lines of code in a general-purpose programming language such as Python, which you can easily extend and adapt to your needs.

Here is a Python script which uses lxml for parsing: it takes the name of a file or a URL as the first parameter, an XPath expression as the second parameter, and prints the strings/nodes matching the given expression.

Example 1

    #!/usr/bin/env python
    import sys
    from lxml import etree

    tree = etree.parse(sys.argv[1])
    xpath_expression = sys.argv[2]

    #  a hack allowing to access the
    #  default namespace (if defined) via the 'p:' prefix
    #  E.g. given a default namespace such as 'xmlns="http://maven.apache.org/POM/4.0.0"'
    #  an XPath of '//p:module' will return all the 'module' nodes
    ns = tree.getroot().nsmap
    if ns.keys() and None in ns:
        ns['p'] = ns.pop(None)
    #   end of hack

    for e in tree.xpath(xpath_expression, namespaces=ns):
        if isinstance(e, str):
            print(e)
        else:
            # encoding="unicode" makes tostring return text rather than bytes
            print(e.text and e.text.strip() or etree.tostring(e, pretty_print=True, encoding="unicode"))

lxml can be installed with pip install lxml. On Ubuntu you can use sudo apt install python-lxml.

Usage:

    python xpath.py myfile.xml "//mynode"

lxml also accepts a URL as input:

    python xpath.py http://www.feedforall.com/sample.xml "//link"

Note: If your XML has a default namespace with no prefix (e.g. xmlns=http://abc...) then you have to use the p prefix (provided by the 'hack') in your expressions, e.g. //p:module to get the modules from a pom.xml file. In case the p prefix is already mapped in your XML, then you'll need to modify the script to use another prefix.

Example 2

A one-off script which serves the narrow purpose of extracting module names from an Apache Maven file. Note how the node name (module) is prefixed with the default namespace {http://maven.apache.org/POM/4.0.0}:

pom.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modules>
            <module>cherries</module>
            <module>bananas</module>
            <module>pears</module>
        </modules>
    </project>

module_extractor.py:

    from lxml import etree

    # open in binary mode so lxml handles the declared encoding itself
    for _, e in etree.iterparse(open("pom.xml", "rb"), tag="{http://maven.apache.org/POM/4.0.0}module"):
        print(e.text)

User · Answer

Command-line tools that can be called from shell scripts include:

4xpath - command-line wrapper around Python's 4Suite package
XMLStarlet
xpath - command-line wrapper around Perl's XPath library (sudo apt-get install libxml-xpath-perl)
Xidel - works with URLs as well as files; also works with JSON

I also use xmllint and xsltproc with little XSL transform scripts to do XML processing from the command line or in shell scripts.

User · Answer

After some research on translating between Linux and Windows file path formats in XML files, I found interesting tutorials and solutions on:

General information about XPaths
Amara - collection of Pythonic tools for XML
Develop Python/XML with 4Suite (2 parts)

User · Answer

You can do that very easily using only bash. You only have to add this function:

    rdom () { local IFS=\> ; read -d \< E C ;}

Now you can use rdom like read but for HTML documents. When called, rdom will assign the element to variable E and the content to variable C.

For example, to do what you wanted to do:

    while rdom; do
        if [[ $E = title ]]; then
            echo $C
            exit
        fi
    done < xhtmlfile.xhtml > titleOfXHTMLPage.txt
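For use inside a larger script, the same loop can be wrapped in a function so that finding the title does not end the whole script. A minimal sketch, assuming bash; get_title and the sample page are hypothetical names and data, not part of the answer above.

```bash
#!/usr/bin/env bash
rdom () { local IFS=\> ; read -r -d \< E C ; }

# get_title is a hypothetical wrapper: print the first <title> text found on
# stdin and stop. "return" leaves only the function, whereas "exit" inside
# a top-level loop would end the whole script.
get_title () {
    local E C
    while rdom; do
        if [[ $E = title ]]; then
            echo "$C"
            return 0
        fi
    done
    return 1   # no <title> element seen
}

# Made-up XHTML fragment.
page='<html><head><title>My page</title></head><body><p>hi</p></body></html>'

title=$(get_title <<< "$page")
echo "$title"
```

The comparison is against the bare tag name, so a title element that carries attributes (for example <title lang="en">) would not match without first stripping everything after the first space in E.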

User · Answer

Starting from chad's answer, here is the COMPLETE working solution to parse XML, with proper handling of comments, with just 2 little functions (more than 2, but you can mix them all). I don't say chad's one didn't work at all, but it had too many issues with badly formatted XML files, so you have to be a bit more tricky to handle comments and misplaced spaces/CR/TAB/etc.

The purpose of this answer is to give ready-2-use, out-of-the-box bash functions to anyone needing to parse XML without complex tools using perl, python or anything else. As for me, I cannot install cpan or perl modules on the old production OS I'm working on, and python isn't available.

First, a definition of the XML words used in this post:

    <!-- comment... -->
    <tag attribute="value">content...</tag>

EDIT: updated functions, with handling of:

Websphere xml (xmi and xmlns attributes)
must have a compatible terminal with 256 colors
24 shades of grey
compatibility added for IBM AIX bash 3.2.16(1)

The functions: first is xml_read_dom, which is called in a loop by xml_read:

    xml_read_dom() {
    # https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
    local ENTITY IFS=\>
    if $ITSACOMMENT; then
      read -d \< COMMENTS
      COMMENTS="$(rtrim "${COMMENTS}")"
      return 0
    else
      read -d \< ENTITY CONTENT
      CR=$?
      [ "x${ENTITY:0:1}x" == "x/x" ] && return 0
      TAG_NAME=${ENTITY%%[[:space:]]*}
      [ "x${TAG_NAME}x" == "x?xmlx" ] && TAG_NAME=xml
      TAG_NAME=${TAG_NAME%%:*}
      ATTRIBUTES=${ENTITY#*[[:space:]]}
      ATTRIBUTES="${ATTRIBUTES//xmi:/}"
      ATTRIBUTES="${ATTRIBUTES//xmlns:/}"
    fi

    # when comments sticks to !-- :
    [ "x${TAG_NAME:0:3}x" == "x!--x" ] && COMMENTS="${TAG_NAME:3} ${ATTRIBUTES}" && ITSACOMMENT=true && return 0

    # http://tldp.org/LDP/abs/html/string-manipulation.html
    # INFO: oh wait it doesn't work on IBM AIX bash 3.2.16(1):
    # [ "x${ATTRIBUTES:(-1):1}x" == "x/x" -o "x${ATTRIBUTES:(-1):1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:(-1)}"
    [ "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x/x" -o "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:${#ATTRIBUTES} -1}"
    return $CR
    }

and the second one:

    xml_read() {
    # https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
    ITSACOMMENT=false
    local MULTIPLE_ATTR LIGHT FORCE_PRINT XAPPLY XCOMMAND XATTRIBUTE GETCONTENT fileXml tag attributes attribute tag2print TAGPRINTED attribute2print XAPPLIED_COLOR PROSTPROCESS USAGE
    local TMP LOG LOGG
    LIGHT=false
    FORCE_PRINT=false
    XAPPLY=false
    MULTIPLE_ATTR=false
    XAPPLIED_COLOR=g
    TAGPRINTED=false
    GETCONTENT=false
    PROSTPROCESS=cat
    Debug=${Debug:-false}
    TMP=/tmp/xml_read.$RANDOM
    USAGE="${C}${FUNCNAME}${c} [-cdlp] [-x command <-a attribute>] <file.xml> [tag | \"any\"] [attributes .. | \"content\"]
    ${nn[2]}  -c = NOCOLOR${END}
    ${nn[2]}  -d = Debug${END}
    ${nn[2]}  -l = LIGHT (no \"attribute=\" printed)${END}
    ${nn[2]}  -p = FORCE PRINT (when no attributes given)${END}
    ${nn[2]}  -x = apply a command on an attribute and print the result instead of the former value, in green color${END}
    ${nn[1]}  (no attribute given will load their values into your shell; use '-p' to print them as well)${END}"

    ! (($#)) && echo2 "$USAGE" && return 99
    (( $# < 2 )) && ERROR nbaram 2 0 && return 99
    # getopts:
    while getopts :cdlpx:a: _OPT 2>/dev/null
    do
    {
      case ${_OPT} in
        c) PROSTPROCESS="${DECOLORIZE}" ;;
        d) local Debug=true ;;
        l) LIGHT=true; XAPPLIED_COLOR=END ;;
        p) FORCE_PRINT=true ;;
        x) XAPPLY=true; XCOMMAND="${OPTARG}" ;;
        a) XATTRIBUTE="${OPTARG}" ;;
        *) _NOARGS="${_NOARGS}${_NOARGS+, }-${OPTARG}" ;;
      esac
    }
    done
    shift $((OPTIND - 1))
    unset _OPT OPTARG OPTIND
    [ "X${_NOARGS}" != "X" ] && ERROR param "${_NOARGS}" 0

    fileXml=$1
    tag=$2
    (( $# > 2 )) && shift 2 && attributes=$*
    (( $# > 1 )) && MULTIPLE_ATTR=true

    [ -d "${fileXml}" -o ! -s "${fileXml}" ] && ERROR empty "${fileXml}" 0 && return 1
    $XAPPLY && $MULTIPLE_ATTR && [ -z "${XATTRIBUTE}" ] && ERROR param "-x command " 0 && return 2
    # nb attributes == 1 because $MULTIPLE_ATTR is false
    [ "${attributes}" == "content" ] && GETCONTENT=true

    while xml_read_dom; do
      # (( CR != 0 )) && break
      (( ${PIPESTATUS[1]} != 0 )) && break

      if $ITSACOMMENT; then
        # oh wait it doesn't work on IBM AIX bash 3.2.16(1):
        # if [ "x${COMMENTS:(-2):2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:(-2)}" && ITSACOMMENT=false
        # elif [ "x${COMMENTS:(-3):3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:(-3)}" && ITSACOMMENT=false
        if [ "x${COMMENTS:${#COMMENTS} - 2:2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 2}" && ITSACOMMENT=false
        elif [ "x${COMMENTS:${#COMMENTS} - 3:3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 3}" && ITSACOMMENT=false
        fi
        $Debug && echo2 "${N}${COMMENTS}${END}"
      elif test "${TAG_NAME}"; then
        if [ "x${TAG_NAME}x" == "x${tag}x" -o "x${tag}x" == "xanyx" ]; then
          if $GETCONTENT; then
            CONTENT="$(trim "${CONTENT}")"
            test ${CONTENT} && echo "${CONTENT}"
          else
            # eval local $ATTRIBUTES => eval test "\"\$${attribute}\"" will be true for matching attributes
            eval local $ATTRIBUTES
            $Debug && (echo2 "${m}${TAG_NAME}: ${M}$ATTRIBUTES${END}"; test ${CONTENT} && echo2 "${m}CONTENT=${M}$CONTENT${END}")
            if test "${attributes}"; then
              if $MULTIPLE_ATTR; then
                # we don't print "tag: attr=x ..." for a tag passed as argument: it's useful only for "any" tags so then we print the matching tags found
                ! $LIGHT && [ "x${tag}x" == "xanyx" ] && tag2print="${g6}${TAG_NAME}: "
                for attribute in ${attributes}; do
                  ! $LIGHT && attribute2print="${g10}${attribute}${g6}=${g14}"
                  if eval test "\"\$${attribute}\""; then
                    test "${tag2print}" && ${print} "${tag2print}"
                    TAGPRINTED=true; unset tag2print
                    if [ "$XAPPLY" == "true" -a "${attribute}" == "${XATTRIBUTE}" ]; then
                      eval ${print} "%s%s\ " "\${attribute2print}" "\${${XAPPLIED_COLOR}}\"\$(\$XCOMMAND \$${attribute})\"\${END}" && eval unset ${attribute}
                    else
                      eval ${print} "%s%s\ " "\${attribute2print}" "\"\$${attribute}\"" && eval unset ${attribute}
                    fi
                  fi
                done
                # this trick prints a CR only if attributes have been printed during the loop:
                $TAGPRINTED && ${print} "\n" && TAGPRINTED=false
              else
                if eval test "\"\$${attributes}\""; then
                  if $XAPPLY; then
                    eval echo "\${g}\$(\$XCOMMAND \$${attributes})" && eval unset ${attributes}
                  else
                    eval echo "\$${attributes}" && eval unset ${attributes}
                  fi
                fi
              fi
            else
              echo eval $ATTRIBUTES >>$TMP
            fi
          fi
        fi
      fi
      unset CR TAG_NAME ATTRIBUTES CONTENT COMMENTS
    done < "${fileXml}" | ${PROSTPROCESS}
    # http://mywiki.wooledge.org/BashFAQ/024
    # INFO: I set variables in a "while loop" that's in a pipeline. Why do they disappear? workaround:
    if [ -s "$TMP" ]; then
      $FORCE_PRINT && ! $LIGHT && cat $TMP
      # $FORCE_PRINT && $LIGHT && perl -pe 's/[[:space:]].*?=/ /g' $TMP
      $FORCE_PRINT && $LIGHT && sed -r 's/[^\"]*([\"][^\"]*[\"][,]?)[^\"]*/\1 /g' $TMP
      . $TMP
      rm -f $TMP
    fi
    unset ITSACOMMENT
    }

and lastly, the rtrim, trim and echo2 (to stderr) functions:

    rtrim() {
    local var=$@
    var="${var%"${var##*[![:space:]]}"}"   # remove trailing whitespace characters
    echo -n "$var"
    }
    trim() {
    local var=$@
    var="${var#"${var%%[![:space:]]*}"}"   # remove leading whitespace characters
    var="${var%"${var##*[![:space:]]}"}"   # remove trailing whitespace characters
    echo -n "$var"
    }
    echo2() { echo -e "$@" 1>&2; }

Colorization:

Oh, and you will need some neat colorizing dynamic variables to be defined first, and exported too:

    set -a
    TERM=xterm-256color
    case ${UNAME} in
    AIX|SunOS)
      M=$(${print} '\033[1;35m')
      m=$(${print} '\033[0;35m')
      END=$(${print} '\033[0m')
    ;;
    *)
      m=$(tput setaf 5)
      M=$(tput setaf 13)
      # END=$(tput sgr0)          # issue on Linux: it can produce ^[(B instead of ^[[0m, more likely when using screenrc
      END=$(${print} '\033[0m')
    ;;
    esac
    # 24 shades of grey:
    for i in $(seq 0 23); do eval g$i="$(${print} \"\\033\[38\;5\;$((232 + i))m\")" ; done
    # another way of having an array of 5 shades of grey:
    declare -a colorNums=(238 240 243 248 254)
    for num in 0 1 2 3 4; do nn[$num]=$(${print} "\033[38;5;${colorNums[$num]}m"); NN[$num]=$(${print} "\033[48;5;${colorNums[$num]}m"); done
    # piped decolorization:
    DECOLORIZE='eval sed "s,${END}\[[0-9;]*[m|K],,g"'

How to load all that stuff:

Either you know how to create functions and load them via FPATH (ksh) or an emulation of FPATH (bash). If not, just copy/paste everything on the command line.

How does it work:

    xml_read [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
      -c = NOCOLOR
      -d = Debug
      -l = LIGHT (no "attribute=" printed)
      -p = FORCE PRINT (when no attributes given)
      -x = apply a command on an attribute and print the result instead of the former value, in green color
      (no attribute given will load their values into your shell as $ATTRIBUTE=value; use '-p' to print them as well)

    xml_read server.xml title content     # print content between <title></title>
    xml_read server.xml Connector port    # print all port values from Connector tags
    xml_read server.xml any port          # print all port values from any tags

With Debug mode (-d), comments and parsed attributes are printed to stderr.

User · Answer

Well, you can use the xpath utility. I guess Perl's XML::XPath contains it.
