[linux] How to parse XML using shellscript?

I would like to know the best way to parse an XML file using a shell script.

  • Should one do it by hand?
  • Do third-party libraries exist?

If you have already done this, could you let me know how you managed it?

This question is related to linux bash shell

The answer is


Here's a full working example.
If you only need to extract email addresses, you could do something like this:
1) Suppose the XML file spam.xml looks like this:

<spam>
<victims>
  <victim>
    <name>The Pope</name>
    <email>[email protected]</email>
    <is_satan>0</is_satan>
  </victim>
  <victim>
    <name>George Bush</name>
    <email>[email protected]</email>
    <is_satan>1</is_satan>
  </victim>
  <victim>
    <name>George Bush Jr</name>
    <email>[email protected]</email>
    <is_satan>0</is_satan>
  </victim>
</victims>
</spam>

2) You can get the emails and process them with this short bash code:

#!/bin/bash
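# Read one match per line into an array (needs GNU grep for -P and bash 4+ for mapfile).
# The lookbehind keeps only the text that follows an opening <email> tag.
mapfile -t emails < <(grep -oP '(?<=<email>)[^<]+' "/my_path/spam.xml")

for i in "${!emails[@]}"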
do
  echo "$i" "${emails[$i]}"
  # instead of echo use the values to send emails, etc
done

The result of this example is:

0 [email protected]
1 [email protected]
2 [email protected]

Important note:
Don't use this for serious matters. This is OK for playing around, getting quick results, learning grep, etc. but you should definitely look for, learn and use an XML parser for production (see Micha's comment below).


This really is beyond the capabilities of shell script. Shell script and the standard Unix tools are okay at parsing line-oriented files, but things change when you talk about XML. Even simple tags can present a problem:

<MYTAG>Data</MYTAG>

<MYTAG>
     Data
</MYTAG>

<MYTAG param="value">Data</MYTAG>

<MYTAG><ANOTHER_TAG>Data
</ANOTHER_TAG></MYTAG>

Imagine trying to write a shell script that can read the data enclosed in <MYTAG>. These four very simple XML examples each show a different way this can go wrong. The first two are the same element in XML, differing only in whitespace. The third simply has an attribute attached to it. The fourth contains the data in another tag. Simple sed, awk, and grep commands cannot catch all possibilities.
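To make the problem concrete, here is a minimal sketch of a line-oriented extractor run against the four layouts above. The function name naive_extract is made up for this illustration, and the fourth sample is written with a proper closing tag.

```bash
#!/bin/bash
# Naive approach: only matches <MYTAG>text</MYTAG> when it sits on a single line
# and the opening tag carries no attributes.
naive_extract() {
  printf '%s\n' "$1" | sed -n 's:.*<MYTAG>\([^<]*\)</MYTAG>.*:\1:p'
}

one_line='<MYTAG>Data</MYTAG>'
multi_line=$'<MYTAG>\n     Data\n</MYTAG>'
with_attr='<MYTAG param="value">Data</MYTAG>'
nested=$'<MYTAG><ANOTHER_TAG>Data\n</ANOTHER_TAG></MYTAG>'

for sample in "$one_line" "$multi_line" "$with_attr" "$nested"; do
  # Show what the naive extractor recovered from each layout
  echo "extracted: '$(naive_extract "$sample")'"
done
```

Only the first layout yields Data; the other three come back empty, even though an XML parser would see the same text in all of them.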

You need to use a full-blown scripting language like Perl, Python, or Ruby. Each of these has modules that can parse XML data and make the underlying structure easier to access. I've used XML::Simple in Perl. It took me a few tries to understand it, but it did what I needed, and made my programming much easier.
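If the rest of your workflow has to stay in bash, one option is to call such a parser from the script. The sketch below assumes python3 is installed and uses its standard xml.etree.ElementTree module rather than the Perl module mentioned above. The function name mytag_text is hypothetical.

```bash
#!/bin/bash
# Assumption: python3 is on PATH; mytag_text is a made-up helper name.

# Read an XML document on stdin and print the trimmed text of each <MYTAG>.
mytag_text() {
  python3 -c '
import sys
import xml.etree.ElementTree as ET

root = ET.fromstring(sys.stdin.read())
for elem in root.iter("MYTAG"):
    # itertext() also gathers text held by nested child elements
    print("".join(elem.itertext()).strip())
'
}

if command -v python3 >/dev/null 2>&1; then
  printf '%s' '<MYTAG>Data</MYTAG>' | mytag_text
  printf '%s' $'<MYTAG>\n     Data\n</MYTAG>' | mytag_text
  printf '%s' '<MYTAG param="value">Data</MYTAG>' | mytag_text
  printf '%s' $'<MYTAG><ANOTHER_TAG>Data\n</ANOTHER_TAG></MYTAG>' | mytag_text
else
  echo "python3 not found; this sketch needs it" >&2
fi
```

Because the parsing is done by a real XML library, whitespace, attributes, and nesting no longer change the result.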


Try using xpath. You can use it to parse elements out of an xml tree.

http://www.ibm.com/developerworks/xml/library/x-tipclp/index.html
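
As a rough illustration of the XPath approach, the sketch below uses xmllint from libxml2, which is an assumption on my part. The answer above refers to a separate xpath tool whose options may differ. The sample file and addresses are hypothetical.

```bash
#!/bin/bash
# Assumption: xmllint (libxml2 utilities) is installed; the sample data is made up.
xml_file=$(mktemp)
trap 'rm -f "$xml_file"' EXIT

cat > "$xml_file" <<'EOF'
<spam>
  <victims>
    <victim><name>Alice</name><email>alice@example.com</email></victim>
    <victim><name>Bob</name><email>bob@example.com</email></victim>
  </victims>
</spam>
EOF

# Print the string value of the first node selected by an XPath expression.
xpath_value() {
  xmllint --xpath "string($1)" "$2"
}

if command -v xmllint >/dev/null 2>&1; then
  first_email=$(xpath_value '//victim[1]/email' "$xml_file")
  echo "$first_email"
else
  echo "xmllint not found; install the libxml2 utilities to try this" >&2
fi
```

Wrapping the expression in string() returns plain text instead of the serialized element, which is usually what a shell script wants to capture.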


There's also xmlstarlet (which is available for Windows as well).

http://xmlstar.sourceforge.net/doc/xmlstarlet.txt
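
A minimal sketch of what that can look like, assuming the binary is installed under the name xmlstarlet (some distributions ship it as xml). The sample file and addresses are hypothetical.

```bash
#!/bin/bash
# Assumption: the command is called xmlstarlet; the sample data is made up.
xml_file=$(mktemp)
trap 'rm -f "$xml_file"' EXIT

cat > "$xml_file" <<'EOF'
<spam>
  <victims>
    <victim><name>Alice</name><email>alice@example.com</email></victim>
    <victim><name>Bob</name><email>bob@example.com</email></victim>
  </victims>
</spam>
EOF

emails=()
if command -v xmlstarlet >/dev/null 2>&1; then
  # sel: query mode; -t: start a template; -m: match each node of the XPath;
  # -v .: print the value of the matched node; -n: print a newline
  mapfile -t emails < <(xmlstarlet sel -t -m '//victim/email' -v . -n "$xml_file")
  for i in "${!emails[@]}"; do
    echo "$i ${emails[$i]}"
  done
else
  echo "xmlstarlet not found; this sketch needs it" >&2
fi
```

This produces the same index-and-address listing as the grep loop in the first answer, but the matching is done on the parsed tree rather than on raw lines.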


I am surprised no one has mentioned xmlsh. The mission statement:

A command line shell for XML Based on the philosophy and design of the Unix Shells

xmlsh provides a familiar scripting environment, but specifically tailored for scripting xml processes.

A list of shell-like commands is provided here.

I use the xed command a lot. It is the XML equivalent of sed and allows XPath-based search and replace.


Here's a solution using xml_grep (because xpath wasn't part of our distributable and I didn't want to add it to all production machines)...

If you are looking for a specific setting in an XML file, and if all elements at a given tree level are unique, and there are no attributes, then you can use this handy function:

# File to be parsed
xmlFile="xxxxxxx"

# use xml_grep to find settings in an XML file
# Input ($1): path to setting
function getXmlSetting() {

    # Filter out the element name for parsing
    local element=`echo "$1" | sed 's/^.*\///'`

    # Verify the element is not empty
    local check=${element:?getXmlSetting invalid input: $1}

    # Parse out the CDATA from the XML element
    # 1) Find the element (xml_grep)
    # 2) Remove newlines (tr -d \n)
    # 3) Extract CDATA by looking for *element> CDATA <element*
    # 4) Remove leading and trailing spaces
    local getXmlSettingResult=`xml_grep --cond "$1" "$xmlFile" 2>/dev/null | tr -d '\n' | sed -n -e "s/.*$element>[[:space:]]*\([^[:space:]].*[^[:space:]]\)[[:space:]]*<\/$element.*/\1/p"`

    # Return the result
    echo "$getXmlSettingResult"
}

#EXAMPLE
logPath=`getXmlSetting //config/logs/path`
check=${logPath:?"XML file missing //config/logs/path"}

This will work with this structure:

<config>
  <logs>
     <path>/path/to/logs</path>
  </logs>
</config>

It will also work with this (but it won't keep the newlines):

<config>
  <logs>
     <path>
          /path/to/logs
     </path>
  </logs>
</config>

If you have duplicate <config> or <logs> or <path>, then it will only return the last one. You can probably modify the function to return an array if it finds multiple matches.

FYI: This code works on RedHat 6.3 with GNU Bash 4.1.2, but I don't think I'm doing anything specific to that setup, so it should work elsewhere too.

NOTE: For anybody new to scripting, make sure you use the right type of quote. All three are used in this code (single quote ' = literal, backtick ` = execute, and double quote " = group).
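
A short sketch of how the three quote types behave in bash. The variable names are made up for the illustration, and the last line adds the $(...) form, which is not used in the function above but does the same job as backticks.

```bash
#!/bin/bash
name="logs"

# Single quotes: the text is taken literally, so $name is not expanded.
literal='/var/$name'

# Double quotes: $name is expanded and the result stays one word, spaces included.
grouped="/var/$name dir"

# Backticks: the enclosed command runs and its output is substituted.
executed=`echo "/var/$name"`

# $(...) is an equivalent spelling of command substitution that nests more cleanly.
nested=$(echo "$(echo "/var/$name")")

printf '%s\n' "$literal" "$grouped" "$executed" "$nested"
```

The first line of output still contains the literal text $name, while the other three show the expanded value.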


Do you have xml_grep installed? It's a Perl-based utility that is standard on some distributions (it came pre-installed on my CentOS system). Rather than giving it a regular expression, you give it an XPath expression.


Try sgrep. It's not clear exactly what you are trying to do, but I surely would not attempt writing an XML parser in bash.


A rather new project is the xml-coreutils package featuring xml-cat, xml-cp, xml-cut, xml-grep, ...

http://xml-coreutils.sourceforge.net/contents.html

