[xml] How to execute XPath one-liners from shell?

Is there a package out there, for Ubuntu and/or CentOS, that has a command-line tool that can execute an XPath one-liner like foo //element@attribute filename.xml or foo //element@attribute < filename.xml and return the results line by line?

I'm looking for something that would allow me to just apt-get install foo or yum install foo and then just works out-of-the-box, no wrappers or other adaptation necessary.

Here are some examples of things that come close:

Nokogiri. If I write this wrapper I could call the wrapper in the way described above:

#!/usr/bin/ruby

require 'nokogiri'

Nokogiri::XML(STDIN).xpath(ARGV[0]).each do |row|
  puts row
end

XML::XPath. Would work with this wrapper:

#!/usr/bin/perl

use strict;
use warnings;
use XML::XPath;

my $root = XML::XPath->new(ioref => 'STDIN');
for my $node ($root->find($ARGV[0])->get_nodelist) {
  print($node->getData, "\n");
}

xpath from XML::XPath returns too much noise, -- NODE -- and attribute = "value".

xml_grep from XML::Twig cannot handle expressions that do not return elements, so cannot be used to extract attribute values without further processing.

EDIT:

echo cat //element/@attribute | xmllint --shell filename.xml returns noise similar to xpath.

xmllint --xpath //element/@attribute filename.xml returns attribute = "value".

xmllint --xpath 'string(//element/@attribute)' filename.xml returns what I want, but only for the first match.

For another solution almost satisfying the question, here is an XSLT that can be used to evaluate arbitrary XPath expressions (requires dyn:evaluate support in the XSLT processor):

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
    xmlns:dyn="http://exslt.org/dynamic" extension-element-prefixes="dyn">
  <xsl:output omit-xml-declaration="yes" indent="no" method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="dyn:evaluate($pattern)">
      <xsl:value-of select="dyn:evaluate($value)"/>
      <xsl:value-of select="'&#10;'"/>
    </xsl:for-each> 
  </xsl:template>
</xsl:stylesheet>

Run with xsltproc --stringparam pattern //element/@attribute --stringparam value . arbitrary-xpath.xslt filename.xml.

This question is related to xml shell xpath cross-platform

The answer is


It bears mentioning that nokogiri itself ships with a command line tool, which should be installed with gem install nokogiri.

You might find this blog post useful.


Since this project is apparently fairly new, check out https://github.com/jeffbr13/xq , seems to be a wrapper around lxml, but that is all you really need (and posted ad hoc solutions using lxml in other answers as well)


My Python script xgrep.py does exactly this. In order to search for all attributes attribute of elements element in files filename.xml ..., you would run it as follows:

xgrep.py "//element/@attribute" filename.xml ...

There are various switches for controlling the output, such as -c for counting matches, -i for indenting the matching parts, and -l for outputting filenames only.

The script is not available as a Debian or Ubuntu package, but all of its dependencies are.


One package that is very likely to be installed on a system already is python-lxml. If so, this is possible without installing any extra package:

python -c "from lxml.etree import parse; from sys import stdin; print('\n'.join(parse(stdin).xpath('//element/@attribute')))"

You can also try my Xidel. It is not in a package in the repository, but you can just download it from the webpage (it has no dependencies).

It has simple syntax for this task:

xidel filename.xml -e '//element/@attribute' 

And it is one of the rare of these tools that supports XPath 2.


clacke’s answer is great but I think only works if your source is well-formed XML, not normal HTML.

So to do the same for normal Web content—HTML docs that aren’t necessarily well-formed XML:

echo "<p>foo<div>bar</div><p>baz" | python -c "from sys import stdin; \
from lxml import html; \
print '\n'.join(html.tostring(node) for node in html.parse(stdin).xpath('//p'))"

And to instead use html5lib (to ensure you get the same parsing behavior as Web browsers—because like browser parsers, html5lib conforms to the parsing requirements in the HTML spec).

echo "<p>foo<div>bar</div><p>baz" | python -c "from sys import stdin; \
import html5lib; from lxml import html; \
doc = html5lib.parse(stdin, treebuilder='lxml', namespaceHTMLElements=False); \
print '\n'.join(html.tostring(node) for node in doc.xpath('//p'))

Install the BaseX database, then use it's "standalone command-line mode" like this:

basex -i - //element@attribute < filename.xml

or

basex -i filename.xml //element@attribute

The query language is actually XQuery (3.0), not XPath, but since XQuery is a superset of XPath, you can use XPath queries without ever noticing.


Sorry to be yet another voice in the fray. I tried all the tools in this thread and found none of them to be satisfactory for my needs, so I wrote my own. You can find it here: https://github.com/charmparticle/xpe

It's been uploaded to pypi, so you can easily install it with pip3 like so:

sudo pip3 install xpe

Once installed, you can use it to run xpath expressions against various kinds of input with the same level of flexibility you would get from using xpaths in selenium or javascript. Yeah, you can use xpaths against HTML with this. One caviat: if you run it against xml that has the encoding specified on the first line, it will fail. The solution for now is to pipe it to sed to remove the first line like so:

cat specified_encoding.xml | sed '1d' | xpe '//text()'

I may post an update at some point to address this issue to avoid the need for sed.


You might also be interested in xsh. It features an interactive mode where you can do whatever you like with the document:

open 1.xml ;
ls //element/@id ;
for //p[@class="first"] echo text() ;

I wasn't happy with Python one-liners for HTML XPath queries, so I wrote my own. Assumes that you installed python-lxml package or ran pip install --user lxml:

function htmlxpath() { python -c 'for x in __import__("lxml.html").html.fromstring(__import__("sys").stdin.read()).xpath(__import__("sys").argv[1]): print(x)' $1 }

Once you have it, you can use it like in this example:

> curl -s https://slashdot.org | htmlxpath '//title/text()'
Slashdot: News for nerds, stuff that matters

Similar to Mike's and clacke's answers, here is the python one-liner (using python >= 2.5) to get the build version from a pom.xml file that gets around the fact that pom.xml files don't normally have a dtd or default namespace, so don't appear well-formed to libxml:

python -c "import xml.etree.ElementTree as ET; \
  print(ET.parse(open('pom.xml')).getroot().find('\
  {http://maven.apache.org/POM/4.0.0}version').text)"

Tested on Mac and Linux, and doesn't require any extra packages to be installed.


Here's one xmlstarlet use case to extract data from nested elements elem1, elem2 to one line of text from this type of XML (also showing how to handle namespaces):

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<mydoctype xmlns="http://xml-namespace-uri" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://xml-namespace-uri http://xsd-uri" format="20171221A" date="2018-05-15">

  <elem1 time="0.586" length="10.586">
      <elem2 value="cue-in" type="outro" />
  </elem1>

</mydoctype>

The output will be

0.586 10.586 cue-in outro

In this snippet, -m matches the nested elem2, -v outputs attribute values (with expressions and relative addressing), -o literal text, -n adds a newline:

xml sel -N ns="http://xml-namespace-uri" -t -m '//ns:elem1/ns:elem2' \
 -v ../@time -o " " -v '../@time + ../@length' -o " " -v @value -o " " -v @type -n file.xml

If more attributes are needed from elem1, one can do it like this (also showing the concat() function):

xml sel -N ns="http://xml-namespace-uri" -t -m '//ns:elem1/ns:elem2/..' \
 -v 'concat(@time, " ", @time + @length, " ", ns:elem2/@value, " ", ns:elem2/@type)' -n file.xml

Note the (IMO unnecessary) complication with namespaces (ns, declared with -N), that had me almost giving up on xpath and xmlstarlet, and writing a quick ad-hoc converter.


In my search to query maven pom.xml files I ran accross this question. However I had the following limitations:

  • must run cross-platform.
  • must exist on all major linux distributions without any additional module installation
  • must handle complex xml-files such as maven pom.xml files
  • simple syntax

I have tried many of the above without success:

  • python lxml.etree is not part of the standard python distribution
  • xml.etree is but does not handle complex maven pom.xml files well, have not digged deep enough
  • python xml.etree does not handle maven pom.xml files for unknown reason
  • xmllint does not work either, core dumps often on ubuntu 12.04 "xmllint: using libxml version 20708"

The solution that I have come across that is stable, short and work on many platforms and that is mature is the rexml lib builtin in ruby:

ruby -r rexml/document -e 'include REXML; 
     puts XPath.first(Document.new($stdin), "/project/version/text()")' < pom.xml

What inspired me to find this one was the following articles:


Saxon will do this not only for XPath 2.0, but also for XQuery 1.0 and (in the commercial version) 3.0. It doesn't come as a Linux package, but as a jar file. Syntax (which you can easily wrap in a simple script) is

java net.sf.saxon.Query -s:source.xml -qs://element/attribute

2020 UPDATE

Saxon 10.0 includes the Gizmo tool, which can be used interactively or in batch from the command line. For example

java net.sf.saxon.Gizmo -s:source.xml
/>show //element/@attribute
/>quit

In addition to XML::XSH and XML::XSH2 there are some grep-like utilities suck as App::xml_grep2 and XML::Twig (which includes xml_grep rather than xml_grep2). These can be quite useful when working on a large or numerous XML files for quick oneliners or Makefile targets. XML::Twig is especially nice to work with for a perl scripting approach when you want to a a bit more processing than your $SHELL and xmllint xstlproc offer.

The numbering scheme in the application names indicates that the "2" versions are newer/later version of essentially the same tool which may require later versions of other modules (or of perl itself).


I've tried a couple of command line XPath utilities and when I realized I am spending too much time googling and figuring out how they work, so I wrote the simplest possible XPath parser in Python which did what I needed.

The script below shows the string value if the XPath expression evaluates to a string, or shows the entire XML subnode if the result is a node:

#!/usr/bin/env python
import sys
from lxml import etree

tree = etree.parse(sys.argv[1])
xpath = sys.argv[2]

for e in tree.xpath(xpath):

    if isinstance(e, str):
        print(e)
    else:
        print((e.text and e.text.strip()) or etree.tostring(e))

It uses lxml — a fast XML parser written in C which is not included in the standard python library. Install it with pip install lxml. On Linux/OSX might need prefixing with sudo.

Usage:

python xmlcat.py file.xml "//mynode"

lxml can also accept an URL as input:

python xmlcat.py http://example.com/file.xml "//mynode" 

Extract the url attribute under an enclosure node i.e. <enclosure url="http:...""..>):

python xmlcat.py xmlcat.py file.xml "//enclosure/@url"

Xpath in Google Chrome

As an unrelated side note: If by chance you want to run an XPath expression against the markup of a web page then you can do it straight from the Chrome devtools: right-click the page in Chrome > select Inspect, and then in the DevTools console paste your XPath expression as $x("//spam/eggs").

Get all authors on this page:

$x("//*[@class='user-details']/a/text()")

Examples related to xml

strange error in my Animation Drawable How do I POST XML data to a webservice with Postman? PHP XML Extension: Not installed How to add a Hint in spinner in XML Generating Request/Response XML from a WSDL Manifest Merger failed with multiple errors in Android Studio How to set menu to Toolbar in Android How to add colored border on cardview? Android: ScrollView vs NestedScrollView WARNING: Exception encountered during context initialization - cancelling refresh attempt

Examples related to shell

Comparing a variable with a string python not working when redirecting from bash script Get first line of a shell command's output How to run shell script file using nodejs? Run bash command on jenkins pipeline Way to create multiline comments in Bash? How to do multiline shell script in Ansible How to check if a file exists in a shell script How to check if an environment variable exists and get its value? Curl to return http status code along with the response docker entrypoint running bash script gets "permission denied"

Examples related to xpath

Xpath: select div that contains class AND whose specific child element contains text XPath: difference between dot and text() How to set "value" to input web element using selenium? How to click a href link using Selenium XPath: Get parent node from child node What is the difference between absolute and relative xpaths? Which is preferred in Selenium automation testing? How to use XPath preceding-sibling correctly Selenium and xPath - locating a link by containing text How to verify an XPath expression in Chrome Developers tool or Firefox's Firebug? Concatenate multiple node values in xpath

Examples related to cross-platform

How to execute XPath one-liners from shell? iOS / Android cross platform development How to detect reliably Mac OS X, iOS, Linux, Windows in C preprocessor? What is "stdafx.h" used for in Visual Studio? How to identify platform/compiler from preprocessor macros? How to get the home directory in Python? How to check if running in Cygwin, Mac or Linux? Automatically add all files in a folder to a target using CMake? How is Java platform-independent when it needs a JVM to run? Difference between "\n" and Environment.NewLine