I have a file that looks something like this:
<table name="content_analyzer" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
<type="global" />
</table>
I need to extract anything within the quotes that follow name=
, i.e., content_analyzer
, content_analyzer2
and content_analyzer_items
.
I am doing this on a Linux box, so a solution using sed, perl, grep or bash is fine.
This question is related to
regex
perl
sed
html-parsing
text-extraction
If the structure of your xml (or text in general) is fixed, the easiest way is using cut
. For your specific case:
echo '<table name="content_analyzer" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
<type="global" />
</table>' | grep name= | cut -f2 -d '"'
An HTML parser should be used for this purpose rather than regular expressions. A Perl program that makes use of HTML::TreeBuilder
:
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_file( \*DATA );
my @elements = $tree->look_down(
sub { defined $_[0]->attr('name') }
);
for (@elements) {
print $_->attr('name'), "\n";
}
__DATA__
<table name="content_analyzer" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
<type="global" />
</table>
content_analyzer
content_analyzer2
content_analyzer_items
Here's a solution using HTML tidy & xmlstarlet:
htmlstr='
<table name="content_analyzer" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
<type="global" />
</table>
'
echo "$htmlstr" | tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
sed '/type="global"/d' |
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -T -t -m "//x:table" -v '@name' -n
this could do it:
perl -ne 'if(m/name="(.*?)"/){ print $1 . "\n"; }'
If you're using Perl, download a module to parse the XML: XML::Simple, XML::Twig, or XML::LibXML. Don't re-invent the wheel.
The regular expression would be:
.+name="([^"]+)"
Then the grouping would be in the \1
Oops, the sed command has to precede the tidy command of course:
echo "$htmlstr" |
sed '/type="global"/d' |
tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -T -t -m "//x:table" -v '@name' -n
Source: Stackoverflow.com