[regex] How to extract string following a pattern with grep, regex or perl

I have a file that looks something like this:

    <table name="content_analyzer" primary-key="id">
      <type="global" />
    </table>
    <table name="content_analyzer2" primary-key="id">
      <type="global" />
    </table>
    <table name="content_analyzer_items" primary-key="id">
      <type="global" />
    </table>

I need to extract anything within the quotes that follow name=, i.e., content_analyzer, content_analyzer2 and content_analyzer_items.

I am doing this on a Linux box, so a solution using sed, perl, grep or bash is fine.

This question is related to regex perl sed html-parsing text-extraction

The answer is


If the structure of your xml (or text in general) is fixed, the easiest way is using cut. For your specific case:

echo '<table name="content_analyzer" primary-key="id">
  <type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
  <type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
  <type="global" />
</table>' | grep name= | cut -f2 -d '"'

An HTML parser should be used for this purpose rather than regular expressions. A Perl program that makes use of HTML::TreeBuilder:

Program

#!/usr/bin/env perl

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_file( \*DATA );
my @elements = $tree->look_down(
    sub { defined $_[0]->attr('name') }
);

for (@elements) {
    print $_->attr('name'), "\n";
}

__DATA__
<table name="content_analyzer" primary-key="id">
  <type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
  <type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
  <type="global" />
</table>

Output

content_analyzer
content_analyzer2
content_analyzer_items

Here's a solution using HTML tidy & xmlstarlet:

htmlstr='
<table name="content_analyzer" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
<type="global" />
</table>
'

echo "$htmlstr" | tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
sed '/type="global"/d' |
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -T -t -m "//x:table" -v '@name' -n

this could do it:

perl -ne 'if(m/name="(.*?)"/){ print $1 . "\n"; }'

If you're using Perl, download a module to parse the XML: XML::Simple, XML::Twig, or XML::LibXML. Don't re-invent the wheel.


The regular expression would be:

.+name="([^"]+)"

Then the grouping would be in the \1


Oops, the sed command has to precede the tidy command of course:

echo "$htmlstr" | 
sed '/type="global"/d' |
tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -T -t -m "//x:table" -v '@name' -n

Examples related to regex

Why my regexp for hyphenated words doesn't work? grep's at sign caught as whitespace Preg_match backtrack error regex match any single character (one character only) re.sub erroring with "Expected string or bytes-like object" Only numbers. Input number in React Visual Studio Code Search and Replace with Regular Expressions Strip / trim all strings of a dataframe return string with first match Regex How to capture multiple repeated groups?

Examples related to perl

The program can't start because api-ms-win-crt-runtime-l1-1-0.dll is missing while starting Apache server on my computer "End of script output before headers" error in Apache Perl - Multiple condition if statement without duplicating code? How to decrypt hash stored by bcrypt Split a string into array in Perl Turning multiple lines into one comma separated line String compare in Perl with "eq" vs "==" how to remove the first two columns in a file using shell (awk, sed, whatever) Find everything between two XML tags with RegEx Difference between \w and \b regular expression meta characters

Examples related to sed

Retrieve last 100 lines logs How to replace multiple patterns at once with sed? Insert multiple lines into a file after specified pattern using shell script Linux bash script to extract IP address Ansible playbook shell output remove white space from the end of line in linux bash, extract string before a colon invalid command code ., despite escaping periods, using sed RE error: illegal byte sequence on Mac OS X How to use variables in a command in sed?

Examples related to html-parsing

PHP: HTML: send HTML select option attribute in POST Read a HTML file into a string variable in memory Parsing HTML using Python Parse an HTML string with JS HTML Text with tags to formatted text in an Excel cell How do I parse a HTML page with Node.js Regex select all text between tags How to extract string following a pattern with grep, regex or perl How to strip HTML tags from string in JavaScript? How do you parse and process HTML/XML in PHP?

Examples related to text-extraction

How can I read pdf in python? Extracting text from a PDF file using PDFMiner in python? Getting URL parameter in java and extract a specific text from that URL Extract a single (unsigned) integer from a string How to extract string following a pattern with grep, regex or perl How to extract a substring using regex How to extract text from a PDF? PDF Parsing Using Python - extracting formatted and plain texts Python module for converting PDF to text