How to extract string following a pattern with grep, regex or perl

Question

I have a file that looks something like this:

    <table name="content_analyzer" primary-key="id">
      <type="global" />
    </table>
    <table name="content_analyzer2" primary-key="id">
      <type="global" />
    </table>
    <table name="content_analyzer_items" primary-key="id">
      <type="global" />
    </table>

I need to extract anything within the quotes that follow name=, i.e., content_analyzer, content_analyzer2 and content_analyzer_items.

I am doing this on a Linux box, so a solution using sed, perl, grep or bash is fine.

User · Answer

An HTML parser should be used for this purpose rather than regular expressions. A Perl program that makes use of HTML::TreeBuilder:

Program

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use HTML::TreeBuilder;

    # Parse the markup that follows __DATA__ at the end of this file.
    my $tree = HTML::TreeBuilder->new_from_file( \*DATA );

    # Collect every element that carries a name attribute.
    my @elements = $tree->look_down(
        sub { defined $_[0]->attr('name') }
    );

    for (@elements) {
        print $_->attr('name'), "\n";
    }

    __DATA__
    <table name="content_analyzer" primary-key="id">
      <type="global" />
    </table>
    <table name="content_analyzer2" primary-key="id">
      <type="global" />
    </table>
    <table name="content_analyzer_items" primary-key="id">
      <type="global" />
    </table>

Output

    content_analyzer
    content_analyzer2
    content_analyzer_items

User · Answer

If the structure of your XML (or text in general) is fixed, the easiest way is using cut. For your specific case:

    echo '<table name="content_analyzer" primary-key="id">
      <type="global" />
    </table>
    <table name="content_analyzer2" primary-key="id">
      <type="global" />
    </table>
    <table name="content_analyzer_items" primary-key="id">
      <type="global" />
    </table>' | grep name= | cut -f2 -d '"'

User · Answer

This could do it:

    perl -ne 'if (m/name="(.*?)"/) { print $1 . "\n" }'

User · Answer

Since you need to match content without including it in the result (you must match name=" but it is not part of the desired result), some form of zero-width matching or group capturing is required. This can be done easily with the following tools:

Perl

With Perl you could use the -n option to loop line by line and print the content of a capturing group if it matches:

    perl -ne 'print "$1\n" if /name="(.*?)"/' filename

GNU grep

If you have an improved version of grep, such as GNU grep, you may have the -P option available. This option enables Perl-like regex, allowing you to use \K, which works like a shorthand lookbehind. It resets the start of the reported match, so anything matched before it is left out of the result.

    grep -Po 'name="\K.*?(?=")' filename

The -o option makes grep print only the matched text, instead of the whole line.

Vim - Text Editor

Another way is to use a text editor directly. With Vim, one of the various ways of accomplishing this would be to delete lines without name= and then extract the content from the resulting lines:

    :v/.*name="\v([^"]+).*/d|%s//\1

Standard grep

If you don't have access to these tools for some reason, something similar can be achieved with standard grep. However, without lookaround the output will require some cleanup later:

    grep -o 'name="[^"]*"' filename

A note about saving results

In all of the commands above the results are sent to stdout. Remember that you can always save them by redirecting them to a file, appending

    > result

to the end of the command.
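Returning to the standard grep variant: the cleanup step it leaves open can be sketched as follows. This is only one possible approach, assuming the attribute values never contain a double quote; the function name extract_names and the temporary sample file are hypothetical and not part of the answer above.

```shell
#!/bin/sh
# Hypothetical helper: print the value of every name="..." attribute
# found in the file passed as the first argument.
extract_names() {
    # grep -o prints each match on its own line, still wrapped as name="value";
    # cut splits on the double quote and keeps field 2, the bare value.
    grep -o 'name="[^"]*"' "$1" | cut -d '"' -f 2
}

# Sample input copied from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<table name="content_analyzer" primary-key="id">
  <type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
  <type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
  <type="global" />
</table>
EOF

extract_names "$sample"
rm -f "$sample"
```

Note that the pattern matches any attribute whose name ends in name= (for example a filename= attribute), so it may need tightening for other inputs.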

User · Answer

Here's a solution using HTML tidy & xmlstarlet:

    htmlstr='
    <table name="content_analyzer" primary-key="id">
    <type="global" />
    </table>
    <table name="content_analyzer2" primary-key="id">
    <type="global" />
    </table>
    <table name="content_analyzer_items" primary-key="id">
    <type="global" />
    </table>
    '

    echo "$htmlstr" |
      tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
      sed '/type="global"/d' |
      xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -T -t -m "//x:table" -v '@name' -n

User · Answer

If you're using Perl, download a module to parse the XML: XML::Simple, XML::Twig, or XML::LibXML. Don't re-invent the wheel.

User · Answer

Oops, the sed command has to precede the tidy command, of course:

    echo "$htmlstr" |
      sed '/type="global"/d' |
      tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
      xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -T -t -m "//x:table" -v '@name' -n

User · Answer

The regular expression would be:

    .+name="([^"]+)"

Then the grouping would be in \1.
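As an illustration of using that group, here is a minimal sketch with sed. It assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed both do); the function name extract_with_sed and the sample file are hypothetical. A trailing .* is added to the expression so that the rest of the line is consumed and only the captured value remains.

```shell
#!/bin/sh
# Hypothetical helper: print the captured name value from each matching line
# of the file passed as the first argument.
extract_with_sed() {
    # -n suppresses the default output; the p flag prints only lines on which
    # the substitution succeeded. \1 is the text captured by ([^"]+).
    sed -nE 's/.+name="([^"]+)".*/\1/p' "$1"
}

# Sample input copied from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<table name="content_analyzer" primary-key="id">
  <type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
  <type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
  <type="global" />
</table>
EOF

extract_with_sed "$sample"
rm -f "$sample"
```

Because .+ needs at least one character before name, a line that begins directly with name= would not match; with the sample above every name attribute is preceded by the tag name, so this does not matter here.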
