Capturing Groups From a Grep RegEx

Question

I've got this little script in sh (Mac OSX 10.6) to look through an array of files. Google has stopped being helpful at this point:

    files="*.jpg"
    for f in $files
        do
            echo $f | grep -oEi '[0-9]+_([a-z]+)_[0-9a-z]*'
            name=$?
            echo $name
        done

So far (obviously, to you shell gurus) $name merely holds 0, 1 or 2, depending on whether grep found that the filename matched the pattern provided. What I'd like is to capture what's inside the parens ([a-z]+) and store that in a variable.

I'd like to use grep only, if possible. If not, please no Python or Perl, etc.; sed or something like it is fine. I'm new to shell and would like to attack this from the *nix purist angle.

Also, as a super-cool bonus, I'm curious as to how I can concatenate strings in shell. If the group I captured was the string "somename" stored in $name, and I wanted to add the string ".jpg" to the end of it, could I cat $name '.jpg'?

Please explain what's going on, if you've got the time.

User · Answer

This is a solution that uses gawk. It's something I find I need to use often, so I created a function for it:

    function regex1 { gawk 'match($0,/'$1'/, ary) {print ary['${2:-'1'}']}'; }

To use it, just do:

    $ echo 'hello world' | regex1 'hello\s(.*)'
    world

User · Answer

    str="1w 2d 1h"
    regex="([0-9])w ([0-9])d ([0-9])h"
    if [[ $str =~ $regex ]]
    then
        week="${BASH_REMATCH[1]}"
        day="${BASH_REMATCH[2]}"
        hour="${BASH_REMATCH[3]}"
        echo $week --- $day ---- $hour
    fi

Output: 1 --- 2 ---- 1

User · Answer

I prefer a one-line python or perl command; both are often included in major Linux distributions:

    echo $'
    <a href="http://stackoverflow.com">
    </a>
    <a href="http://google.com">
    </a>
    ' | python -c $'
    import re
    import sys
    for i in sys.stdin:
      g = re.match(r\'.*href="(.*)"\', i)
      if g is not None:
        print(g.group(1))
    '

And to handle files:

    ls *.txt | python -c $'
    import sys
    import re
    for i in sys.stdin:
      i = i.strip()
      f = open(i, "r")
      for j in f:
        g = re.match(r\'.*href="(.*)"\', j)
        if g is not None:
          print(g.group(1))
      f.close()
    '

User · Answer

I realize that an answer was already accepted for this, but from a "strictly *nix purist angle" it seems like the right tool for the job is pcregrep, which doesn't seem to have been mentioned yet.

Try changing the lines:

    echo $f | grep -oEi '[0-9]+_([a-z]+)_[0-9a-z]*'
    name=$?

to the following:

    name=$(echo $f | pcregrep -o1 -Ei '[0-9]+_([a-z]+)_[0-9a-z]*')

to get only the contents of capturing group 1.

The pcregrep tool uses the same syntax you've already used with grep, but implements the functionality that you need. The parameter -o works just like the grep version if it is bare, but in pcregrep it also accepts a numeric parameter, which indicates which capturing group you want to show.

With this solution there is a bare minimum of change required in the script. You simply replace one modular utility with another and tweak the parameters.

Interesting note: you can use multiple -o arguments to return multiple capture groups in the order in which they appear on the line.

User · Answer

Not possible in just grep, I believe.

For sed:

    name=`echo $f | sed -E 's/([0-9]+_([a-z]+)_[0-9a-z]*)|.*/\2/'`

I'll take a stab at the bonus though:

    echo "$name.jpg"
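The sed answer above comes without an explanation, so here is a minimal sketch of the same idea. It is a variant, not the answer's exact command: it assumes a filename such as 001_abc_0za.jpg (a made-up sample), anchors the pattern, and uses -n with the p flag so that non-matching input produces no output at all. The variable newname is introduced only for illustration.

```shell
# Sample filename, assumed for illustration only.
f="001_abc_0za.jpg"

# -n suppresses default printing; the trailing p prints only lines where
# the substitution happened. \1 is the text matched by the first (...) group.
name=$(printf '%s\n' "$f" | sed -nE 's/^[0-9]+_([a-z]+)_[0-9a-z]*\.jpg$/\1/p')

# Concatenation in shell is just writing the pieces next to each other.
# The braces mark where the variable name ends.
newname="${name}.jpg"
echo "$newname"
```

There is no need for cat here: cat joins files, whereas strings are joined simply by placing them side by side inside the quotes.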

User · Answer

This isn't really possible with pure grep, at least not generally.

But if your pattern is suitable, you may be able to use grep multiple times within a pipeline to first reduce your line to a known format, and then to extract just the bit you want. (Although tools like cut and sed are far better at this.)

Suppose for the sake of argument that your pattern was a bit simpler: [0-9]+_([a-z]+)_. You could extract this like so:

    echo $name | grep -Ei '[0-9]+_[a-z]+_' | grep -oEi '[a-z]+'

The first grep would remove any lines that didn't match your overall pattern; the second grep (which has --only-matching specified) would display the alpha portion of the name. This only works because the pattern is suitable: "alpha portion" is specific enough to pull out what you want.

(Aside: personally I'd use grep + cut to achieve what you are after: echo $name | grep {pattern} | cut -d _ -f 2. This gets cut to parse the line into fields by splitting on the delimiter _, and returns just field 2. Field numbers start at 1.)

Unix philosophy is to have tools which do one thing, and do it well, and combine them to achieve non-trivial tasks, so I'd argue that grep + sed etc. is a more Unixy way of doing things :-)
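As a rough sketch of the grep + cut aside, applied to the question's full pattern: grep acts only as a filter, and cut then picks the second underscore-separated field. The filenames and the variable names f, name and other are assumptions made up for the example.

```shell
# Sample filename, assumed for illustration only.
f="001_abc_0za.jpg"

# grep passes the line through only if it matches; cut splits on "_"
# and returns field 2 (fields are numbered from 1).
name=$(printf '%s\n' "$f" | grep -Ei '^[0-9]+_[a-z]+_[0-9a-z]*' | cut -d _ -f 2)
echo "${name}.jpg"

# A name that does not match is dropped by grep, so cut receives nothing
# and the result is an empty string.
other=$(printf '%s\n' "notes.txt" | grep -Ei '^[0-9]+_[a-z]+_[0-9a-z]*' | cut -d _ -f 2)
```

This relies on the captured part always being the second field, which holds for this naming scheme but not for patterns in general.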

User · Answer

A suggestion for you - you can use parameter expansion to remove the part of the name from the last underscore onwards, and similarly at the start:

    f=001_abc_0za.jpg
    work=${f%_*}
    name=${work#*_}

Then name will have the value abc.

See the Apple developer docs; search forward for 'Parameter Expansion'.
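To make the two expansions above easier to follow, this small sketch shows the intermediate value at each step and contrasts the shortest-match operators (% and #) with their longest-match counterparts (%% and ##). The names greedy_suffix and greedy_prefix are invented for the example.

```shell
f="001_abc_0za.jpg"

work=${f%_*}       # remove the shortest suffix matching "_*"  -> 001_abc
name=${work#*_}    # remove the shortest prefix matching "*_"  -> abc

greedy_suffix=${f%%_*}   # remove the longest suffix matching "_*" -> 001
greedy_prefix=${f##*_}   # remove the longest prefix matching "*_" -> 0za.jpg

echo "$work $name $greedy_suffix $greedy_prefix"
```

These expansions are handled by the shell itself, so no external process such as grep or sed is started.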

User · Answer

If you have bash, you can use extended globbing:

    shopt -s extglob
    shopt -s nullglob
    shopt -s nocaseglob
    for file in +([0-9])_+([a-z])_+([a-z0-9]).jpg
    do
       IFS="_"
       set -- $file
       echo "This is your captured output : $2"
    done

or

    ls +([0-9])_+([a-z])_+([a-z0-9]).jpg | while read file
    do
       IFS="_"
       set -- $file
       echo "This is your captured output : $2"
    done
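One thing to watch in the loops above is that IFS stays set to "_" afterwards, which affects any later word splitting in the script. A possible alternative, sketched here under the assumption of a sample filename, is to set IFS only for a single read command. The variable names number, word and rest are made up for the example.

```shell
# Sample filename, assumed for illustration only.
file="001_abc_0za.jpg"

# The IFS=_ assignment applies only to this one read command.
# read splits the line on "_" and puts any leftover text into the last variable.
IFS=_ read -r number word rest <<EOF
$file
EOF

echo "This is your captured output : $word"
```

Unlike set --, this leaves the positional parameters untouched.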

User · Answer

If you're using Bash, you don't even have to use grep:

    files="*.jpg"
    regex="[0-9]+_([a-z]+)_[0-9a-z]*"
    for f in $files    # unquoted in order to allow the glob to expand
    do
        if [[ $f =~ $regex ]]
        then
            name="${BASH_REMATCH[1]}"
            echo "${name}.jpg"    # concatenate strings
            name="${name}.jpg"    # same thing stored in a variable
        else
            echo "$f doesn't match" >&2 # this could get noisy if there are a lot of non-matching files
        fi
    done

It's better to put the regex in a variable. Some patterns won't work if included literally.

This uses =~ which is Bash's regex match operator. The results of the match are saved to an array called $BASH_REMATCH. The first capture group is stored in index 1, the second (if any) in index 2, etc. Index zero is the full match.

You should be aware that without anchors, this regex (and the one using grep) will match any of the following examples and more, which may not be what you're looking for:

    123_abc_d4e5
    xyz123_abc_d4e5
    123_abc_d4e5.xyz
    xyz123_abc_d4e5.xyz

To eliminate the second and fourth examples, make your regex like this:

    ^[0-9]+_([a-z]+)_[0-9a-z]*

which says the string must start with one or more digits. The caret represents the beginning of the string. If you add a dollar sign at the end of the regex, like this:

    ^[0-9]+_([a-z]+)_[0-9a-z]*$

then the third example will also be eliminated since the dot is not among the characters in the regex and the dollar sign represents the end of the string. Note that the fourth example fails this match as well.

If you have GNU grep (around 2.5 or later, I think, when the \K operator was added):

    name=$(echo "$f" | grep -Po '(?i)[0-9]+_\K[a-z]+(?=_[0-9a-z]*)').jpg

The \K operator (variable-length look-behind) causes the preceding pattern to match, but doesn't include the match in the result. The fixed-length equivalent is (?<=) - the pattern would be included before the closing parenthesis. You must use \K if quantifiers may match strings of different lengths (e.g. +, *, {2,4}).

The (?=) operator matches fixed or variable-length patterns and is called "look-ahead". It also does not include the matched string in the result.

In order to make the match case-insensitive, the (?i) operator is used. It affects the patterns that follow it, so its position is significant.

The regex might need to be adjusted depending on whether there are other characters in the filename. You'll note that in this case, I show an example of concatenating a string at the same time that the substring is captured.
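The effect of the anchors can be checked with a short loop over the four example strings. This sketch uses grep -Eq instead of =~ so that it also runs in a plain sh; the pattern is an extended regular expression in both cases. The variable matched is introduced only to collect the results.

```shell
# Fully anchored version of the pattern discussed above.
regex='^[0-9]+_([a-z]+)_[0-9a-z]*$'
matched=""

for s in 123_abc_d4e5 xyz123_abc_d4e5 123_abc_d4e5.xyz xyz123_abc_d4e5.xyz
do
    # -q makes grep silent; only its exit status is used.
    if printf '%s\n' "$s" | grep -Eq "$regex"
    then
        matched="$matched $s"
        echo "match:    $s"
    else
        echo "no match: $s"
    fi
done
```

With both anchors in place only the first string is reported as a match; removing the ^ or the $ lets the other examples through again.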
