[regex] How to output only captured groups with sed?

Is there any way to tell sed to output only captured groups? For example given the input:

This is a sample 123 text and some 987 numbers

and pattern:

/([\d]+)/

Could I get only 123 and 987 output in the way formatted by back references?


The answer is


The key to getting this to work is to tell sed to exclude what you don't want to be output as well as specifying what you do want.

string='This is a sample 123 text and some 987 numbers'
echo "$string" | sed -rn 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'

This says:

  • don't default to printing each line (-n)
  • exclude zero or more non-digits
  • include one or more digits
  • exclude one or more non-digits
  • include one or more digits
  • exclude zero or more non-digits
  • print the substitution (p)
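As a quick check of the behaviour described above, here is a minimal sketch that wraps the same substitution in a helper and feeds it lines with one, two, and three runs of digits. The helper name two_numbers is hypothetical, and the sketch assumes a sed that accepts -E (both GNU and BSD sed do).

```shell
# Hypothetical helper: the same substitution as above, spelled with -E.
two_numbers() {
  sed -En 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'
}

echo 'This is a sample 123 text and some 987 numbers' | two_numbers   # 123 987
echo 'only 42 here' | two_numbers    # prints nothing: the pattern needs two runs of digits
echo 'a 1 b 2 c 3' | two_numbers     # 1 23 (the third run is outside the match and is kept as-is)
```

The last call shows why this pattern suits input with exactly two numbers per line: anything after the matched part is left untouched.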

In general, in sed you capture groups using parentheses and output what you capture using a back reference:

echo "foobarbaz" | sed 's/^foo\(.*\)baz$/\1/'

will output "bar". If you use -r (-E on OS X; GNU sed accepts -E as well) for extended regex, you don't need to escape the parentheses:

echo "foobarbaz" | sed -r 's/^foo(.*)baz$/\1/'

There can be up to 9 capture groups and their back references. The back references are numbered in the order the groups appear, but they can be used in any order and can be repeated:

echo "foobarbaz" | sed -r 's/^foo(.*)b(.)z$/\2 \1 \2/'

outputs "a bar a".
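Reordering back references is handy for reshaping structured fields. The sketch below is only an illustration: the helper name iso_to_dmy and the date format are assumptions, not part of the question. It uses | as the delimiter of the s command so the slashes in the replacement need no escaping.

```shell
# Hypothetical helper: rewrite YYYY-MM-DD as DD/MM/YYYY by reordering three groups.
iso_to_dmy() {
  sed -E 's|^([0-9]{4})-([0-9]{2})-([0-9]{2})$|\3/\2/\1|'
}

echo '2024-01-15' | iso_to_dmy   # 15/01/2024
```

Lines that do not match the pattern pass through unchanged, because no -n/p pair is used here.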

If you have GNU grep (it may also work in BSD, including OS X):

echo "$string" | grep -Po '\d+'

or variations such as:

echo "$string" | grep -Po '(?<=\D )(\d+)'

The -P option enables Perl Compatible Regular Expressions. See man 3 pcrepattern or man 3 pcresyntax.
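Another variation, sketched under the assumption that your grep was built with PCRE support: \K discards everything matched so far, so -o prints only what follows it. That gets close to "print only the captured part" without a lookbehind. The helper name number_after is hypothetical.

```shell
# Hypothetical helper: print the number that directly follows a given word.
# $1 is assumed to be a plain word with no regex metacharacters.
number_after() {
  grep -Po "$1 \K\d+"
}

string='This is a sample 123 text and some 987 numbers'
echo "$string" | number_after sample   # 123
echo "$string" | number_after some     # 987
```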


Give up and use Perl

Since sed does not cut it, let's just throw in the towel and use Perl; at least it is in the LSB, while the GNU grep extensions are not :-)

  • Print the entire matching part, no matching groups or lookbehind needed:

    cat <<EOS | perl -lane 'print m/\d+/g'
    a1 b2
    a34 b56
    EOS
    

    Output:

    12
    3456
    
  • Single match per line, often structured data fields:

    cat <<EOS | perl -lape 's/.*?a(\d+).*/$1/g'
    a1 b2
    a34 b56
    EOS
    

    Output:

    1
    34
    

    With lookbehind:

    cat <<EOS | perl -lane 'print m/(?<=a)(\d+)/'
    a1 b2
    a34 b56
    EOS
    
  • Multiple fields:

    cat <<EOS | perl -lape 's/.*?a(\d+).*?b(\d+).*/$1 $2/g'
    a1 c0 b2 c0
    a34 c0 b56 c0
    EOS
    

    Output:

    1 2
    34 56
    
  • Multiple matches per line, often unstructured data:

    cat <<EOS | perl -lape 's/.*?a(\d+)|.*/$1 /g'
    a1 b2
    a34 b56 a78 b90
    EOS
    

    Output:

    1 
    34 78
    

    With lookbehind:

    cat <<EOS | perl -lane 'print m/(?<=a)(\d+)/g'
    a1 b2
    a34 b56 a78 b90
    EOS
    

    Output:

    1
    3478
    

It's not what the OP asked for (capturing groups) but you can extract the numbers using:

S='This is a sample 123 text and some 987 numbers'
echo "$S" | sed 's/ /\n/g' | sed -r '/([0-9]+)/ !d'

Gives the following:

123
987

You can use grep:

grep -Eow "[0-9]+" file
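Note that -w only accepts matches that form a whole word, so digits embedded in a word are skipped. A small sketch of the difference, reading from standard input instead of a file; the helper names digits_any and digits_words are hypothetical.

```shell
# Hypothetical helpers: the same pattern with and without -w.
digits_any()   { grep -Eo  '[0-9]+'; }
digits_words() { grep -Eow '[0-9]+'; }

echo 'a sam33ple 123 text' | digits_any     # prints 33 and 123, one per line
echo 'a sam33ple 123 text' | digits_words   # prints 123 only
```

Drop -w if numbers glued to letters should be extracted as well.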

Try

sed -n -e "/[0-9]/s/^[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*$/\1 \2 \3 \4 \5 \6 \7 \8 \9/p"

I got this under cygwin:

$ (echo "asdf"; \
   echo "1234"; \
   echo "asdf1234adsf1234asdf"; \
   echo "1m2m3m4m5m6m7m8m9m0m1m2m3m4m5m6m7m8m9") | \
  sed -n -e "/[0-9]/s/^[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*$/\1 \2 \3 \4 \5 \6 \7 \8 \9/p"

1234
1234 1234
1 2 3 4 5 6 7 8 9
$

run(s) of digits

This answer works with any count of digit groups. Example:

$ echo 'Num123that456are7899900contained0018166intext' \
   | sed -En 's/[^0-9]*([0-9]{1,})[^0-9]*/\1 /gp'

123 456 7899900 0018166

Expanded answer.

Is there any way to tell sed to output only captured groups?

Yes. Replace all text with the capture group:

$ echo 'Number 123 inside text' \
   | sed 's/[^0-9]*\([0-9]\{1,\}\)[^0-9]*/\1/'

123

s/[^0-9]*                           # zero or more non-digits
         \([0-9]\{1,\}\)            # followed by one or more digits
                        [^0-9]*     # and followed by zero or more non-digits.
                               /\1/ # gets replaced only by the digits.

Or with extended syntax (fewer backslashes, and + is allowed):

$ echo 'Number 123 in text' \
   | sed -E 's/[^0-9]*([0-9]+)[^0-9]*/\1/'

123

To avoid printing the original text when there is no number, use:

$ echo 'Number xxx in text' \
   | sed -En 's/[^0-9]*([0-9]+)[^0-9]*/\1/p'

  • (-n) do not print the input by default.
  • (p) print only if a replacement was done.
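The same -n/p combination works as a filter over many lines. The sketch below is a variation, not the command above: it ends the pattern with .* so that everything after the first run of digits is discarded, which keeps a second number on the line from leaking into the output. The helper name first_number is hypothetical.

```shell
# Hypothetical helper: print the first run of digits on each line,
# and print nothing for lines that contain no digits.
first_number() {
  sed -En 's/[^0-9]*([0-9]+).*/\1/p'
}

printf 'no digits here\nNumber 123 in text\nid 7 and 99\n' | first_number
# 123
# 7
```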

And to match several numbers (and also print them):

$ echo 'N 123 in 456 text' \
  | sed -En 's/[^0-9]*([0-9]+)[^0-9]*/\1 /gp'

123 456

That works for any count of digit runs:

$ str='Test Num(s) 123 456 7899900 contained as0018166df in text'
$ echo "$str" \
   | sed -En 's/[^0-9]*([0-9]{1,})[^0-9]*/\1 /gp'

123 456 7899900 0018166
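The /gp form leaves a trailing space after the last number. If that matters, one option (a sketch of a different approach, not this answer's own command) is to collapse every run of non-digits into a single space and then trim both ends. The helper name digit_runs is hypothetical.

```shell
# Hypothetical helper: print all runs of digits on a line, space-separated,
# with no leading or trailing space. Lines without digits print nothing.
digit_runs() {
  sed -En '/[0-9]/{
    s/[^0-9]+/ /g
    s/^ //
    s/ $//
    p
  }'
}

echo 'Test Num(s) 123 456 7899900 contained as0018166df in text' | digit_runs
# 123 456 7899900 0018166
```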

Which is very similar to the grep command:

$ str='Test Num(s) 123 456 7899900 contained as0018166df in text'
$ echo "$str" | grep -Po '\d+'
123
456
7899900
0018166

About \d

and pattern: /([\d]+)/

Sed does not recognize the '\d' (shortcut) syntax. The ASCII range used above, [0-9], is not exactly equivalent. The only alternative solution is to use a character class: '[[:digit:]]'.

The selected answer uses such "character classes" to build a solution:

$ str='This is a sample 123 text and some 987 numbers'
$ echo "$str" | sed -rn 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'

That solution only works for (exactly) two runs of digits.

Of course, as the answer is executed inside the shell, we can define a couple of variables to make that answer shorter:

$ str='This is a sample 123 text and some 987 numbers'
$ d='[[:digit:]]'     D='[^[:digit:]]'
$ echo "$str" | sed -rn "s/$D*($d+)$D+($d+)$D*/\1 \2/p"

But, as has been already explained, using a s/…/…/gp command is better:

$ str='This is 75577 a sam33ple 123 text and some 987 numbers'
$ d='[[:digit:]]'     D='[^[:digit:]]'
$ echo "$str" | sed -rn "s/$D*($d+)$D*/\1 /gp"
75577 33 123 987

That will cover both repeated runs of digits and writing a short(er) command.


Sed has up to nine remembered patterns but you need to use escaped parentheses to remember portions of the regular expression.

See here for examples and more detail


I believe the pattern given in the question was by way of example only, and the goal was to match any pattern.

If you have a sed with the GNU extension allowing insertion of a newline in the pattern space, one suggestion is:

> set string = "This is a sample 123 text and some 987 numbers"
>
> set pattern = "[0-9][0-9]*"
> echo $string | sed "s/$pattern/\n&\n/g" | sed -n "/$pattern/p"
123
987
> set pattern = "[a-z][a-z]*"
> echo $string | sed "s/$pattern/\n&\n/g" | sed -n "/$pattern/p"
his
is
a
sample
text
and
some
numbers

These examples are with tcsh (yes, I know it's the wrong shell) with CYGWIN. (Edit: For bash, remove set, and the spaces around =.)
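For Bourne-style shells, the same two-step idea can be sketched as a small function. The name extract_matches is hypothetical, and the sketch assumes GNU sed, since it relies on \n in the replacement to insert a newline.

```shell
# Hypothetical helper: print every match of a basic regular expression,
# one per line. $1 must not contain "/" because it is the s command delimiter.
extract_matches() {
  sed "s/$1/\n&\n/g" | sed -n "/$1/p"
}

string='This is a sample 123 text and some 987 numbers'
echo "$string" | extract_matches '[0-9][0-9]*'
# 123
# 987
```

The first sed puts each match on a line of its own; the second keeps only the lines that match the pattern.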