[regex] Non greedy (reluctant) regex matching in sed?

I'm trying to use sed to clean up lines of URLs to extract just the domain.

So from:

http://www.suepearson.co.uk/product/174/71/3816/

I want:

http://www.suepearson.co.uk/

(either with or without the trailing slash, it doesn't matter)

I have tried:

 sed 's|\(http:\/\/.*?\/\).*|\1|'

and (escaping the non-greedy quantifier)

sed 's|\(http:\/\/.*\?\/\).*|\1|'

but I can not seem to get the non-greedy quantifier (?) to work, so it always ends up matching the whole string.


The answer is


In this specific case, you can get the job done without a non-greedy quantifier.

Use the negated character class [^/]* instead of .*?:

sed 's|\(http://[^/]*/\).*|\1|g'
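
A minimal sketch of the same idea wrapped in a shell function. The name url_root is made up for illustration, and the `https\{0,1\}` part is an assumption that you also want to accept https URLs:

```shell
# url_root is a hypothetical helper name, not part of sed.
# [^/]* cannot cross a slash, so the match stops at the first "/" after the host.
url_root() {
    printf '%s\n' "$1" | sed 's|\(https\{0,1\}://[^/]*/\).*|\1|'
}

url_root 'http://www.suepearson.co.uk/product/174/71/3816/'
# prints: http://www.suepearson.co.uk/
```

A URL with no slash after the host (e.g. http://example.org) does not match the pattern and is printed unchanged.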

Simulating lazy (un-greedy) quantifier in sed

And all other regex flavors!

  1. Finding first occurrence of an expression:

    • POSIX ERE (using -r option)

      Regex:

        (EXPRESSION).*|.
      

      Sed:

        sed -r 's/(EXPRESSION).*|./\1/g' # Global `g` modifier should be on
      

      Example (finding first sequence of digits):

        $ sed -r 's/([0-9]+).*|./\1/g' <<< 'foo 12 bar 34'
      
        12
      

      How does it work?

      This regex relies on the alternation |. At each position the engine picks the longest match (this is the POSIX rule, which a few other engines follow as well), so it falls back to . until a match is found for ([0-9]+).*. In engines that instead try the alternatives from left to right, the order of the two branches matters too.

      Since the global flag is set, the engine keeps matching character by character until it reaches the end of the input string or our target. As soon as the first and only capturing group on the left side of the alternation, (EXPRESSION), is matched, the rest of the line is consumed immediately by .*. The value we want is now held in the first capturing group.
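
      The same pattern with a different EXPRESSION, as a rough sketch. The function name first_decimal is hypothetical, and -E is used as the portable spelling of the ERE option (the -r above is the GNU one):

```shell
# first_decimal is a hypothetical helper: print the first decimal number in a string.
# Characters before the target are matched by "." and replaced with an empty \1;
# once ([0-9]+\.[0-9]+) matches, ".*" swallows the rest of the line.
first_decimal() {
    printf '%s\n' "$1" | sed -E 's/([0-9]+\.[0-9]+).*|./\1/g'
}

first_decimal 'pi 3.14 e 2.71'
# prints: 3.14
```

      Note that a line with no match at all comes out empty, because every character is then replaced by the empty group.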

    • POSIX BRE

      Regex:

        \(\(\(EXPRESSION\).*\)*.\)*
      

      Sed:

        sed 's/\(\(\(EXPRESSION\).*\)*.\)*/\3/'
      

      Example (finding first sequence of digits):

        $ sed 's/\(\(\([0-9]\{1,\}\).*\)*.\)*/\3/' <<< 'foo 12 bar 34'
      
        12
      

      This works like the ERE version but without alternation. At each position the engine tries to match a digit.

      If one is found, the following digits are consumed and captured and the rest of the line is matched immediately. Otherwise, since * means zero or more, the engine skips over the second capturing group \(\([0-9]\{1,\}\).*\)* and arrives at the dot . to match a single character, and the process continues.

  2. Finding first occurrence of a delimited expression:

    This approach will match the very first occurrence of a string that is delimited. We can call it a block of string.

    sed 's/\(END-DELIMITER-EXPRESSION\).*/\1/
         s/\(\(START-DELIMITER-EXPRESSION.*\)*.\)*/\1/g'
    

    Input string:

    foobar start block #1 end barfoo start block #2 end
    

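    - EDE (end delimiter expression): end

    - SDE (start delimiter expression): start

    $ sed 's/\(end\).*/\1/; s/\(\(start.*\)*.\)*/\1/g' <<< 'foobar start block #1 end barfoo start block #2 end'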
    

    Output:

    start block #1 end
    

    The first regex, \(end\).*, matches and captures the first end delimiter end and substitutes the whole match with the captured characters, i.e. the end delimiter. At this stage our output is: foobar start block #1 end.

    Then the result is passed to the second regex, \(\(start.*\)*.\)*, which works like the POSIX BRE version above. It matches a single character if the start delimiter start is not matched; otherwise it matches and captures the start delimiter and matches the rest of the characters.


Directly answering your question

Using approach #2 (delimited expression) you should select two appropriate expressions:

  • EDE: [^:/]\/

  • SDE: http:

Usage:

$ sed 's/\([^:/]\/\).*/\1/g; s/\(\(http:.*\)*.\)*/\1/' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'

Output:

http://www.suepearson.co.uk/

Note: this will not work with identical delimiters.


sed -E interprets regular expressions as extended (modern) regular expressions

Update: -E on MacOS X, -r in GNU sed.


Another way, not using a regex, is the fields/delimiter method, e.g.:

string="http://www.suepearson.co.uk/product/174/71/3816/"
echo "$string" | awk -F"/" '{print $1,$2,$3}' OFS="/"

sed -E 's|(http://[^/]+/).*|\1|'

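Here is something you can do with a two-step approach and awk (gensub is a GNU awk extension):

A=http://www.suepearson.co.uk/product/174/71/3816/
echo "$A" | awk '
{
  var = gensub(/\//, "||", 3, $0)   # replace the 3rd "/" with a "||" marker
  sub(/\|\|.*/, "", var)            # drop the marker and everything after it
  print var
}'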

Output: http://www.suepearson.co.uk

Hope that helps!


This is how to robustly do non-greedy matching of multi-character strings using sed. Let's say you want to change every foo...bar to <foo...bar>, so for example this input:

$ cat file
ABC foo DEF bar GHI foo KLM bar NOP foo QRS bar TUV

should become this output:

ABC <foo DEF bar> GHI <foo KLM bar> NOP <foo QRS bar> TUV

To do that you convert foo and bar to individual characters and then use the negation of those characters between them:

$ sed 's/@/@A/g; s/{/@B/g; s/}/@C/g; s/foo/{/g; s/bar/}/g; s/{[^{}]*}/<&>/g; s/}/bar/g; s/{/foo/g; s/@C/}/g; s/@B/{/g; s/@A/@/g' file
ABC <foo DEF bar> GHI <foo KLM bar> NOP <foo QRS bar> TUV

In the above:

  1. s/@/@A/g; s/{/@B/g; s/}/@C/g escapes @ first and then converts { and } to placeholder strings (@B, @C) that cannot exist in the input, so those chars are then free to stand in for foo and bar.
  2. s/foo/{/g; s/bar/}/g is converting foo and bar to { and } respectively
  3. s/{[^{}]*}/<&>/g is performing the op we want - converting foo...bar to <foo...bar>
  4. s/}/bar/g; s/{/foo/g is converting { and } back to foo and bar.
  5. s/@C/}/g; s/@B/{/g; s/@A/@/g is converting the placeholder strings back to their original characters.

Note that the above does not rely on any particular string being absent from the input, since it manufactures such strings in the first step. Nor does it care which occurrence of any particular regexp you want to match, since you can use {[^{}]*} as many times as necessary in the expression to isolate the actual match you want and/or use sed's numeric match flag, e.g. to only replace the 2nd occurrence:

$ sed 's/@/@A/g; s/{/@B/g; s/}/@C/g; s/foo/{/g; s/bar/}/g; s/{[^{}]*}/<&>/2; s/}/bar/g; s/{/foo/g; s/@C/}/g; s/@B/{/g; s/@A/@/g' file
ABC foo DEF bar GHI <foo KLM bar> NOP foo QRS bar TUV
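
The same five steps applied to a different pair of multi-character delimiters, as a sketch only. The helper name strip_comments and the choice of <!-- and --> as delimiters are assumptions for illustration:

```shell
# strip_comments is a hypothetical helper: delete every <!-- ... --> span non-greedily.
strip_comments() {
    printf '%s\n' "$1" | sed '
        # 1. free up { and } by moving @, { and } to @A, @B, @C
        s/@/@A/g; s/{/@B/g; s/}/@C/g
        # 2. turn the multi-character delimiters into single characters
        s/<!--/{/g; s/-->/}/g
        # 3. the actual operation: [^{}]* cannot run past the nearest }
        s/{[^{}]*}//g
        # 4. restore any unmatched delimiters
        s/}/-->/g; s/{/<!--/g
        # 5. restore the original @, { and }
        s/@C/}/g; s/@B/{/g; s/@A/@/g'
}

strip_comments 'a <!-- x --> b <!-- y --> c'
# prints: a  b  c   (two spaces where each comment used to be)
```

A plain greedy s/<!--.*-->//g on the same input would also remove the b between the two comments.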

echo "/home/one/two/three/myfile.txt" | sed 's|\(.*\)/.*|\1|'

Don't bother, I got it on another forum :)


sed 's|\(http:\/\/www\.[a-z.0-9]*\/\).*|\1|' works too


There is still hope of solving this using pure (GNU) sed. Although this is not a generic solution, in some cases you can use "loops" to eliminate all the unnecessary parts of the string like this:

sed -r -e ":loop" -e 's|(http://.+)/.*|\1|' -e "t loop"
  • -r: Use extended regex (for + and unescaped parenthesis)
  • ":loop": Define a new label named "loop"
  • -e: add commands to sed
  • "t loop": Jump back to label "loop" if there was a successful substitution

The only problem here is it will also cut the last separator character ('/'), but if you really need it you can still simply put it back after the "loop" finished, just append this additional command at the end of the previous command line:

-e "s,$,/,"
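
Putting the loop and the trailing-slash fix together, as a small sketch. The function name url_root_loop is made up, and -E is used in place of -r:

```shell
# url_root_loop is a hypothetical helper name.
# Each pass of the loop strips the last "/..." component; "t loop" repeats
# until the substitution no longer matches, then "/" is appended again.
url_root_loop() {
    printf '%s\n' "$1" |
        sed -E -e ':loop' -e 's|(http://.+)/.*|\1|' -e 't loop' -e 's,$,/,'
}

url_root_loop 'http://www.suepearson.co.uk/product/174/71/3816/'
# prints: http://www.suepearson.co.uk/
```

The loop stops by itself because .+ needs at least one character between http:// and the next slash, so the two slashes of the scheme are never eaten.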

sed does not support "non greedy" operator.

You have to use "[]" operator to exclude "/" from match.

sed 's,\(http://[^/]*\)/.*,\1,'

P.S. there is no need to backslash "/".


Non-greedy solution for more than a single character

This thread is really old but I assume people still need it. Let's say you want to kill everything up to the very first occurrence of HELLO. You cannot say [^HELLO], because a bracket expression negates single characters, not the whole word.

So a nice solution involves two steps, assuming that you can spare a unique word that you are not expecting in the input, say top_sekrit.

In this case we can:

s/HELLO/top_sekrit/     #will only replace the very first occurrence
s/.*top_sekrit//        #kill everything till end of the first HELLO

Of course, with a simpler input you could use a smaller word, or maybe even a single character.
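
The two steps run together on a sample line, as a sketch. The function name after_first_hello is hypothetical; top_sekrit is the placeholder word chosen above and is assumed not to occur in the input:

```shell
# after_first_hello is a hypothetical helper name.
# Step 1 marks only the first HELLO (no g flag); step 2 can then be greedy,
# because the marker occurs exactly once on the line.
after_first_hello() {
    printf '%s\n' "$1" | sed -e 's/HELLO/top_sekrit/' -e 's/.*top_sekrit//'
}

after_first_hello 'aa HELLO bb HELLO cc'
# prints: " bb HELLO cc" (without the quotes, leading space kept)
```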

HTH!


Another sed version:

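sed 's|\([^/]\)/[[:alnum:]].*|\1|' file.txt

It matches a non-slash character, then /, then an alphanumeric character, as well as the rest of the characters to the end of the line, and replaces all of that with the captured first character. Two details matter: the class has to be written [[:alnum:]] (a bare [:alnum:] is just a set containing the colon and the letters a, l, n, u, m), and the leading [^/] keeps the pattern from starting at the second slash of //.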


I realize this is an old entry, but someone may find it useful. As the full domain name may not exceed a total length of 253 characters, replace .* with .\{1,255\} (no space inside the braces).


This can be done using cut:

echo "http://www.suepearson.co.uk/product/174/71/3816/" | cut -d'/' -f1-3

@Daniel H (concerning your comment on andcoz' answer, although it was a long time ago): deleting trailing zeros works with the following (extended syntax, so use sed -E or -r):

s,([[:digit:]]\.[[:digit:]]*[1-9])[0]*$,\1,g

it's about clearly defining the matching conditions ...


Because you specifically stated you're trying to use sed (instead of perl, cut, etc.), try grouping. This sidesteps the non-greedy quantifier not being recognized. The first group is the protocol (i.e. 'http://', 'https://', 'tcp://', etc.). The second group is the domain:

echo "http://www.suon.co.uk/product/1/7/3/" | sed "s|^\(.*//\)\([^/]*\).*$|\1\2|"

If you're not familiar with grouping, start here.
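
One caveat with the command above: the leading .*// is itself greedy, so a URL whose path contains a second // is matched up to that later pair. A possible variant, offered as a sketch with a made-up function name, uses [^/]* for the protocol part as well:

```shell
# url_root_any is a hypothetical helper name.
# [^/]*// can only match the first "//", so the protocol group stays short.
url_root_any() {
    printf '%s\n' "$1" | sed 's|^\([^/]*//\)\([^/]*\).*$|\1\2|'
}

url_root_any 'https://example.org/a/b'
# prints: https://example.org
url_root_any 'http://a.example/x//y'
# prints: http://a.example
```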


sed certainly has its place, but this is not one of them!

As Dee has pointed out: just use cut. It is far simpler and much safer in this case. Here's an example where we extract various components from the URL using Bash syntax:

url="http://www.suepearson.co.uk/product/174/71/3816/"

protocol=$(echo "$url" | cut -d':' -f1)
host=$(echo "$url" | cut -d'/' -f3)
urlhost=$(echo "$url" | cut -d'/' -f1-3)
urlpath=$(echo "$url" | cut -d'/' -f4-)

gives you:

protocol = "http"
host = "www.suepearson.co.uk"
urlhost = "http://www.suepearson.co.uk"
urlpath = "product/174/71/3816/"

As you can see, this is a much more flexible approach.

(all credit to Dee)


Have not yet seen this answer, so here's how you can do this with vi or vim:

vi -c '%s/\(http:\/\/.\{-}\/\).*/\1/ge | wq' file &>/dev/null

This runs the vi :%s substitution globally (the trailing g), refrains from raising an error if the pattern is not found (e), then saves the resulting changes to disk and quits. The &>/dev/null prevents the GUI from briefly flashing on screen, which can be annoying.

I like using vi sometimes for super complicated regexes, because (1) perl is dying, (2) vim has a very advanced regex engine, and (3) I'm already intimately familiar with vi regexes in my day-to-day usage editing documents.


sed - non greedy matching by Christoph Sieghart

The trick to get non greedy matching in sed is to match all characters excluding the one that terminates the match. I know, a no-brainer, but I wasted precious minutes on it and shell scripts should be, after all, quick and easy. So in case somebody else might need it:

Greedy matching

% echo "<b>foo</b>bar" | sed 's/<.*>//g'
bar

Non greedy matching

% echo "<b>foo</b>bar" | sed 's/<[^>]*>//g'
foobar

With sed, I usually implement non-greedy search by searching for anything except the separator until the separator :

echo "http://www.suon.co.uk/product/1/7/3/" | sed -n 's;\(http://[^/]*\)/.*;\1;p'

Output:

http://www.suon.co.uk

That is:

  • -n: don't print lines by default
  • s/<pattern>/<replace>/p: search, match the pattern, replace and print
  • use ; as the s command separator instead of / so the slashes in the URL need no escaping: s;<pattern>;<replace>;p
  • remember the match between \( ... \), later accessible with \1, \2...
  • match http://
  • followed by a bracket expression []; [ab/] would mean either a or b or /
  • a leading ^ in [] means not, so followed by anything but the characters in the []
  • so [^/] means any character except /
  • * repeats the previous item, so [^/]* means any run of characters other than /
  • so far, sed -n 's;\(http://[^/]*\) means: search for http:// followed by any characters except / and remember what you've found
  • we want to search until the end of the domain, so stop on the next / by adding another / at the end: sed -n 's;\(http://[^/]*\)/' but we also want to match the rest of the line after the domain, so add .*
  • now the match remembered in group 1 (\1) is the domain, so replace the matched line with what was saved in group \1 and print: sed -n 's;\(http://[^/]*\)/.*;\1;p'

If you want to include the slash after the domain as well, then add one more slash inside the group to remember:

echo "http://www.suon.co.uk/product/1/7/3/" | sed -n 's;\(http://[^/]*/\).*;\1;p'

output:

http://www.suon.co.uk/
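
A short sketch of what the -n ... p pair buys you on multi-line input: only lines where the substitution succeeded are printed. The function name domains_only and the sample hosts are made up for illustration:

```shell
# domains_only is a hypothetical helper: read lines on stdin, print the
# scheme+host of each line that contains an http URL, and drop the rest.
domains_only() {
    sed -n 's;\(http://[^/]*\)/.*;\1;p'
}

printf '%s\n' 'http://a.example/x/y' 'not a url' 'http://b.example/' | domains_only
# prints:
# http://a.example
# http://b.example
```

A line such as http://c.example, with no slash after the host, is also dropped, because the pattern requires that slash.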
