Non greedy (reluctant) regex matching in sed?

Question

I'm trying to use sed to clean up lines of URLs to extract just the domain. So from:

    http://www.suepearson.co.uk/product/174/71/3816/

I want:

    http://www.suepearson.co.uk/

(either with or without the trailing slash, it doesn't matter).

I have tried:

    sed 's|\(http:\/\/.*?\/\).*|\1|'

and (escaping the non-greedy quantifier):

    sed 's|\(http:\/\/.*\?\/\).*|\1|'

but I cannot seem to get the non-greedy quantifier (?) to work, so it always ends up matching the whole string.

User · Answer

sed -E interprets regular expressions as extended (modern) regular expressions. Update: the flag is -E on Mac OS X and -r in GNU sed.

User · Answer

Simulating a lazy (un-greedy) quantifier in sed, and in all other regex flavors

1. Finding the first occurrence of an expression

POSIX ERE (using the -r option)

Regex:

    (EXPRESSION).*|.

Sed:

    sed -r 's/(EXPRESSION).*|./\1/g'    # the global (g) modifier must be on

Example (finding the first sequence of digits):

    $ sed -r 's/([0-9]+).*|./\1/g' <<< 'foo 12 bar 34'
    12

How does it work? This regex benefits from an alternation (|). At each position the engine tries to pick the longest match (this is a POSIX standard which is followed by a couple of other engines as well), which means it goes with . until a match is found for ([0-9]+).*. But order is important too. Since the global flag is set, the engine continues matching character by character up to the end of the input string or our target. As soon as the first and only capturing group on the left side of the alternation is matched (EXPRESSION), the rest of the line is consumed immediately as well by .*. We now hold our value in the first capturing group.

POSIX BRE

Regex:

    \(\(\(EXPRESSION\).*\)*.\)*

Sed:

    sed 's/\(\(\(EXPRESSION\).*\)*.\)*/\3/'

Example (finding the first sequence of digits):

    $ sed 's/\(\(\([0-9]\{1,\}\).*\)*.\)*/\3/' <<< 'foo 12 bar 34'
    12

This one is like the ERE version but with no alternation involved. That's all. At each single position the engine tries to match a digit. If one is found, the following digits are consumed and captured and the rest of the line is matched immediately. Otherwise, since * means zero or more, it skips over the second capturing group \(\([0-9]\{1,\}\).*\)* and arrives at a dot (.) to match a single character, and this process continues.

2. Finding the first occurrence of a delimited expression

This approach will match the very first occurrence of a string that is delimited. We can call it a block of string.

    sed 's/\(END-DELIMITER-EXPRESSION\).*/\1/; s/\(\(START-DELIMITER-EXPRESSION.*\)*.\)*/\1/g'

Input string: foobar start block #1 end barfoo start block #2 end

- EDE: end
- SDE: start

    $ sed 's/\(end\).*/\1/; s/\(\(start.*\)*.\)*/\1/g'

Output: start block #1 end

The first regex, \(end\).*, matches and captures the first end delimiter (end) and substitutes the whole match with the captured characters, which is the end delimiter. At this stage our output is: foobar start block #1 end.

Then the result is passed to the second regex, \(\(start.*\)*.\)*, which is the same as the POSIX BRE version above. It matches a single character if the start delimiter (start) is not matched; otherwise it matches and captures the start delimiter and matches the rest of the characters.

Directly answering your question

Using approach #2 (delimited expression), you should select two appropriate expressions:

- EDE: [^:/]\/
- SDE: http:

Usage:

    $ sed 's/\([^:/]\/\).*/\1/g; s/\(\(http:.*\)*.\)*/\1/' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'

Output: http://www.suepearson.co.uk/

Note: this will not work with identical delimiters.
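As a small sketch of approach 1 in its ERE form, the following runs the digit example through a pipe rather than a here-string. It assumes a sed that accepts -E (GNU sed and the BSD/macOS sed both do); the variable name first_digits is only illustrative.

```shell
# Keep only the first run of digits on the line (ERE alternation trick).
# Characters before the first digit run are matched by the lone "." branch
# and replaced with an empty \1; once the left branch matches, the digits
# are captured and ".*" swallows the rest of the line.
first_digits=$(printf '%s\n' 'foo 12 bar 34' | sed -E 's/([0-9]+).*|./\1/g')
echo "$first_digits"
```

The g flag matters here: without it only the first character of the line would be processed.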

User · Answer

Another way, not using regex, is to use the field/delimiter method, e.g.:

    string="http://www.suepearson.co.uk/product/174/71/3816/"
    echo $string | awk -F"/" '{print $1,$2,$3}' OFS="/"

User · Answer

sed certainly has its place, but this is not one of them! As Dee has pointed out: just use cut. It is far simpler and much safer in this case. Here's an example where we extract various components from the URL using Bash syntax:

    url="http://www.suepearson.co.uk/product/174/71/3816/"
    protocol=$(echo "$url" | cut -d':' -f1)
    host=$(echo "$url" | cut -d'/' -f3)
    urlhost=$(echo "$url" | cut -d'/' -f1-3)
    urlpath=$(echo "$url" | cut -d'/' -f4-)

gives you:

    protocol = "http"
    host = "www.suepearson.co.uk"
    urlhost = "http://www.suepearson.co.uk"
    urlpath = "product/174/71/3816/"

As you can see, this is a much more flexible approach. (All credit to Dee.)

User · Answer

In this specific case, you can get the job done without using a non-greedy regex. Try the negated character class [^/]* instead of .*?:

    sed 's|\(http://[^/]*/\).*|\1|g'
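A minimal sketch of this answer applied to the URL from the question. The variable names url and domain are only illustrative.

```shell
url='http://www.suepearson.co.uk/product/174/71/3816/'

# [^/]* cannot cross a slash, so the group ends at the first "/" after the
# host; the trailing .* then consumes the path so it is dropped.
domain=$(printf '%s\n' "$url" | sed 's|\(http://[^/]*/\).*|\1|')
echo "$domain"
```

Using | as the s delimiter avoids having to escape the slashes in the pattern.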

User · Answer

This is how to robustly do non-greedy matching of multi-character strings using sed. Let's say you want to change every foo...bar to <foo...bar>, so for example this input:

    $ cat file
    ABC foo DEF bar GHI foo KLM bar NOP foo QRS bar TUV

should become this output:

    ABC <foo DEF bar> GHI <foo KLM bar> NOP <foo QRS bar> TUV

To do that, you convert foo and bar to individual characters and then use the negation of those characters between them:

    $ sed 's/@/@A/g; s/{/@B/g; s/}/@C/g; s/foo/{/g; s/bar/}/g; s/{[^{}]*}/<&>/g; s/}/bar/g; s/{/foo/g; s/@C/}/g; s/@B/{/g; s/@A/@/g' file
    ABC <foo DEF bar> GHI <foo KLM bar> NOP <foo QRS bar> TUV

In the above:

- s/@/@A/g; s/{/@B/g; s/}/@C/g converts { and } to placeholder strings that cannot exist in the input, so those characters are then available to convert foo and bar to.
- s/foo/{/g; s/bar/}/g converts foo and bar to { and } respectively.
- s/{[^{}]*}/<&>/g performs the operation we want: converting foo...bar to <foo...bar>.
- s/}/bar/g; s/{/foo/g converts { and } back to foo and bar.
- s/@C/}/g; s/@B/{/g; s/@A/@/g converts the placeholder strings back to their original characters.

Note that the above does not rely on any particular string being absent from the input, as it manufactures such strings in the first step. Nor does it care which occurrence of any particular regexp you want to match, since you can use {[^{}]*} as many times as necessary in the expression to isolate the actual match you want, and/or use sed's numeric occurrence flag, e.g. to only replace the 2nd occurrence:

    $ sed 's/@/@A/g; s/{/@B/g; s/}/@C/g; s/foo/{/g; s/bar/}/g; s/{[^{}]*}/<&>/2; s/}/bar/g; s/{/foo/g; s/@C/}/g; s/@B/{/g; s/@A/@/g' file
    ABC foo DEF bar GHI <foo KLM bar> NOP foo QRS bar TUV
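To illustrate why the escaping step matters, here is a sketch that runs the same command sequence on a line that already contains @, { and }. The sample line and the variable names line and wrapped are made up for this example.

```shell
# Hypothetical input that already contains the characters used as stand-ins.
line='a@b {x} foo 1 bar foo 2 bar'

wrapped=$(printf '%s\n' "$line" | sed '
  # 1. Escape @, { and } so they cannot be confused with the stand-ins.
  s/@/@A/g; s/{/@B/g; s/}/@C/g
  # 2. Collapse the multi-character delimiters to single characters.
  s/foo/{/g; s/bar/}/g
  # 3. A negated class now gives the shortest foo...bar match.
  s/{[^{}]*}/<&>/g
  # 4. Undo step 2, then undo step 1 in reverse order.
  s/}/bar/g; s/{/foo/g
  s/@C/}/g; s/@B/{/g; s/@A/@/g
')
echo "$wrapped"
```

The literal {x} and @ survive the round trip, and each foo...bar pair is wrapped separately instead of one greedy match spanning both.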

User · Answer

sed does not support the "non greedy" operator. You have to use the "[]" operator to exclude "/" from the match:

    sed 's,\(http://[^/]*\)/.*,\1,'

P.S. There is no need to backslash "/".

User · Answer

There is still hope to solve this using pure (GNU) sed. Although this is not a generic solution, in some cases you can use "loops" to eliminate all the unnecessary parts of the string, like this:

    sed -r -e ":loop" -e 's|(http://.+)/.*|\1|' -e "t loop"

- -r: use extended regex (for + and unescaped parentheses)
- ":loop": define a new label named "loop"
- -e: add commands to sed
- "t loop": jump back to label "loop" if there was a successful substitution

The only problem here is that it will also cut the last separator character ('/'), but if you really need it you can simply put it back after the "loop" has finished. Just append this additional command at the end of the previous command line:

    -e "s,$,/,"
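A sketch of the full loop with the trailing slash restored, run on the URL from the question. It uses -E in place of -r, on the assumption that the installed sed accepts it (recent GNU sed does); the variable names are illustrative.

```shell
url='http://www.suepearson.co.uk/product/174/71/3816/'

# Each pass strips the last "/..." segment; "t loop" repeats the substitution
# until it no longer matches, which happens once only the host is left.
# The final command appends the separator that the loop removed.
domain=$(printf '%s\n' "$url" |
  sed -E -e ':loop' -e 's|(http://.+)/.*|\1|' -e 't loop' -e 's,$,/,')
echo "$domain"
```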

User · Answer

I realize this is an old entry, but someone may find it useful. As the full domain name may not exceed a total length of 253 characters, replace .* with .\{1,255\}

User · Answer

    sed 's|\(http:\/\/www\.[a-z.0-9]*\/\).*|\1|'

also works.

User · Answer

    echo "/home/one/two/three/myfile.txt" | sed 's|\(.*\)/.*|\1|'

Don't bother, I got it on another forum :)

User · Answer

Non-greedy solution for more than a single character

This thread is really old, but I assume people still need it. Let's say you want to kill everything up to the very first occurrence of HELLO. You cannot say [^HELLO]...

So a nice solution involves two steps, assuming that you can spare a unique word that you are not expecting in the input, say top_sekrit. In this case we can:

    s/HELLO/top_sekrit/     # will only replace the very first occurrence
    s/.*top_sekrit//        # kill everything till the end of the first HELLO

Of course, with a simpler input you could use a smaller word, or maybe even a single character. HTH!
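The two steps can be sketched as a single sed invocation. The sample line is made up, and top_sekrit is assumed not to occur anywhere in the input, as the answer requires.

```shell
line='foo HELLO bar HELLO baz'

# Step 1 marks only the first HELLO (no g flag); step 2 can then use a
# greedy .* safely, because the marker occurs exactly once on the line.
rest=$(printf '%s\n' "$line" | sed 's/HELLO/top_sekrit/; s/.*top_sekrit//')
echo "$rest"
```

The second HELLO is left untouched, which a plain s/.*HELLO// would not achieve.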

User · Answer

Neither basic nor extended POSIX/GNU regex recognizes the non-greedy quantifier; you need a later regex. Fortunately, Perl regex for this context is pretty easy to get:

    perl -pe 's|(http://.*?/).*|\1|'

User · Answer

@Daniel H (concerning your comment on andcoz' answer, although from a long time ago): deleting trailing zeros works with

    s,([[:digit:]]\.[[:digit:]]*[1-9])[0]*$,\1,g

(extended syntax, so run it with sed -E or -r). It's about clearly defining the matching conditions.

User · Answer

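Another sed version:

    sed 's|/[[:alnum:]].*||' file.txt

It matches / followed by an alphanumeric character (so not another forward slash) as well as the rest of the characters till the end of the line. Afterwards it replaces the match with nothing (i.e. deletes it). Note that the character class has to be written [[:alnum:]], and that this suits lines without an http:// prefix: with the prefix, the first slash followed by an alphanumeric character is the second slash of //.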

User · Answer

    sed -E 's|(http://[^/]*/).*|\1|'

(The unescaped parentheses form a group only in extended syntax, hence -E.)

User · Answer

Here is something you can do with a two-step approach and awk (gensub requires GNU awk):

    A=http://www.suepearson.co.uk/product/174/71/3816/
    echo $A | awk '{ var = gensub(/\//, "||", 3, $0); sub(/\|\|.*/, "", var); print var }'

Output: http://www.suepearson.co.uk

Hope that helps!

User · Answer

I have not yet seen this answer, so here's how you can do this with vi or vim:

    vi -c '%s/\(http:\/\/.\{-}\/\).*/\1/ge | wq' file &>/dev/null

This runs the vi :%s substitution globally (the trailing g), refrains from raising an error if the pattern is not found (e), then saves the resulting changes to disk and quits. The &>/dev/null prevents the GUI from briefly flashing on screen, which can be annoying.

I like using vi sometimes for super complicated regexes, because (1) perl is dead/dying, (2) vim has a very advanced regex engine, and (3) I'm already intimately familiar with vi regexes in my day-to-day usage editing documents.

User · Answer

Because you specifically stated you're trying to use sed (instead of perl, cut, etc.), try grouping. This circumvents the non-greedy identifier potentially not being recognized. The first group is the protocol (i.e. 'http://', 'https://', 'tcp://', etc.). The second group is the domain:

    echo "http://www.suon.co.uk/product/1/7/3/" | sed "s|^\(.*//\)\([^/]*\).*$|\1\2|"
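A sketch of the grouping approach applied to several protocols at once. The helper name strip_path and the example.org and 10.0.0.1 addresses are made up for illustration.

```shell
# Hypothetical helper: keep "<protocol>//<host>" and drop the path.
# Caveat: the first group is greedy, so a later "//" inside the path would
# be treated as the end of the protocol part.
strip_path() {
  sed 's|^\(.*//\)\([^/]*\).*$|\1\2|'
}

hosts=$(printf '%s\n' \
  'http://www.suon.co.uk/product/1/7/3/' \
  'https://example.org/a/b' \
  'tcp://10.0.0.1:80/x' | strip_path)
echo "$hosts"
```

Because the protocol is matched by .*// rather than a literal http://, the same expression handles all three lines.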

User · Answer

With sed, I usually implement non-greedy search by searching for anything except the separator until the separator:

    echo "http://www.suon.co.uk/product/1/7/3/" | sed -n 's;\(http://[^/]*\)/.*;\1;p'

Output:

    http://www.suon.co.uk

This is:

- don't output: -n
- search, match pattern, replace and print: s/<pattern>/<replace>/p
- use ; as the search command separator instead of / to make it easier to type, so s;<pattern>;<replace>;p
- remember the match between brackets \( ... \), later accessible with \1, \2, ...
- match http://
- followed by anything in brackets []: [ab/] would mean either a or b or /
- a leading ^ in [] means not, so followed by anything but the things in the []
- so [^/] means anything except the / character
- * repeats the previous item, so [^/]* means any run of characters except /
- so far, sed -n 's;\(http://[^/]*\) means search for http:// followed by any characters except /, and remember what you've found
- we want to search until the end of the domain, so stop on the next /, so add another / at the end: sed -n 's;\(http://[^/]*\)/' but we want to match the rest of the line after the domain, so add .*
- now the match remembered in group 1 (\1) is the domain, so replace the matched line with what was saved in group \1 and print: sed -n 's;\(http://[^/]*\)/.*;\1;p'

If you want to include the slash after the domain as well, then move that slash inside the group to remember:

    echo "http://www.suon.co.uk/product/1/7/3/" | sed -n 's;\(http://[^/]*/\).*;\1;p'

Output:

    http://www.suon.co.uk/

User · Answer

This can be done using cut:

    echo "http://www.suepearson.co.uk/product/174/71/3816/" | cut -d'/' -f1-3

User · Answer

sed - non greedy matching, by Christoph Sieghart

The trick to get non greedy matching in sed is to match all characters excluding the one that terminates the match. I know, a no-brainer, but I wasted precious minutes on it, and shell scripts should be, after all, quick and easy. So in case somebody else might need it:

Greedy matching:

    % echo "<b>foo</b>bar" | sed 's/<.*>//g'
    bar

Non greedy matching:

    % echo "<b>foo</b>bar" | sed 's/<[^>]*>//g'
    foobar
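The two commands can be compared side by side in a short script. This is only a sketch; the variable names html, greedy and lazy are illustrative.

```shell
html='<b>foo</b>bar'

# Greedy: .* runs to the last ">" on the line, so "<b>foo</b>" is removed whole.
greedy=$(printf '%s\n' "$html" | sed 's/<.*>//g')

# Bounded: [^>]* stops at the first ">", so each tag is removed on its own.
lazy=$(printf '%s\n' "$html" | sed 's/<[^>]*>//g')

echo "greedy: $greedy"
echo "lazy:   $lazy"
```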
