[string] How to use sed/grep to extract text between two words?

I am trying to output a string that contains everything between two words of a string:

input:

"Here is a String"

output:

"is a"

Using:

sed -n '/Here/,/String/p'

includes the endpoints, but I don't want to include them.

This question is related to string bash sed grep

The answer is


sed -e 's/Here\(.*\)String/\1/'

Through GNU awk,

$ echo "Here is a string" | awk -v FS="(Here|string)" '{print $2}'
 is a 

grep with -P(perl-regexp) parameter supports \K, which helps in discarding the previously matched characters. In our case , the previously matched string was Here so it got discarded from the final output.

$ echo "Here is a string" | grep -oP 'Here\K.*(?=string)'
 is a 
$ echo "Here is a string" | grep -oP 'Here\K(?:(?!string).)*'
 is a 

If you want the output to be is a then you could try the below,

$ echo "Here is a string" | grep -oP 'Here\s*\K.*(?=\s+string)'
is a
$ echo "Here is a string" | grep -oP 'Here\s*\K(?:(?!\s+string).)*'
is a

The accepted answer does not remove text that could be before Here or after String. This will:

sed -e 's/.*Here\(.*\)String.*/\1/'

The main difference is the addition of .* immediately before Here and after String.


If you have a long file with many multi-line ocurrences, it is useful to first print number lines:

cat -n file | sed -n '/Here/,/String/p'

You can use \1 (refer to http://www.grymoire.com/Unix/Sed.html#uh-4):

echo "Hello is a String" | sed 's/Hello\(.*\)String/\1/g'

The contents that is inside the brackets will be stored as \1.


Problem. My stored Claws Mail messages are wrapped as follows, and I am trying to extract the Subject lines:

Subject: [SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular
 link in major cell growth pathway: Findings point to new potential
 therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is
 Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as
 a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway
 identified [Lysosomal amino acid transporter SLC38A9 signals arginine
 sufficiency to mTORC1]]
Message-ID: <[email protected]>

Per A2 in this thread, How to use sed/grep to extract text between two words? the first expression, below, "works" as long as the matched text does not contain a newline:

grep -o -P '(?<=Subject: ).*(?=molecular)' corpus/01

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key

However, despite trying numerous variants (.+?; /s; ...), I could not get these to work:

grep -o -P '(?<=Subject: ).*(?=link)' corpus/01
grep -o -P '(?<=Subject: ).*(?=therapeutic)' corpus/01
etc.

Solution 1.

Per Extract text between two strings on different lines

sed -n '/Subject: /{:a;N;/Message-ID:/!ba; s/\n/ /g; s/\s\s*/ /g; s/.*Subject: \|Message-ID:.*//g;p}' corpus/01

which gives

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]                              

Solution 2.*

Per How can I replace a newline (\n) using sed?

sed ':a;N;$!ba;s/\n/ /g' corpus/01

will replace newlines with a space.

Chaining that with A2 in How to use sed/grep to extract text between two words?, we get:

sed ':a;N;$!ba;s/\n/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'

which gives

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular  link in major cell growth pathway: Findings point to new potential  therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is  Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as  a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway  identified [Lysosomal amino acid transporter SLC38A9 signals arginine  sufficiency to mTORC1]] 

This variant removes double spaces:

sed ':a;N;$!ba;s/\n/ /g; s/\s\s*/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'

giving

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]

You can use two s commands

$ echo "Here is a String" | sed 's/.*Here//; s/String.*//'
 is a 

Also works

$ echo "Here is a StringHere is a String" | sed 's/.*Here//; s/String.*//'
 is a

$ echo "Here is a StringHere is a StringHere is a StringHere is a String" | sed 's/.*Here//; s/String.*//'
 is a 

To understand sed command, we have to build it step by step.

Here is your original text

user@linux:~$ echo "Here is a String"
Here is a String
user@linux:~$ 

Let's try to remove Here string with substition option in sed

user@linux:~$ echo "Here is a String" | sed 's/Here //'
is a String
user@linux:~$ 

At this point, I believe you would be able to remove String as well

user@linux:~$ echo "Here is a String" | sed 's/String//'
Here is a
user@linux:~$ 

But this is not your desired output.

To combine two sed commands, use -e option

user@linux:~$ echo "Here is a String" | sed -e 's/Here //' -e 's/String//'
is a
user@linux:~$ 

Hope this helps


All the above solutions have deficiencies where the last search string is repeated elsewhere in the string. I found it best to write a bash function.

    function str_str {
      local str
      str="${1#*${2}}"
      str="${str%%$3*}"
      echo -n "$str"
    }

    # test it ...
    mystr="this is a string"
    str_str "$mystr" "this " " string"

You can strip strings in Bash alone:

$ foo="Here is a String"
$ foo=${foo##*Here }
$ echo "$foo"
is a String
$ foo=${foo%% String*}
$ echo "$foo"
is a
$

And if you have a GNU grep that includes PCRE, you can use a zero-width assertion:

$ echo "Here is a String" | grep -Po '(?<=(Here )).*(?= String)'
is a

This might work for you (GNU sed):

sed '/Here/!d;s//&\n/;s/.*\n//;:a;/String/bb;$!{n;ba};:b;s//\n&/;P;D' file 

This presents each representation of text between two markers (in this instance Here and String) on a newline and preserves newlines within the text.


GNU grep can also support positive & negative look-ahead & look-back: For your case, the command would be:

echo "Here is a string" | grep -o -P '(?<=Here).*(?=string)'

If there are multiple occurrences of Here and string, you can choose whether you want to match from the first Here and last string or match them individually. In terms of regex, it is called as greedy match (first case) or non-greedy match (second case)

$ echo 'Here is a string, and Here is another string.' | grep -oP '(?<=Here).*(?=string)' # Greedy match
 is a string, and Here is another 
$ echo 'Here is a string, and Here is another string.' | grep -oP '(?<=Here).*?(?=string)' # Non-greedy match (Notice the '?' after '*' in .*)
 is a 
 is another 

Examples related to string

How to split a string in two and store it in a field String method cannot be found in a main class method Kotlin - How to correctly concatenate a String Replacing a character from a certain index Remove quotes from String in Python Detect whether a Python string is a number or a letter How does String substring work in Swift How does String.Index work in Swift swift 3.0 Data to String? How to parse JSON string in Typescript

Examples related to bash

Comparing a variable with a string python not working when redirecting from bash script Zipping a file in bash fails How do I prevent Conda from activating the base environment by default? Get first line of a shell command's output Fixing a systemd service 203/EXEC failure (no such file or directory) /bin/sh: apt-get: not found VSCode Change Default Terminal Run bash command on jenkins pipeline How to check if the docker engine and a docker container are running? How to switch Python versions in Terminal?

Examples related to sed

Retrieve last 100 lines logs How to replace multiple patterns at once with sed? Insert multiple lines into a file after specified pattern using shell script Linux bash script to extract IP address Ansible playbook shell output remove white space from the end of line in linux bash, extract string before a colon invalid command code ., despite escaping periods, using sed RE error: illegal byte sequence on Mac OS X How to use variables in a command in sed?

Examples related to grep

grep's at sign caught as whitespace cat, grep and cut - translated to python How to suppress binary file matching results in grep Linux find and grep command together Filtering JSON array using jQuery grep() Linux Script to check if process is running and act on the result grep without showing path/file:line How do you grep a file and get the next 5 lines How to grep, excluding some patterns? Fast way of finding lines in one file that are not in another?