[regex] RegEx: Grabbing values between quotation marks

I have a value like this:

"Foo Bar" "Another Value" something else

What regex will return the values enclosed in the quotation marks (e.g. Foo Bar and Another Value)?

This question is related to regex

The answer is


Peculiarly, none of these answers produces a regex where the returned match is the text inside the quotes, which is what was asked for. MA-Madden tries, but only gets the inside text as a captured group rather than as the whole match. One way to actually do it would be:

(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)

Examples for this can be seen in this demo https://regex101.com/r/Hbj8aP/1

The key here is the positive lookbehind at the start (the ?<=) and the positive lookahead at the end (the ?=). The lookbehind checks that the character just before the current position is a quote and, if so, the match starts from there; the lookahead checks that the next character is a quote and, if so, the match stops just before it. The quote in the lookbehind (the ["']) is wrapped in parentheses to capture whichever quote was found at the start; the closing lookahead (?=\1) then uses that group to make sure the match only stops at the corresponding quote.

The only other complication is that, because the lookahead doesn't actually consume the closing quote, the starting lookbehind finds it again, which causes text between a closing quote and the next opening quote on the same line to be matched. Putting a word boundary on the opening quote (["']\b) helps with this, though ideally I'd like to move past the lookahead, and I don't think that is possible. The part allowing escaped characters in the middle is taken directly from Adam's answer.
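As a rough illustration of the lookbehind/lookahead idea, here is a minimal Python sketch. It assumes Python's re module, which accepts this pattern because the lookbehind has a fixed width of one character; the helper name inner_values and the sample text are made up for the example.

```python
import re

# Pattern from the answer: the opening quote is captured inside the lookbehind,
# so the overall match (group 0) is only the text between the quotes.
INNER = re.compile(r'''(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)''')

def inner_values(text):
    """Return the whole match for each quoted value, not the capture groups."""
    return [m.group(0) for m in INNER.finditer(text)]

print(inner_values('"Foo Bar" "Another Value" something else'))
```

finditer is used here rather than findall because findall would return the capture groups (the quote and the optional backslash) instead of the whole match.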


import re

string = "\" foo bar\" \"loloo\""
print(re.findall(r'"(.*?)"', string))

Just try this out; it works like a charm!

The \ is the escape character: \" puts a literal quote inside the Python string.


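If you're trying to find only strings that are followed by a certain suffix, such as one attached with dot syntax, you can try this:

\"([^\"]*)\"\.localized

Where .localized is the suffix. The dot is escaped so that it matches a literal dot rather than any character.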

Example:

print("this is something I need to return".localized + "so is this".localized + "but this is not")

It will capture "this is something I need to return".localized and "so is this".localized but not "but this is not".
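A small Python sketch of the same idea, assuming you want the text inside the quotes rather than the whole match. The helper name quoted_with_suffix is hypothetical, and re.escape is used so the suffix is always treated literally.

```python
import re

def quoted_with_suffix(text, suffix):
    """Hypothetical helper: values of double-quoted strings directly followed by suffix."""
    # re.escape keeps the dot in ".localized" literal instead of "any character".
    pattern = r'"([^"]*)"' + re.escape(suffix)
    return re.findall(pattern, text)

line = 'print("this is something I need to return".localized + "so is this".localized + "but this is not")'
print(quoted_with_suffix(line, ".localized"))
```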


A very late answer, but I'd still like to contribute:

(\"[\w\s]+\")

http://regex101.com/r/cB0kB8/1


I liked Axeman's more expansive version, but had some trouble with it. For example, it didn't match

foo "string \\ string" bar

or

foo "string1"   bar   "string2"

correctly, so I tried to fix it:

# opening quote
(["'])
   (
     # repeat (non-greedy, so we don't span multiple strings)
     (?:
       # anything except the opening quote and a backslash,
       # which are handled separately.
       (?!\1)[^\\]
       |
       # consume any double backslash (unnecessary?)
       (?:\\\\)*       
       |
       # Allow backslash to escape characters
       \\.
     )*?
   )
# same character as opening quote
\1

The regex in the accepted answer returns the values including their surrounding quotation marks: "Foo Bar" and "Another Value" as matches.

Here are regexes that return only the values between the quotation marks (as the questioner asked for):

Double quotes only (use value of capture group #1):

"(.*?[^\\])"

Single quotes only (use value of capture group #1):

'(.*?[^\\])'

Both (use value of capture group #2):

(["'])(.*?[^\\])\1

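All of them handle backslash-escaped quotes and quotes of the other kind nested inside the value. Because [^\\] has to consume one character, an empty value ("") is not returned as an empty match.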
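To show how the capture group is read, here is a minimal Python sketch using the "both quotes" pattern. The helper name values_between_quotes and the sample text are assumptions made for the example.

```python
import re

# Group 1 is the quote character, group 2 is the value between the quotes.
BOTH = re.compile(r'''(["'])(.*?[^\\])\1''')

def values_between_quotes(text):
    return [m.group(2) for m in BOTH.finditer(text)]

sample = r'''"Foo Bar" 'Another Value' "say \"hi\""'''
print(values_between_quotes(sample))
```

The escaped quotes stay in the value with their backslashes; unescaping them is a separate step.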


I would go for:

"([^"]*)"

The [^"] is regex for any character except '"'.
I prefer this over the non-greedy quantifier (*?) because I have to keep looking that one up to make sure I get it right.
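For comparison, a short Python sketch of the negated character class next to a plain greedy .* (the sample text is the one from the question):

```python
import re

text = '"Foo Bar" "Another Value" something else'

# [^"]* cannot run past a closing quote, so no non-greedy quantifier is needed.
values = re.findall(r'"([^"]*)"', text)

# A greedy .* runs to the last quote on the line and glues both values together.
greedy = re.findall(r'"(.*)"', text)

print(values)
print(greedy)
```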


Unlike Adam's answer, mine is simple but works:

(["'])(?:\\\1|.)*?\1

Just add parentheses if you want to capture the content inside the quotes, like this:

(["'])((?:\\\1|.)*?)\1

Then $1 holds the quote character and $2 holds the content string.
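In Python the two groups can be read like this. This is only a sketch: the sample text is made up, and the names QUOTED and pairs are just for illustration.

```python
import re

# Group 1: the quote character. Group 2: the content, where \' or \" (matching
# the opening quote) is allowed inside.
QUOTED = re.compile(r'''(["'])((?:\\\1|.)*?)\1''')

sample = r"""'It\'s fine' and "plain" text"""
pairs = [(m.group(1), m.group(2)) for m in QUOTED.finditer(sample)]
print(pairs)
```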


This one worked for me:

|([\'"])(.*?)\1|i

I used it in a statement like this one:

preg_match_all('|([\'"])(.*?)\1|i', $cont, $matches);

and it worked great.


In general, the following regular expression fragment is what you are looking for:

"(.*?)"

This uses the non-greedy *? operator to capture everything up to but not including the next double quote. Then, you use a language-specific mechanism to extract the matched text.

In Python, you could do:

>>> import re
>>> string = '"Foo Bar" "Another Value"'
>>> print(re.findall(r'"(.*?)"', string))
['Foo Bar', 'Another Value']

Building on Greg H.'s answer, I was able to create this regex to suit my needs.

I needed to match a specific value that was qualified by being inside quotes. It must be a full match; a partial match should not trigger a hit,

e.g. "test" must not match "test2".

import re
reg = r"""(['"])(%s)\1"""
# re.escape keeps any regex metacharacters in needle literal
if re.search(reg % re.escape(needle), haystack, re.IGNORECASE):
    print("winning...")

Hunter


All the answers above are good, except that they do not support all Unicode characters in ECMAScript (JavaScript).

If you are a Node user, you might want this modified version of the accepted answer, which supports all Unicode characters:

/(?<=((?<=[\s,.:;"']|^)["']))(?:(?=(\\?))\2.)*?(?=\1)/gmu

Try here.


MORE ANSWERS! Here is the solution I used:

\"([^\"]*?icon[^\"]*?)\"

TL;DR:
replace the word icon with what you're looking for inside the quotes, and voila!


The way this works is that it looks for the keyword and doesn't care what else is between the quotes. E.g.:
id="fb-icon"
id="icon-close"
id="large-icon-close"
The regex looks for a quote mark ",
then for any run of characters that are not ",
until it finds icon,
followed by any run of characters that are not ",
and finally for a closing ".
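The steps above could be sketched in Python roughly like this. The helper name quoted_containing is hypothetical, and the keyword is passed through re.escape on the assumption that it should be matched literally.

```python
import re

def quoted_containing(keyword, text):
    """Hypothetical helper: double-quoted values that contain keyword."""
    # Same shape as the pattern above, with the keyword spliced in.
    pattern = r'"([^"]*?' + re.escape(keyword) + r'[^"]*?)"'
    return re.findall(pattern, text)

html = 'id="fb-icon" id="icon-close" id="large-icon-close" id="other"'
print(quoted_containing("icon", html))
```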


Let's see two efficient ways to deal with escaped quotes. These patterns are not designed to be concise or pretty, but to be efficient.

Both use first-character discrimination to find quotes in the string quickly, without the cost of an alternation. (The idea is to quickly discard characters that are not quotes without testing both branches of the alternation.)

Content between quotes is described with an unrolled loop (instead of a repeated alternation) to be more efficient too: [^"\\]*(?:\\.[^"\\]*)*

To deal with strings whose quotes are not balanced, you can use possessive quantifiers instead: [^"\\]*+(?:\\.[^"\\]*)*+, or a workaround that emulates them, to prevent excessive backtracking. You can also decide that a quoted part runs from an opening quote to the next (non-escaped) quote or to the end of the string. In that case there is no need for possessive quantifiers; you only need to make the last quote optional.

Notice: sometimes quotes are not escaped with a backslash but by repeating the quote. In this case the content subpattern looks like this: [^"]*(?:""[^"]*)*
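A minimal Python sketch of the two content subpatterns described above, for double quotes only. The capture group around the content, the re.DOTALL flag and the sample strings are additions made for the example, not part of the original patterns.

```python
import re

# Unrolled loop: a run of ordinary characters, then (escape + another run) repeated.
# re.DOTALL lets the dot after the backslash also match a newline.
BACKSLASH_ESCAPED = re.compile(r'"([^"\\]*(?:\\.[^"\\]*)*)"', re.DOTALL)

# Same shape when a quote is escaped by doubling it instead of using a backslash.
DOUBLED = re.compile(r'"([^"]*(?:""[^"]*)*)"')

sample = r'say "a \"quoted\" word" and "plain"'
doubled_sample = 'x "say ""hi"" now" y'

backslash_values = BACKSLASH_ESCAPED.findall(sample)
doubled_values = DOUBLED.findall(doubled_sample)
print(backslash_values)
print(doubled_values)
```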

The patterns avoid a capture group and a backreference (I mean something like (["']).....\1) and use a simple alternation instead, with ["'] factored out at the beginning.

Perl like:

["'](?:(?<=")[^"\\]*(?s:\\.[^"\\]*)*"|(?<=')[^'\\]*(?s:\\.[^'\\]*)*')

(Note that (?s:...) is syntactic sugar that switches on dotall/singleline mode inside the non-capturing group. If this syntax is not supported, you can switch the mode on for the whole pattern or replace the dot with [\s\S].)

(The way this pattern is written is entirely "hand-driven" and doesn't take possible internal engine optimizations into account.)

ECMA script:

(?=["'])(?:"[^"\\]*(?:\\[\s\S][^"\\]*)*"|'[^'\\]*(?:\\[\s\S][^'\\]*)*')

POSIX extended:

"[^"\\]*(\\(.|\n)[^"\\]*)*"|'[^'\\]*(\\(.|\n)[^'\\]*)*'

or simply:

"([^"\\]|\\.|\\\n)*"|'([^'\\]|\\.|\\\n)*'

A supplementary answer for Microsoft VBA coders only: one uses the library Microsoft VBScript Regular Expressions 5.5, which gives the following code:

Sub TestRegularExpression()

    Dim oRE As VBScript_RegExp_55.RegExp    '* Tools->References: Microsoft VBScript Regular Expressions 5.5
    Set oRE = New VBScript_RegExp_55.RegExp

    oRE.Pattern = """([^""]*)"""


    oRE.Global = True

    Dim sTest As String
    sTest = """Foo Bar"" ""Another Value"" something else"

    Debug.Assert oRE.test(sTest)

    Dim oMatchCol As VBScript_RegExp_55.MatchCollection
    Set oMatchCol = oRE.Execute(sTest)
    Debug.Assert oMatchCol.Count = 2

    Dim oMatch As Match
    For Each oMatch In oMatchCol
        Debug.Print oMatch.SubMatches(0)

    Next oMatch

End Sub

This version

  • accounts for escaped quotes
  • controls backtracking

    /(["'])((?:(?!\1)[^\\]|(?:\\\\)*\\[^\\])*)\1/
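A quick way to try this pattern from Python, as a sketch only: the JavaScript-style delimiters are dropped, and the sample text and the names ESCAPED and contents are made up for the example.

```python
import re

# Group 1: the quote character. Group 2: the content, where a backslash
# escapes the character that follows it.
ESCAPED = re.compile(r'''(["'])((?:(?!\1)[^\\]|(?:\\\\)*\\[^\\])*)\1''')

sample = r'''"a \"b\" c" then 'x' end'''
contents = [m.group(2) for m in ESCAPED.finditer(sample)]
print(contents)
```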
    

echo 'junk "Foo Bar" not empty one "" this "but this" and this neither' | sed 's/[^\"]*\"\([^\"]*\)\"[^\"]*/>\1</g'

This will result in: >Foo Bar<><>but this<

Here I show each extracted string between > and < for clarity. Since sed has no non-greedy quantifier, the command uses [^"]* instead: it matches the junk before and after each pair of quotes together with the quoted part, then replaces the whole match with the part between the quotes, surrounded by > and <.
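For readers without sed at hand, roughly the same substitution can be written with Python's re.sub. This is a sketch of an equivalent, not the original command; the variable names are made up.

```python
import re

line = 'junk "Foo Bar" not empty one "" this "but this" and this neither'

# Same idea as the sed command: junk + "value" + junk is replaced by >value<.
result = re.sub(r'[^"]*"([^"]*)"[^"]*', r'>\1<', line)
print(result)
```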


The pattern (["'])(?:(?=(\\?))\2.)*?\1 above does the job, but I am concerned about its performance (it's not bad, but it could be better). Mine below is ~20% faster.

The pattern "(.*?)" is just incomplete. My advice for everyone reading this is just DON'T USE IT!!!

For instance it cannot capture many strings (if needed I can provide an exhaustive test-case) like the one below:

$string = 'How are you? I\'m fine, thank you';

The rest of them are just as "good" as the one above.

If you really care both about performance and precision then start with the one below:

/(['"])((\\\1|.)*?)\1/gm

In my tests it covered every string I met but if you find something that doesn't work I would gladly update it for you.

Check my pattern in an online regex tester.
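To make the difference concrete, here is a small Python sketch comparing the non-greedy pattern (with single quotes) against the pattern above on the example string. It assumes the string appears in the input with its quotes and backslash intact; the variable names are made up.

```python
import re

# The example string as it would appear in source text, quotes and backslash included.
text = r"""'How are you? I\'m fine, thank you'"""

# Non-greedy version: stops at the first quote, even though it is escaped.
naive = re.search(r"'(.*?)'", text).group(1)

# Pattern from this answer: group 2 is the content, \' is allowed inside.
robust = re.search(r'''(['"])((\\\1|.)*?)\1''', text).group(2)

print(naive)
print(robust)
```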


I liked Eugen Mihailescu's solution to match the content between quotes whilst allowing to escape quotes. However, I discovered some problems with escaping and came up with the following regex to fix them:

(['"])(?:(?!\1|\\).|\\.)*\1

It does the trick and is still pretty simple and easy to maintain.

Demo (with some more test-cases; feel free to use it and expand on it).


PS: If you just want the content between quotes in the full match ($0), and are not afraid of the performance penalty use:

(?<=(['"])\b)(?:(?!\1|\\).|\\.)*(?=\1)

Unfortunately, without the quotes as anchors, I had to add a boundary \b which does not play well with spaces and non-word boundary characters after the starting quote.

Alternatively, modify the initial version by simply adding a group and extract the string from $2:

(['"])((?:(?!\1|\\).|\\.)*)\1

PPS: If your focus is solely on efficiency, go with Casimir et Hippolyte's solution; it's a good one.
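A short Python sketch of the grouped version, using an input where the value ends in an escaped backslash. The sample text and the names CONTENT and contents are assumptions made for the example.

```python
import re

# Group 1: the quote character. Group 2: the content. Any backslash consumes
# the next character, so an escaped backslash cannot escape the closing quote.
CONTENT = re.compile(r'''(['"])((?:(?!\1|\\).|\\.)*)\1''')

sample = r'"a\\" "b"'
contents = [m.group(2) for m in CONTENT.finditer(sample)]
print(contents)
```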

