How do I make a python regex like "(.*)" such that, given "a (b) c (d) e", python matches "b" instead of "b) c (d"?
I know that I can use "[^)]" instead of ".", but I'm looking for a more general solution that keeps my regex a little cleaner. Is there any way to tell python "hey, match this as soon as possible"?
As the others have said, using the ? modifier on the * quantifier will solve your immediate problem. But be careful: you are starting to stray into areas where regexes stop working and you need a parser instead. For instance, the string "(foo (bar)) baz" will cause you problems, because the non-greedy match stops at the first closing parenthesis rather than the one that balances the opening one.
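To make the nesting problem concrete, here is a minimal sketch of what "a parser instead" could look like for this one case. The helper name top_level_groups is hypothetical (it is not part of the re module), and the sketch assumes you only want the contents of the outermost groups and are happy to ignore unmatched parentheses.

```python
import re


def top_level_groups(text):
    """Return the contents of each outermost parenthesized group in text."""
    groups = []
    depth = 0      # how many "(" are currently open
    start = None   # index just after the outermost "("
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i + 1
            depth += 1
        elif ch == ")":
            if depth == 0:
                continue  # unmatched ")": ignored in this sketch
            depth -= 1
            if depth == 0:
                groups.append(text[start:i])
    return groups


nested = "(foo (bar)) baz"

# The non-greedy regex stops at the first ")" and cuts the group short.
lazy_match = re.search(r"\((.*?)\)", nested).group(1)
print(lazy_match)                # foo (bar

# Counting depth keeps the balanced group together.
print(top_level_groups(nested))  # ['foo (bar)']
```

On input without nesting, such as "a (b) c (d) e", the helper returns ['b', 'd'], the same groups the non-greedy regex would find.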
Would not \\(.*?\\) work? That is the non-greedy syntax.
Using an ungreedy match is a good start, but I'd also suggest that you reconsider any use of .* at all. What about this?
groups = re.search(r"\([^)]*\)", x)
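As a rough comparison of the two approaches mentioned so far, the sketch below runs the negated character class and the non-greedy quantifier side by side. The variable names and the multi-line sample string are made up for illustration. On the question's input both give the same result; they differ once a newline appears inside the parentheses, because "." does not match a newline unless re.DOTALL is set.

```python
import re

x = "a (b) c (d) e"

# Negated character class: can never run past a closing parenthesis.
negated = re.findall(r"\(([^)]*)\)", x)
# Non-greedy quantifier: stops at the first closing parenthesis it can.
lazy = re.findall(r"\((.*?)\)", x)
print(negated)  # ['b', 'd']
print(lazy)     # ['b', 'd']

# A group that spans two lines (hypothetical input).
multiline = "a (b\nc) d"
negated_ml = re.findall(r"\(([^)]*)\)", multiline)
lazy_ml = re.findall(r"\((.*?)\)", multiline)
lazy_dotall = re.findall(r"\((.*?)\)", multiline, re.DOTALL)
print(negated_ml)   # ['b\nc']
print(lazy_ml)      # []
print(lazy_dotall)  # ['b\nc']
```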
Do you want it to match "(b)"? Do as Zitrax and Paolo have suggested. Do you want it to match "b"? Do
>>> import re
>>> x = "a (b) c (d) e"
>>> re.search(r"\((.*?)\)", x).group(1)
'b'
To start with, I do not suggest using "*" in regexes. Yes, I know, it is the most used repetition operator, but it is nevertheless a bad idea. This is because, while it does match any amount of repetition for that character, "any" includes 0, which is usually something you want to throw a syntax error for, not accept. Instead, I suggest using the + sign, which matches one or more repetitions. What's more, from what I can see, you are dealing with fixed-length parenthesized expressions. As a result, you can probably use the {x,y} syntax (written without a space in Python) to specify the desired length.
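A small sketch of how these quantifiers differ in practice. The sample strings and variable names are invented for illustration, and the {1,3} bounds are arbitrary.

```python
import re

# "*" accepts zero repetitions, so empty parentheses still match.
star = re.search(r"\((.*?)\)", "a () b")
print(repr(star.group(1)))  # ''

# "+" needs at least one character between the parentheses, so "()" fails.
plus = re.search(r"\((.+?)\)", "a () b")
print(plus)  # None

# {1,3} limits the group to between one and three characters.
bounded = re.findall(r"\((.{1,3}?)\)", "(ab) (abcd)")
print(bounded)  # ['ab']

# With a space inside the braces, Python treats them as literal text,
# so this pattern is not a repetition and does not match "aa".
spaced = re.search(r"a{1, 2}", "aa")
print(spaced)  # None
```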
However, if you really do need non-greedy repetition, I suggest consulting the all-powerful ?. This, when placed at the end of any regex repetition specifier, will force that part of the regex to match the least amount of text possible.
That being said, I would be very careful with the ?, as it, like the Sonic Screwdriver in Doctor Who, has a tendency to do, how should I put it, "slightly" undesired things if not carefully calibrated. For example, on a nested input such as "((1))", it would identify "((1)" (note the lack of a second rparen) as a match.
>>> import re
>>> x = "a (b) c (d) e"
>>> re.search(r"\(.*\)", x).group()
'(b) c (d)'
>>> re.search(r"\(.*?\)", x).group()
'(b)'
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behavior isn't desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.
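The quoted passage can be checked directly. This is a short sketch using the same '<H1>title</H1>' string; the variable names are only for illustration.

```python
import re

html = "<H1>title</H1>"

# Greedy: ".*" grabs as much as it can, so the match runs to the last ">".
greedy = re.match(r"<.*>", html).group()
print(greedy)  # <H1>title</H1>

# Non-greedy: ".*?" stops at the first ">" that lets the pattern succeed.
lazy = re.match(r"<.*?>", html).group()
print(lazy)    # <H1>

# The same "?" suffix works on "+", here finding each tag separately.
tags = re.findall(r"<.+?>", html)
print(tags)    # ['<H1>', '</H1>']
```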
Source: Stackoverflow.com