What is the difference between the search()
and match()
functions in the Python re
module?
I've read the documentation (current documentation), but I never seem to remember it. I keep having to look it up and re-learn it. I'm hoping that someone will answer it clearly with examples so that (perhaps) it will stick in my head. Or at least I'll have a better place to return with my question and it will take less time to re-learn it.
match is much faster than search, so instead of doing regex.search("word") you can do regex.match((.*?)word(.*?)) and gain tons of performance if you are working with millions of samples.
This comment from @ivan_bilan under the accepted answer above got me thinking if such hack is actually speeding anything up, so let's find out how many tons of performance you will really gain.
I prepared the following test suite:
import random
import re
import string
import time
LENGTH = 10
LIST_SIZE = 1000000
def generate_word():
word = [random.choice(string.ascii_lowercase) for _ in range(LENGTH)]
word = ''.join(word)
return word
wordlist = [generate_word() for _ in range(LIST_SIZE)]
start = time.time()
[re.search('python', word) for word in wordlist]
print('search:', time.time() - start)
start = time.time()
[re.match('(.*?)python(.*?)', word) for word in wordlist]
print('match:', time.time() - start)
I made 10 measurements (1M, 2M, ..., 10M words) which gave me the following plot:
The resulting lines are surprisingly (actually not that surprisingly) straight. And the search
function is (slightly) faster given this specific pattern combination. The moral of this test: Avoid overoptimizing your code.
The difference is, re.match()
misleads anyone accustomed to Perl, grep, or sed regular expression matching, and re.search()
does not. :-)
More soberly, As John D. Cook remarks, re.match()
"behaves as if every pattern has ^ prepended." In other words, re.match('pattern')
equals re.search('^pattern')
. So it anchors a pattern's left side. But it also doesn't anchor a pattern's right side: that still requires a terminating $
.
Frankly given the above, I think re.match()
should be deprecated. I would be interested to know reasons it should be retained.
You can refer the below example to understand the working of re.match
and re.search
a = "123abc"
t = re.match("[a-z]+",a)
t = re.search("[a-z]+",a)
re.match
will return none
, but re.search
will return abc
.
re.match attempts to match a pattern at the beginning of the string. re.search attempts to match the pattern throughout the string until it finds a match.
re.search
searches for the pattern throughout the string, whereas re.match
does not search the pattern; if it does not, it has no other choice than to match it at start of the string.
search
⇒ find something anywhere in the string and return a match object.
match
⇒ find something at the beginning of the string and return a match object.
re.match attempts to match a pattern at the beginning of the string. re.search attempts to match the pattern throughout the string until it finds a match.
The difference is, re.match()
misleads anyone accustomed to Perl, grep, or sed regular expression matching, and re.search()
does not. :-)
More soberly, As John D. Cook remarks, re.match()
"behaves as if every pattern has ^ prepended." In other words, re.match('pattern')
equals re.search('^pattern')
. So it anchors a pattern's left side. But it also doesn't anchor a pattern's right side: that still requires a terminating $
.
Frankly given the above, I think re.match()
should be deprecated. I would be interested to know reasons it should be retained.
re.search
searches for the pattern throughout the string, whereas re.match
does not search the pattern; if it does not, it has no other choice than to match it at start of the string.
Much shorter:
search
scans through the whole string.
match
scans only the beginning of the string.
Following Ex says it:
>>> a = "123abc"
>>> re.match("[a-z]+",a)
None
>>> re.search("[a-z]+",a)
abc
search
⇒ find something anywhere in the string and return a match object.
match
⇒ find something at the beginning of the string and return a match object.
re.match attempts to match a pattern at the beginning of the string. re.search attempts to match the pattern throughout the string until it finds a match.
You can refer the below example to understand the working of re.match
and re.search
a = "123abc"
t = re.match("[a-z]+",a)
t = re.search("[a-z]+",a)
re.match
will return none
, but re.search
will return abc
.
match is much faster than search, so instead of doing regex.search("word") you can do regex.match((.*?)word(.*?)) and gain tons of performance if you are working with millions of samples.
This comment from @ivan_bilan under the accepted answer above got me thinking if such hack is actually speeding anything up, so let's find out how many tons of performance you will really gain.
I prepared the following test suite:
import random
import re
import string
import time
LENGTH = 10
LIST_SIZE = 1000000
def generate_word():
word = [random.choice(string.ascii_lowercase) for _ in range(LENGTH)]
word = ''.join(word)
return word
wordlist = [generate_word() for _ in range(LIST_SIZE)]
start = time.time()
[re.search('python', word) for word in wordlist]
print('search:', time.time() - start)
start = time.time()
[re.match('(.*?)python(.*?)', word) for word in wordlist]
print('match:', time.time() - start)
I made 10 measurements (1M, 2M, ..., 10M words) which gave me the following plot:
The resulting lines are surprisingly (actually not that surprisingly) straight. And the search
function is (slightly) faster given this specific pattern combination. The moral of this test: Avoid overoptimizing your code.
re.match attempts to match a pattern at the beginning of the string. re.search attempts to match the pattern throughout the string until it finds a match.
Source: Stackoverflow.com