[python] Is it worth using Python's re.compile?

Is there any benefit in using compile for regular expressions in Python?

h = re.compile('hello')
h.match('hello world')

vs

re.match('hello', 'hello world')

This question is related to python regex

The answer is


Regular Expressions are compiled before being used when using the second version. If you are going to executing it many times it is definatly better to compile it first. If not compiling every time you match for one off's is fine.


As an alternative answer, as I see that it hasn't been mentioned before, I'll go ahead and quote the Python 3 docs:

Should you use these module-level functions, or should you get the pattern and call its methods yourself? If you’re accessing a regex within a loop, pre-compiling it will save a few function calls. Outside of loops, there’s not much difference thanks to the internal cache.


I ran this test before stumbling upon the discussion here. However, having run it I thought I'd at least post my results.

I stole and bastardized the example in Jeff Friedl's "Mastering Regular Expressions". This is on a macbook running OSX 10.6 (2Ghz intel core 2 duo, 4GB ram). Python version is 2.6.1.

Run 1 - using re.compile

import re 
import time 
import fpformat
Regex1 = re.compile('^(a|b|c|d|e|f|g)+$') 
Regex2 = re.compile('^[a-g]+$')
TimesToDo = 1000
TestString = "" 
for i in range(1000):
    TestString += "abababdedfg"
StartTime = time.time() 
for i in range(TimesToDo):
    Regex1.search(TestString) 
Seconds = time.time() - StartTime 
print "Alternation takes " + fpformat.fix(Seconds,3) + " seconds"

StartTime = time.time() 
for i in range(TimesToDo):
    Regex2.search(TestString) 
Seconds = time.time() - StartTime 
print "Character Class takes " + fpformat.fix(Seconds,3) + " seconds"

Alternation takes 2.299 seconds
Character Class takes 0.107 seconds

Run 2 - Not using re.compile

import re 
import time 
import fpformat

TimesToDo = 1000
TestString = "" 
for i in range(1000):
    TestString += "abababdedfg"
StartTime = time.time() 
for i in range(TimesToDo):
    re.search('^(a|b|c|d|e|f|g)+$',TestString) 
Seconds = time.time() - StartTime 
print "Alternation takes " + fpformat.fix(Seconds,3) + " seconds"

StartTime = time.time() 
for i in range(TimesToDo):
    re.search('^[a-g]+$',TestString) 
Seconds = time.time() - StartTime 
print "Character Class takes " + fpformat.fix(Seconds,3) + " seconds"

Alternation takes 2.508 seconds
Character Class takes 0.109 seconds

Performance difference aside, using re.compile and using the compiled regular expression object to do match (whatever regular expression related operations) makes the semantics clearer to Python run-time.

I had some painful experience of debugging some simple code:

compare = lambda s, p: re.match(p, s)

and later I'd use compare in

[x for x in data if compare(patternPhrases, x[columnIndex])]

where patternPhrases is supposed to be a variable containing regular expression string, x[columnIndex] is a variable containing string.

I had trouble that patternPhrases did not match some expected string!

But if I used the re.compile form:

compare = lambda s, p: p.match(s)

then in

[x for x in data if compare(patternPhrases, x[columnIndex])]

Python would have complained that "string does not have attribute of match", as by positional argument mapping in compare, x[columnIndex] is used as regular expression!, when I actually meant

compare = lambda p, s: p.match(s)

In my case, using re.compile is more explicit of the purpose of regular expression, when it's value is hidden to naked eyes, thus I could get more help from Python run-time checking.

So the moral of my lesson is that when the regular expression is not just literal string, then I should use re.compile to let Python to help me to assert my assumption.


My understanding is that those two examples are effectively equivalent. The only difference is that in the first, you can reuse the compiled regular expression elsewhere without causing it to be compiled again.

Here's a reference for you: http://diveintopython3.ep.io/refactoring.html

Calling the compiled pattern object's search function with the string 'M' accomplishes the same thing as calling re.search with both the regular expression and the string 'M'. Only much, much faster. (In fact, the re.search function simply compiles the regular expression and calls the resulting pattern object's search method for you.)


I agree with Honest Abe that the match(...) in the given examples are different. They are not a one-to-one comparisons and thus, outcomes are vary. To simplify my reply, I use A, B, C, D for those functions in question. Oh yes, we are dealing with 4 functions in re.py instead of 3.

Running this piece of code:

h = re.compile('hello')                   # (A)
h.match('hello world')                    # (B)

is same as running this code:

re.match('hello', 'hello world')          # (C)

Because, when looked into the source re.py, (A + B) means:

h = re._compile('hello')                  # (D)
h.match('hello world')

and (C) is actually:

re._compile('hello').match('hello world')

So, (C) is not the same as (B). In fact, (C) calls (B) after calling (D) which is also called by (A). In other words, (C) = (A) + (B). Therefore, comparing (A + B) inside a loop has same result as (C) inside a loop.

George's regexTest.py proved this for us.

noncompiled took 4.555 seconds.           # (C) in a loop
compiledInLoop took 4.620 seconds.        # (A + B) in a loop
compiled took 2.323 seconds.              # (A) once + (B) in a loop

Everyone's interest is, how to get the result of 2.323 seconds. In order to make sure compile(...) only get called once, we need to store the compiled regex object in memory. If we are using a class, we could store the object and reuse when every time our function get called.

class Foo:
    regex = re.compile('hello')
    def my_function(text)
        return regex.match(text)

If we are not using class (which is my request today), then I have no comment. I'm still learning to use global variable in Python, and I know global variable is a bad thing.

One more point, I believe that using (A) + (B) approach has an upper hand. Here are some facts as I observed (please correct me if I'm wrong):

  1. Calls A once, it will do one search in the _cache followed by one sre_compile.compile() to create a regex object. Calls A twice, it will do two searches and one compile (because the regex object is cached).

  2. If the _cache get flushed in between, then the regex object is released from memory and Python need to compile again. (someone suggest that Python won't recompile.)

  3. If we keep the regex object by using (A), the regex object will still get into _cache and get flushed somehow. But our code keep a reference on it and the regex object will not be released from memory. Those, Python need not to compile again.

  4. The 2 seconds differences in George's test compiledInLoop vs compiled is mainly the time required to build the key and search the _cache. It doesn't mean the compile time of regex.

  5. George's reallycompile test show what happen if it really re-do the compile every time: it will be 100x slower (he reduced the loop from 1,000,000 to 10,000).

Here are the only cases that (A + B) is better than (C):

  1. If we can cache a reference of the regex object inside a class.
  2. If we need to calls (B) repeatedly (inside a loop or multiple times), we must cache the reference to regex object outside the loop.

Case that (C) is good enough:

  1. We cannot cache a reference.
  2. We only use it once in a while.
  3. In overall, we don't have too many regex (assume the compiled one never get flushed)

Just a recap, here are the A B C:

h = re.compile('hello')                   # (A)
h.match('hello world')                    # (B)
re.match('hello', 'hello world')          # (C)

Thanks for reading.


For me, the biggest benefit to re.compile is being able to separate definition of the regex from its use.

Even a simple expression such as 0|[1-9][0-9]* (integer in base 10 without leading zeros) can be complex enough that you'd rather not have to retype it, check if you made any typos, and later have to recheck if there are typos when you start debugging. Plus, it's nicer to use a variable name such as num or num_b10 than 0|[1-9][0-9]*.

It's certainly possible to store strings and pass them to re.match; however, that's less readable:

num = "..."
# then, much later:
m = re.match(num, input)

Versus compiling:

num = re.compile("...")
# then, much later:
m = num.match(input)

Though it is fairly close, the last line of the second feels more natural and simpler when used repeatedly.


This answer might be arriving late but is an interesting find. Using compile can really save you time if you are planning on using the regex multiple times (this is also mentioned in the docs). Below you can see that using a compiled regex is the fastest when the match method is directly called on it. passing a compiled regex to re.match makes it even slower and passing re.match with the patter string is somewhere in the middle.

>>> ipr = r'\D+((([0-2][0-5]?[0-5]?)\.){3}([0-2][0-5]?[0-5]?))\D+'
>>> average(*timeit.repeat("re.match(ipr, 'abcd100.10.255.255 ')", globals={'ipr': ipr, 're': re}))
1.5077415757028423
>>> ipr = re.compile(ipr)
>>> average(*timeit.repeat("re.match(ipr, 'abcd100.10.255.255 ')", globals={'ipr': ipr, 're': re}))
1.8324008992184038
>>> average(*timeit.repeat("ipr.match('abcd100.10.255.255 ')", globals={'ipr': ipr, 're': re}))
0.9187896518778871

I've had a lot of experience running a compiled regex 1000s of times versus compiling on-the-fly, and have not noticed any perceivable difference

The votes on the accepted answer leads to the assumption that what @Triptych says is true for all cases. This is not necessarily true. One big difference is when you have to decide whether to accept a regex string or a compiled regex object as a parameter to a function:

>>> timeit.timeit(setup="""
... import re
... f=lambda x, y: x.match(y)       # accepts compiled regex as parameter
... h=re.compile('hello')
... """, stmt="f(h, 'hello world')")
0.32881879806518555
>>> timeit.timeit(setup="""
... import re
... f=lambda x, y: re.compile(x).match(y)   # compiles when called
... """, stmt="f('hello', 'hello world')")
0.809190034866333

It is always better to compile your regexs in case you need to reuse them.

Note the example in the timeit above simulates creation of a compiled regex object once at import time versus "on-the-fly" when required for a match.


i'd like to motivate that pre-compiling is both conceptually and 'literately' (as in 'literate programming') advantageous. have a look at this code snippet:

from re import compile as _Re

class TYPO:

  def text_has_foobar( self, text ):
    return self._text_has_foobar_re_search( text ) is not None
  _text_has_foobar_re_search = _Re( r"""(?i)foobar""" ).search

TYPO = TYPO()

in your application, you'd write:

from TYPO import TYPO
print( TYPO.text_has_foobar( 'FOObar ) )

this is about as simple in terms of functionality as it can get. because this is example is so short, i conflated the way to get _text_has_foobar_re_search all in one line. the disadvantage of this code is that it occupies a little memory for whatever the lifetime of the TYPO library object is; the advantage is that when doing a foobar search, you'll get away with two function calls and two class dictionary lookups. how many regexes are cached by re and the overhead of that cache are irrelevant here.

compare this with the more usual style, below:

import re

class Typo:

  def text_has_foobar( self, text ):
    return re.compile( r"""(?i)foobar""" ).search( text ) is not None

In the application:

typo = Typo()
print( typo.text_has_foobar( 'FOObar ) )

I readily admit that my style is highly unusual for python, maybe even debatable. however, in the example that more closely matches how python is mostly used, in order to do a single match, we must instantiate an object, do three instance dictionary lookups, and perform three function calls; additionally, we might get into re caching troubles when using more than 100 regexes. also, the regular expression gets hidden inside the method body, which most of the time is not such a good idea.

be it said that every subset of measures---targeted, aliased import statements; aliased methods where applicable; reduction of function calls and object dictionary lookups---can help reduce computational and conceptual complexity.


This answer might be arriving late but is an interesting find. Using compile can really save you time if you are planning on using the regex multiple times (this is also mentioned in the docs). Below you can see that using a compiled regex is the fastest when the match method is directly called on it. passing a compiled regex to re.match makes it even slower and passing re.match with the patter string is somewhere in the middle.

>>> ipr = r'\D+((([0-2][0-5]?[0-5]?)\.){3}([0-2][0-5]?[0-5]?))\D+'
>>> average(*timeit.repeat("re.match(ipr, 'abcd100.10.255.255 ')", globals={'ipr': ipr, 're': re}))
1.5077415757028423
>>> ipr = re.compile(ipr)
>>> average(*timeit.repeat("re.match(ipr, 'abcd100.10.255.255 ')", globals={'ipr': ipr, 're': re}))
1.8324008992184038
>>> average(*timeit.repeat("ipr.match('abcd100.10.255.255 ')", globals={'ipr': ipr, 're': re}))
0.9187896518778871

I agree with Honest Abe that the match(...) in the given examples are different. They are not a one-to-one comparisons and thus, outcomes are vary. To simplify my reply, I use A, B, C, D for those functions in question. Oh yes, we are dealing with 4 functions in re.py instead of 3.

Running this piece of code:

h = re.compile('hello')                   # (A)
h.match('hello world')                    # (B)

is same as running this code:

re.match('hello', 'hello world')          # (C)

Because, when looked into the source re.py, (A + B) means:

h = re._compile('hello')                  # (D)
h.match('hello world')

and (C) is actually:

re._compile('hello').match('hello world')

So, (C) is not the same as (B). In fact, (C) calls (B) after calling (D) which is also called by (A). In other words, (C) = (A) + (B). Therefore, comparing (A + B) inside a loop has same result as (C) inside a loop.

George's regexTest.py proved this for us.

noncompiled took 4.555 seconds.           # (C) in a loop
compiledInLoop took 4.620 seconds.        # (A + B) in a loop
compiled took 2.323 seconds.              # (A) once + (B) in a loop

Everyone's interest is, how to get the result of 2.323 seconds. In order to make sure compile(...) only get called once, we need to store the compiled regex object in memory. If we are using a class, we could store the object and reuse when every time our function get called.

class Foo:
    regex = re.compile('hello')
    def my_function(text)
        return regex.match(text)

If we are not using class (which is my request today), then I have no comment. I'm still learning to use global variable in Python, and I know global variable is a bad thing.

One more point, I believe that using (A) + (B) approach has an upper hand. Here are some facts as I observed (please correct me if I'm wrong):

  1. Calls A once, it will do one search in the _cache followed by one sre_compile.compile() to create a regex object. Calls A twice, it will do two searches and one compile (because the regex object is cached).

  2. If the _cache get flushed in between, then the regex object is released from memory and Python need to compile again. (someone suggest that Python won't recompile.)

  3. If we keep the regex object by using (A), the regex object will still get into _cache and get flushed somehow. But our code keep a reference on it and the regex object will not be released from memory. Those, Python need not to compile again.

  4. The 2 seconds differences in George's test compiledInLoop vs compiled is mainly the time required to build the key and search the _cache. It doesn't mean the compile time of regex.

  5. George's reallycompile test show what happen if it really re-do the compile every time: it will be 100x slower (he reduced the loop from 1,000,000 to 10,000).

Here are the only cases that (A + B) is better than (C):

  1. If we can cache a reference of the regex object inside a class.
  2. If we need to calls (B) repeatedly (inside a loop or multiple times), we must cache the reference to regex object outside the loop.

Case that (C) is good enough:

  1. We cannot cache a reference.
  2. We only use it once in a while.
  3. In overall, we don't have too many regex (assume the compiled one never get flushed)

Just a recap, here are the A B C:

h = re.compile('hello')                   # (A)
h.match('hello world')                    # (B)
re.match('hello', 'hello world')          # (C)

Thanks for reading.


This is a good question. You often see people use re.compile without reason. It lessens readability. But sure there are lots of times when pre-compiling the expression is called for. Like when you use it repeated times in a loop or some such.

It's like everything about programming (everything in life actually). Apply common sense.


Legibility/cognitive load preference

To me, the main gain is that I only need to remember, and read, one form of the complicated regex API syntax - the <compiled_pattern>.method(xxx) form rather than that and the re.func(<pattern>, xxx) form.

The re.compile(<pattern>) is a bit of extra boilerplate, true.

But where regex are concerned, that extra compile step is unlikely to be a big cause of cognitive load. And in fact, on complicated patterns, you might even gain clarity from separating the declaration from whatever regex method you then invoke on it.

I tend to first tune complicated patterns in a website like Regex101, or even in a separate minimal test script, then bring them into my code, so separating the declaration from its use fits my workflow as well.


I've had a lot of experience running a compiled regex 1000s of times versus compiling on-the-fly, and have not noticed any perceivable difference

The votes on the accepted answer leads to the assumption that what @Triptych says is true for all cases. This is not necessarily true. One big difference is when you have to decide whether to accept a regex string or a compiled regex object as a parameter to a function:

>>> timeit.timeit(setup="""
... import re
... f=lambda x, y: x.match(y)       # accepts compiled regex as parameter
... h=re.compile('hello')
... """, stmt="f(h, 'hello world')")
0.32881879806518555
>>> timeit.timeit(setup="""
... import re
... f=lambda x, y: re.compile(x).match(y)   # compiles when called
... """, stmt="f('hello', 'hello world')")
0.809190034866333

It is always better to compile your regexs in case you need to reuse them.

Note the example in the timeit above simulates creation of a compiled regex object once at import time versus "on-the-fly" when required for a match.


For me, the biggest benefit to re.compile is being able to separate definition of the regex from its use.

Even a simple expression such as 0|[1-9][0-9]* (integer in base 10 without leading zeros) can be complex enough that you'd rather not have to retype it, check if you made any typos, and later have to recheck if there are typos when you start debugging. Plus, it's nicer to use a variable name such as num or num_b10 than 0|[1-9][0-9]*.

It's certainly possible to store strings and pass them to re.match; however, that's less readable:

num = "..."
# then, much later:
m = re.match(num, input)

Versus compiling:

num = re.compile("...")
# then, much later:
m = num.match(input)

Though it is fairly close, the last line of the second feels more natural and simpler when used repeatedly.


There is one addition perk of using re.compile(), in the form of adding comments to my regex patterns using re.VERBOSE

pattern = '''
hello[ ]world    # Some info on my pattern logic. [ ] to recognize space
'''

re.search(pattern, 'hello world', re.VERBOSE)

Although this does not affect the speed of running your code, I like to do it this way as it is part of my commenting habit. I throughly dislike spending time trying to remember the logic that went behind my code 2 months down the line when I want to make modifications.


Legibility/cognitive load preference

To me, the main gain is that I only need to remember, and read, one form of the complicated regex API syntax - the <compiled_pattern>.method(xxx) form rather than that and the re.func(<pattern>, xxx) form.

The re.compile(<pattern>) is a bit of extra boilerplate, true.

But where regex are concerned, that extra compile step is unlikely to be a big cause of cognitive load. And in fact, on complicated patterns, you might even gain clarity from separating the declaration from whatever regex method you then invoke on it.

I tend to first tune complicated patterns in a website like Regex101, or even in a separate minimal test script, then bring them into my code, so separating the declaration from its use fits my workflow as well.


FWIW:

$ python -m timeit -s "import re" "re.match('hello', 'hello world')"
100000 loops, best of 3: 3.82 usec per loop

$ python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 1.26 usec per loop

so, if you're going to be using the same regex a lot, it may be worth it to do re.compile (especially for more complex regexes).

The standard arguments against premature optimization apply, but I don't think you really lose much clarity/straightforwardness by using re.compile if you suspect that your regexps may become a performance bottleneck.

Update:

Under Python 3.6 (I suspect the above timings were done using Python 2.x) and 2018 hardware (MacBook Pro), I now get the following timings:

% python -m timeit -s "import re" "re.match('hello', 'hello world')"
1000000 loops, best of 3: 0.661 usec per loop

% python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 0.285 usec per loop

% python -m timeit -s "import re" "h=re.compile('hello'); h.match('hello world')"
1000000 loops, best of 3: 0.65 usec per loop

% python --version
Python 3.6.5 :: Anaconda, Inc.

I also added a case (notice the quotation mark differences between the last two runs) that shows that re.match(x, ...) is literally [roughly] equivalent to re.compile(x).match(...), i.e. no behind-the-scenes caching of the compiled representation seems to happen.


I just tried this myself. For the simple case of parsing a number out of a string and summing it, using a compiled regular expression object is about twice as fast as using the re methods.

As others have pointed out, the re methods (including re.compile) look up the regular expression string in a cache of previously compiled expressions. Therefore, in the normal case, the extra cost of using the re methods is simply the cost of the cache lookup.

However, examination of the code, shows the cache is limited to 100 expressions. This begs the question, how painful is it to overflow the cache? The code contains an internal interface to the regular expression compiler, re.sre_compile.compile. If we call it, we bypass the cache. It turns out to be about two orders of magnitude slower for a basic regular expression, such as r'\w+\s+([0-9_]+)\s+\w*'.

Here's my test:

#!/usr/bin/env python
import re
import time

def timed(func):
    def wrapper(*args):
        t = time.time()
        result = func(*args)
        t = time.time() - t
        print '%s took %.3f seconds.' % (func.func_name, t)
        return result
    return wrapper

regularExpression = r'\w+\s+([0-9_]+)\s+\w*'
testString = "average    2 never"

@timed
def noncompiled():
    a = 0
    for x in xrange(1000000):
        m = re.match(regularExpression, testString)
        a += int(m.group(1))
    return a

@timed
def compiled():
    a = 0
    rgx = re.compile(regularExpression)
    for x in xrange(1000000):
        m = rgx.match(testString)
        a += int(m.group(1))
    return a

@timed
def reallyCompiled():
    a = 0
    rgx = re.sre_compile.compile(regularExpression)
    for x in xrange(1000000):
        m = rgx.match(testString)
        a += int(m.group(1))
    return a


@timed
def compiledInLoop():
    a = 0
    for x in xrange(1000000):
        rgx = re.compile(regularExpression)
        m = rgx.match(testString)
        a += int(m.group(1))
    return a

@timed
def reallyCompiledInLoop():
    a = 0
    for x in xrange(10000):
        rgx = re.sre_compile.compile(regularExpression)
        m = rgx.match(testString)
        a += int(m.group(1))
    return a

r1 = noncompiled()
r2 = compiled()
r3 = reallyCompiled()
r4 = compiledInLoop()
r5 = reallyCompiledInLoop()
print "r1 = ", r1
print "r2 = ", r2
print "r3 = ", r3
print "r4 = ", r4
print "r5 = ", r5
</pre>
And here is the output on my machine:
<pre>
$ regexTest.py 
noncompiled took 4.555 seconds.
compiled took 2.323 seconds.
reallyCompiled took 2.325 seconds.
compiledInLoop took 4.620 seconds.
reallyCompiledInLoop took 4.074 seconds.
r1 =  2000000
r2 =  2000000
r3 =  2000000
r4 =  2000000
r5 =  20000

The 'reallyCompiled' methods use the internal interface, which bypasses the cache. Note the one that compiles on each loop iteration is only iterated 10,000 times, not one million.


This is a good question. You often see people use re.compile without reason. It lessens readability. But sure there are lots of times when pre-compiling the expression is called for. Like when you use it repeated times in a loop or some such.

It's like everything about programming (everything in life actually). Apply common sense.


Using the given examples:

h = re.compile('hello')
h.match('hello world')

The match method in the example above is not the same as the one used below:

re.match('hello', 'hello world')

re.compile() returns a regular expression object, which means h is a regex object.

The regex object has its own match method with the optional pos and endpos parameters:

regex.match(string[, pos[, endpos]])

pos

The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start.

endpos

The optional parameter endpos limits how far the string will be searched; it will be as if the string is endpos characters long, so only the characters from pos to endpos - 1 will be searched for a match. If endpos is less than pos, no match will be found; otherwise, if rx is a compiled regular expression object, rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).

The regex object's search, findall, and finditer methods also support these parameters.

re.match(pattern, string, flags=0) does not support them as you can see,
nor does its search, findall, and finditer counterparts.

A match object has attributes that complement these parameters:

match.pos

The value of pos which was passed to the search() or match() method of a regex object. This is the index into the string at which the RE engine started looking for a match.

match.endpos

The value of endpos which was passed to the search() or match() method of a regex object. This is the index into the string beyond which the RE engine will not go.


A regex object has two unique, possibly useful, attributes:

regex.groups

The number of capturing groups in the pattern.

regex.groupindex

A dictionary mapping any symbolic group names defined by (?P) to group numbers. The dictionary is empty if no symbolic groups were used in the pattern.


And finally, a match object has this attribute:

match.re

The regular expression object whose match() or search() method produced this match instance.


Performance difference aside, using re.compile and using the compiled regular expression object to do match (whatever regular expression related operations) makes the semantics clearer to Python run-time.

I had some painful experience of debugging some simple code:

compare = lambda s, p: re.match(p, s)

and later I'd use compare in

[x for x in data if compare(patternPhrases, x[columnIndex])]

where patternPhrases is supposed to be a variable containing regular expression string, x[columnIndex] is a variable containing string.

I had trouble that patternPhrases did not match some expected string!

But if I used the re.compile form:

compare = lambda s, p: p.match(s)

then in

[x for x in data if compare(patternPhrases, x[columnIndex])]

Python would have complained that "string does not have attribute of match", as by positional argument mapping in compare, x[columnIndex] is used as regular expression!, when I actually meant

compare = lambda p, s: p.match(s)

In my case, using re.compile is more explicit of the purpose of regular expression, when it's value is hidden to naked eyes, thus I could get more help from Python run-time checking.

So the moral of my lesson is that when the regular expression is not just literal string, then I should use re.compile to let Python to help me to assert my assumption.


FWIW:

$ python -m timeit -s "import re" "re.match('hello', 'hello world')"
100000 loops, best of 3: 3.82 usec per loop

$ python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 1.26 usec per loop

so, if you're going to be using the same regex a lot, it may be worth it to do re.compile (especially for more complex regexes).

The standard arguments against premature optimization apply, but I don't think you really lose much clarity/straightforwardness by using re.compile if you suspect that your regexps may become a performance bottleneck.

Update:

Under Python 3.6 (I suspect the above timings were done using Python 2.x) and 2018 hardware (MacBook Pro), I now get the following timings:

% python -m timeit -s "import re" "re.match('hello', 'hello world')"
1000000 loops, best of 3: 0.661 usec per loop

% python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 0.285 usec per loop

% python -m timeit -s "import re" "h=re.compile('hello'); h.match('hello world')"
1000000 loops, best of 3: 0.65 usec per loop

% python --version
Python 3.6.5 :: Anaconda, Inc.

I also added a case (notice the quotation mark differences between the last two runs) that shows that re.match(x, ...) is literally [roughly] equivalent to re.compile(x).match(...), i.e. no behind-the-scenes caching of the compiled representation seems to happen.


Interestingly, compiling does prove more efficient for me (Python 2.5.2 on Win XP):

import re
import time

rgx = re.compile('(\w+)\s+[0-9_]?\s+\w*')
str = "average    2 never"
a = 0

t = time.time()

for i in xrange(1000000):
    if re.match('(\w+)\s+[0-9_]?\s+\w*', str):
    #~ if rgx.match(str):
        a += 1

print time.time() - t

Running the above code once as is, and once with the two if lines commented the other way around, the compiled regex is twice as fast


My understanding is that those two examples are effectively equivalent. The only difference is that in the first, you can reuse the compiled regular expression elsewhere without causing it to be compiled again.

Here's a reference for you: http://diveintopython3.ep.io/refactoring.html

Calling the compiled pattern object's search function with the string 'M' accomplishes the same thing as calling re.search with both the regular expression and the string 'M'. Only much, much faster. (In fact, the re.search function simply compiles the regular expression and calls the resulting pattern object's search method for you.)


Here's a simple test case:

~$ for x in 1 10 100 1000 10000 100000 1000000; do python -m timeit -n $x -s 'import re' 're.match("[0-9]{3}-[0-9]{3}-[0-9]{4}", "123-123-1234")'; done
1 loops, best of 3: 3.1 usec per loop
10 loops, best of 3: 2.41 usec per loop
100 loops, best of 3: 2.24 usec per loop
1000 loops, best of 3: 2.21 usec per loop
10000 loops, best of 3: 2.23 usec per loop
100000 loops, best of 3: 2.24 usec per loop
1000000 loops, best of 3: 2.31 usec per loop

with re.compile:

~$ for x in 1 10 100 1000 10000 100000 1000000; do python -m timeit -n $x -s 'import re' 'r = re.compile("[0-9]{3}-[0-9]{3}-[0-9]{4}")' 'r.match("123-123-1234")'; done
1 loops, best of 3: 1.91 usec per loop
10 loops, best of 3: 0.691 usec per loop
100 loops, best of 3: 0.701 usec per loop
1000 loops, best of 3: 0.684 usec per loop
10000 loops, best of 3: 0.682 usec per loop
100000 loops, best of 3: 0.694 usec per loop
1000000 loops, best of 3: 0.702 usec per loop

So, it would seem to compiling is faster with this simple case, even if you only match once.


There is one addition perk of using re.compile(), in the form of adding comments to my regex patterns using re.VERBOSE

pattern = '''
hello[ ]world    # Some info on my pattern logic. [ ] to recognize space
'''

re.search(pattern, 'hello world', re.VERBOSE)

Although this does not affect the speed of running your code, I like to do it this way as it is part of my commenting habit. I throughly dislike spending time trying to remember the logic that went behind my code 2 months down the line when I want to make modifications.


FWIW:

$ python -m timeit -s "import re" "re.match('hello', 'hello world')"
100000 loops, best of 3: 3.82 usec per loop

$ python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 1.26 usec per loop

so, if you're going to be using the same regex a lot, it may be worth it to do re.compile (especially for more complex regexes).

The standard arguments against premature optimization apply, but I don't think you really lose much clarity/straightforwardness by using re.compile if you suspect that your regexps may become a performance bottleneck.

Update:

Under Python 3.6 (I suspect the above timings were done using Python 2.x) and 2018 hardware (MacBook Pro), I now get the following timings:

% python -m timeit -s "import re" "re.match('hello', 'hello world')"
1000000 loops, best of 3: 0.661 usec per loop

% python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 0.285 usec per loop

% python -m timeit -s "import re" "h=re.compile('hello'); h.match('hello world')"
1000000 loops, best of 3: 0.65 usec per loop

% python --version
Python 3.6.5 :: Anaconda, Inc.

I also added a case (notice the quotation mark differences between the last two runs) that shows that re.match(x, ...) is literally [roughly] equivalent to re.compile(x).match(...), i.e. no behind-the-scenes caching of the compiled representation seems to happen.


For me, the biggest benefit to re.compile is being able to separate definition of the regex from its use.

Even a simple expression such as 0|[1-9][0-9]* (integer in base 10 without leading zeros) can be complex enough that you'd rather not have to retype it, check if you made any typos, and later have to recheck if there are typos when you start debugging. Plus, it's nicer to use a variable name such as num or num_b10 than 0|[1-9][0-9]*.

It's certainly possible to store strings and pass them to re.match; however, that's less readable:

num = "..."
# then, much later:
m = re.match(num, input)

Versus compiling:

num = re.compile("...")
# then, much later:
m = num.match(input)

Though it is fairly close, the last line of the second feels more natural and simpler when used repeatedly.


Using the given examples:

h = re.compile('hello')
h.match('hello world')

The match method in the example above is not the same as the one used below:

re.match('hello', 'hello world')

re.compile() returns a regular expression object, which means h is a regex object.

The regex object has its own match method with the optional pos and endpos parameters:

regex.match(string[, pos[, endpos]])

pos

The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start.

endpos

The optional parameter endpos limits how far the string will be searched; it will be as if the string is endpos characters long, so only the characters from pos to endpos - 1 will be searched for a match. If endpos is less than pos, no match will be found; otherwise, if rx is a compiled regular expression object, rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).

The regex object's search, findall, and finditer methods also support these parameters.

re.match(pattern, string, flags=0) does not support them as you can see,
nor does its search, findall, and finditer counterparts.

A match object has attributes that complement these parameters:

match.pos

The value of pos which was passed to the search() or match() method of a regex object. This is the index into the string at which the RE engine started looking for a match.

match.endpos

The value of endpos which was passed to the search() or match() method of a regex object. This is the index into the string beyond which the RE engine will not go.


A regex object has two unique, possibly useful, attributes:

regex.groups

The number of capturing groups in the pattern.

regex.groupindex

A dictionary mapping any symbolic group names defined by (?P) to group numbers. The dictionary is empty if no symbolic groups were used in the pattern.


And finally, a match object has this attribute:

match.re

The regular expression object whose match() or search() method produced this match instance.


(months later) it's easy to add your own cache around re.match, or anything else for that matter --

""" Re.py: Re.match = re.match + cache  
    efficiency: re.py does this already (but what's _MAXCACHE ?)
    readability, inline / separate: matter of taste
"""

import re

cache = {}
_re_type = type( re.compile( "" ))

def match( pattern, str, *opt ):
    """ Re.match = re.match + cache re.compile( pattern ) 
    """
    if type(pattern) == _re_type:
        cpat = pattern
    elif pattern in cache:
        cpat = cache[pattern]
    else:
        cpat = cache[pattern] = re.compile( pattern, *opt )
    return cpat.match( str )

# def search ...

A wibni, wouldn't it be nice if: cachehint( size= ), cacheinfo() -> size, hits, nclear ...


FWIW:

$ python -m timeit -s "import re" "re.match('hello', 'hello world')"
100000 loops, best of 3: 3.82 usec per loop

$ python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 1.26 usec per loop

so, if you're going to be using the same regex a lot, it may be worth it to do re.compile (especially for more complex regexes).

The standard arguments against premature optimization apply, but I don't think you really lose much clarity/straightforwardness by using re.compile if you suspect that your regexps may become a performance bottleneck.

Update:

Under Python 3.6 (I suspect the above timings were done using Python 2.x) and 2018 hardware (MacBook Pro), I now get the following timings:

% python -m timeit -s "import re" "re.match('hello', 'hello world')"
1000000 loops, best of 3: 0.661 usec per loop

% python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 0.285 usec per loop

% python -m timeit -s "import re" "h=re.compile('hello'); h.match('hello world')"
1000000 loops, best of 3: 0.65 usec per loop

% python --version
Python 3.6.5 :: Anaconda, Inc.

I also added a case (notice the quotation mark differences between the last two runs) that shows that re.match(x, ...) is literally [roughly] equivalent to re.compile(x).match(...), i.e. no behind-the-scenes caching of the compiled representation seems to happen.


This is a good question. You often see people use re.compile without reason. It lessens readability. But sure there are lots of times when pre-compiling the expression is called for. Like when you use it repeated times in a loop or some such.

It's like everything about programming (everything in life actually). Apply common sense.


Regular Expressions are compiled before being used when using the second version. If you are going to executing it many times it is definatly better to compile it first. If not compiling every time you match for one off's is fine.


In general, I find it is easier to use flags (at least easier to remember how), like re.I when compiling patterns than to use flags inline.

>>> foo_pat = re.compile('foo',re.I)
>>> foo_pat.findall('some string FoO bar')
['FoO']

vs

>>> re.findall('(?i)foo','some string FoO bar')
['FoO']

Regular Expressions are compiled before being used when using the second version. If you are going to executing it many times it is definatly better to compile it first. If not compiling every time you match for one off's is fine.


According to the Python documentation:

The sequence

prog = re.compile(pattern)
result = prog.match(string)

is equivalent to

result = re.match(pattern, string)

but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.

So my conclusion is, if you are going to match the same pattern for many different texts, you better precompile it.


i'd like to motivate that pre-compiling is both conceptually and 'literately' (as in 'literate programming') advantageous. have a look at this code snippet:

from re import compile as _Re

class TYPO:

  def text_has_foobar( self, text ):
    return self._text_has_foobar_re_search( text ) is not None
  _text_has_foobar_re_search = _Re( r"""(?i)foobar""" ).search

TYPO = TYPO()

in your application, you'd write:

from TYPO import TYPO
print( TYPO.text_has_foobar( 'FOObar ) )

this is about as simple in terms of functionality as it can get. because this is example is so short, i conflated the way to get _text_has_foobar_re_search all in one line. the disadvantage of this code is that it occupies a little memory for whatever the lifetime of the TYPO library object is; the advantage is that when doing a foobar search, you'll get away with two function calls and two class dictionary lookups. how many regexes are cached by re and the overhead of that cache are irrelevant here.

compare this with the more usual style, below:

import re

class Typo:

  def text_has_foobar( self, text ):
    return re.compile( r"""(?i)foobar""" ).search( text ) is not None

In the application:

typo = Typo()
print( typo.text_has_foobar( 'FOObar ) )

I readily admit that my style is highly unusual for python, maybe even debatable. however, in the example that more closely matches how python is mostly used, in order to do a single match, we must instantiate an object, do three instance dictionary lookups, and perform three function calls; additionally, we might get into re caching troubles when using more than 100 regexes. also, the regular expression gets hidden inside the method body, which most of the time is not such a good idea.

be it said that every subset of measures---targeted, aliased import statements; aliased methods where applicable; reduction of function calls and object dictionary lookups---can help reduce computational and conceptual complexity.


Mostly, there is little difference whether you use re.compile or not. Internally, all of the functions are implemented in terms of a compile step:

def match(pattern, string, flags=0):
    return _compile(pattern, flags).match(string)

def fullmatch(pattern, string, flags=0):
    return _compile(pattern, flags).fullmatch(string)

def search(pattern, string, flags=0):
    return _compile(pattern, flags).search(string)

def sub(pattern, repl, string, count=0, flags=0):
    return _compile(pattern, flags).sub(repl, string, count)

def subn(pattern, repl, string, count=0, flags=0):
    return _compile(pattern, flags).subn(repl, string, count)

def split(pattern, string, maxsplit=0, flags=0):
    return _compile(pattern, flags).split(string, maxsplit)

def findall(pattern, string, flags=0):
    return _compile(pattern, flags).findall(string)

def finditer(pattern, string, flags=0):
    return _compile(pattern, flags).finditer(string)

In addition, re.compile() bypasses the extra indirection and caching logic:

_cache = {}

_pattern_type = type(sre_compile.compile("", 0))

_MAXCACHE = 512
def _compile(pattern, flags):
    # internal: compile pattern
    try:
        p, loc = _cache[type(pattern), pattern, flags]
        if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
            return p
    except KeyError:
        pass
    if isinstance(pattern, _pattern_type):
        if flags:
            raise ValueError(
                "cannot process flags argument with a compiled pattern")
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if not (flags & DEBUG):
        if len(_cache) >= _MAXCACHE:
            _cache.clear()
        if p.flags & LOCALE:
            if not _locale:
                return p
            loc = _locale.setlocale(_locale.LC_CTYPE)
        else:
            loc = None
        _cache[type(pattern), pattern, flags] = p, loc
    return p

In addition to the small speed benefit from using re.compile, people also like the readability that comes from naming potentially complex pattern specifications and separating them from the business logic where there are applied:

#### Patterns ############################################################
number_pattern = re.compile(r'\d+(\.\d*)?')    # Integer or decimal number
assign_pattern = re.compile(r':=')             # Assignment operator
identifier_pattern = re.compile(r'[A-Za-z]+')  # Identifiers
whitespace_pattern = re.compile(r'[\t ]+')     # Spaces and tabs

#### Applications ########################################################

if whitespace_pattern.match(s): business_logic_rule_1()
if assign_pattern.match(s): business_logic_rule_2()

Note, one other respondent incorrectly believed that pyc files stored compiled patterns directly; however, in reality they are rebuilt each time when the PYC is loaded:

>>> from dis import dis
>>> with open('tmp.pyc', 'rb') as f:
        f.read(8)
        dis(marshal.load(f))

  1           0 LOAD_CONST               0 (-1)
              3 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (re)
              9 STORE_NAME               0 (re)

  3          12 LOAD_NAME                0 (re)
             15 LOAD_ATTR                1 (compile)
             18 LOAD_CONST               2 ('[aeiou]{2,5}')
             21 CALL_FUNCTION            1
             24 STORE_NAME               2 (lc_vowels)
             27 LOAD_CONST               1 (None)
             30 RETURN_VALUE

The above disassembly comes from the PYC file for a tmp.py containing:

import re
lc_vowels = re.compile(r'[aeiou]{2,5}')

Interestingly, compiling does prove more efficient for me (Python 2.5.2 on Win XP):

import re
import time

rgx = re.compile('(\w+)\s+[0-9_]?\s+\w*')
str = "average    2 never"
a = 0

t = time.time()

for i in xrange(1000000):
    if re.match('(\w+)\s+[0-9_]?\s+\w*', str):
    #~ if rgx.match(str):
        a += 1

print time.time() - t

Running the above code once as is, and once with the two if lines commented the other way around, the compiled regex is twice as fast


(months later) it's easy to add your own cache around re.match, or anything else for that matter --

""" Re.py: Re.match = re.match + cache  
    efficiency: re.py does this already (but what's _MAXCACHE ?)
    readability, inline / separate: matter of taste
"""

import re

cache = {}
_re_type = type( re.compile( "" ))

def match( pattern, str, *opt ):
    """ Re.match = re.match + cache re.compile( pattern ) 
    """
    if type(pattern) == _re_type:
        cpat = pattern
    elif pattern in cache:
        cpat = cache[pattern]
    else:
        cpat = cache[pattern] = re.compile( pattern, *opt )
    return cpat.match( str )

# def search ...

A wibni, wouldn't it be nice if: cachehint( size= ), cacheinfo() -> size, hits, nclear ...


Besides the performance.

Using compile helps me to distinguish the concepts of
1. module(re),
2. regex object
3. match object
When I started learning regex

#regex object
regex_object = re.compile(r'[a-zA-Z]+')
#match object
match_object = regex_object.search('1.Hello')
#matching content
match_object.group()
output:
Out[60]: 'Hello'
V.S.
re.search(r'[a-zA-Z]+','1.Hello').group()
Out[61]: 'Hello'

As a complement, I made an exhaustive cheatsheet of module re for your reference.

regex = {
'brackets':{'single_character': ['[]', '.', {'negate':'^'}],
            'capturing_group' : ['()','(?:)', '(?!)' '|', '\\', 'backreferences and named group'],
            'repetition'      : ['{}', '*?', '+?', '??', 'greedy v.s. lazy ?']},
'lookaround' :{'lookahead'  : ['(?=...)', '(?!...)'],
            'lookbehind' : ['(?<=...)','(?<!...)'],
            'caputuring' : ['(?P<name>...)', '(?P=name)', '(?:)'],},
'escapes':{'anchor'          : ['^', '\b', '$'],
          'non_printable'   : ['\n', '\t', '\r', '\f', '\v'],
          'shorthand'       : ['\d', '\w', '\s']},
'methods': {['search', 'match', 'findall', 'finditer'],
              ['split', 'sub']},
'match_object': ['group','groups', 'groupdict','start', 'end', 'span',]
}

As an alternative answer, as I see that it hasn't been mentioned before, I'll go ahead and quote the Python 3 docs:

Should you use these module-level functions, or should you get the pattern and call its methods yourself? If you’re accessing a regex within a loop, pre-compiling it will save a few function calls. Outside of loops, there’s not much difference thanks to the internal cache.


This is a good question. You often see people use re.compile without reason. It lessens readability. But sure there are lots of times when pre-compiling the expression is called for. Like when you use it repeated times in a loop or some such.

It's like everything about programming (everything in life actually). Apply common sense.


Besides the performance.

Using compile helps me to distinguish the concepts of
1. module(re),
2. regex object
3. match object
When I started learning regex

#regex object
regex_object = re.compile(r'[a-zA-Z]+')
#match object
match_object = regex_object.search('1.Hello')
#matching content
match_object.group()
output:
Out[60]: 'Hello'
V.S.
re.search(r'[a-zA-Z]+','1.Hello').group()
Out[61]: 'Hello'

As a complement, I made an exhaustive cheatsheet of module re for your reference.

regex = {
'brackets':{'single_character': ['[]', '.', {'negate':'^'}],
            'capturing_group' : ['()','(?:)', '(?!)' '|', '\\', 'backreferences and named group'],
            'repetition'      : ['{}', '*?', '+?', '??', 'greedy v.s. lazy ?']},
'lookaround' :{'lookahead'  : ['(?=...)', '(?!...)'],
            'lookbehind' : ['(?<=...)','(?<!...)'],
            'caputuring' : ['(?P<name>...)', '(?P=name)', '(?:)'],},
'escapes':{'anchor'          : ['^', '\b', '$'],
          'non_printable'   : ['\n', '\t', '\r', '\f', '\v'],
          'shorthand'       : ['\d', '\w', '\s']},
'methods': {['search', 'match', 'findall', 'finditer'],
              ['split', 'sub']},
'match_object': ['group','groups', 'groupdict','start', 'end', 'span',]
}

According to the Python documentation:

The sequence

prog = re.compile(pattern)
result = prog.match(string)

is equivalent to

result = re.match(pattern, string)

but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.

So my conclusion is, if you are going to match the same pattern for many different texts, you better precompile it.


Here's a simple test case:

~$ for x in 1 10 100 1000 10000 100000 1000000; do python -m timeit -n $x -s 'import re' 're.match("[0-9]{3}-[0-9]{3}-[0-9]{4}", "123-123-1234")'; done
1 loops, best of 3: 3.1 usec per loop
10 loops, best of 3: 2.41 usec per loop
100 loops, best of 3: 2.24 usec per loop
1000 loops, best of 3: 2.21 usec per loop
10000 loops, best of 3: 2.23 usec per loop
100000 loops, best of 3: 2.24 usec per loop
1000000 loops, best of 3: 2.31 usec per loop

with re.compile:

~$ for x in 1 10 100 1000 10000 100000 1000000; do python -m timeit -n $x -s 'import re' 'r = re.compile("[0-9]{3}-[0-9]{3}-[0-9]{4}")' 'r.match("123-123-1234")'; done
1 loops, best of 3: 1.91 usec per loop
10 loops, best of 3: 0.691 usec per loop
100 loops, best of 3: 0.701 usec per loop
1000 loops, best of 3: 0.684 usec per loop
10000 loops, best of 3: 0.682 usec per loop
100000 loops, best of 3: 0.694 usec per loop
1000000 loops, best of 3: 0.702 usec per loop

So, it would seem to compiling is faster with this simple case, even if you only match once.


Here is an example where using re.compile is over 50 times faster, as requested.

The point is just the same as what I made in the comment above, namely, using re.compile can be a significant advantage when your usage is such as to not benefit much from the compilation cache. This happens at least in one particular case (that I ran into in practice), namely when all of the following are true:

  • You have a lot of regex patterns (more than re._MAXCACHE, whose default is currently 512), and
  • you use these regexes a lot of times, and
  • you consecutive usages of the same pattern are separated by more than re._MAXCACHE other regexes in between, so that each one gets flushed from the cache between consecutive usages.
import re
import time

def setup(N=1000):
    # Patterns 'a.*a', 'a.*b', ..., 'z.*z'
    patterns = [chr(i) + '.*' + chr(j)
                    for i in range(ord('a'), ord('z') + 1)
                    for j in range(ord('a'), ord('z') + 1)]
    # If this assertion below fails, just add more (distinct) patterns.
    # assert(re._MAXCACHE < len(patterns))
    # N strings. Increase N for larger effect.
    strings = ['abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'] * N
    return (patterns, strings)

def without_compile():
    print('Without re.compile:')
    patterns, strings = setup()
    print('searching')
    count = 0
    for s in strings:
        for pat in patterns:
            count += bool(re.search(pat, s))
    return count

def without_compile_cache_friendly():
    print('Without re.compile, cache-friendly order:')
    patterns, strings = setup()
    print('searching')
    count = 0
    for pat in patterns:
        for s in strings:
            count += bool(re.search(pat, s))
    return count

def with_compile():
    print('With re.compile:')
    patterns, strings = setup()
    print('compiling')
    compiled = [re.compile(pattern) for pattern in patterns]
    print('searching')
    count = 0
    for s in strings:
        for regex in compiled:
            count += bool(regex.search(s))
    return count

start = time.time()
print(with_compile())
d1 = time.time() - start
print(f'-- That took {d1:.2f} seconds.\n')

start = time.time()
print(without_compile_cache_friendly())
d2 = time.time() - start
print(f'-- That took {d2:.2f} seconds.\n')

start = time.time()
print(without_compile())
d3 = time.time() - start
print(f'-- That took {d3:.2f} seconds.\n')

print(f'Ratio: {d3/d1:.2f}')

Example output I get on my laptop (Python 3.7.7):

With re.compile:
compiling
searching
676000
-- That took 0.33 seconds.

Without re.compile, cache-friendly order:
searching
676000
-- That took 0.67 seconds.

Without re.compile:
searching
676000
-- That took 23.54 seconds.

Ratio: 70.89

I didn't bother with timeit as the difference is so stark, but I get qualitatively similar numbers each time. Note that even without re.compile, using the same regex multiple times and moving on to the next one wasn't so bad (only about 2 times as slow as with re.compile), but in the other order (looping through many regexes), it is significantly worse, as expected. Also, increasing the cache size works too: simply setting re._MAXCACHE = len(patterns) in setup() above (of course I don't recommend doing such things in production as names with underscores are conventionally “private”) drops the ~23 seconds back down to ~0.7 seconds, which also matches our understanding.


Mostly, there is little difference whether you use re.compile or not. Internally, all of the functions are implemented in terms of a compile step:

def match(pattern, string, flags=0):
    return _compile(pattern, flags).match(string)

def fullmatch(pattern, string, flags=0):
    return _compile(pattern, flags).fullmatch(string)

def search(pattern, string, flags=0):
    return _compile(pattern, flags).search(string)

def sub(pattern, repl, string, count=0, flags=0):
    return _compile(pattern, flags).sub(repl, string, count)

def subn(pattern, repl, string, count=0, flags=0):
    return _compile(pattern, flags).subn(repl, string, count)

def split(pattern, string, maxsplit=0, flags=0):
    return _compile(pattern, flags).split(string, maxsplit)

def findall(pattern, string, flags=0):
    return _compile(pattern, flags).findall(string)

def finditer(pattern, string, flags=0):
    return _compile(pattern, flags).finditer(string)

In addition, re.compile() bypasses the extra indirection and caching logic:

_cache = {}

_pattern_type = type(sre_compile.compile("", 0))

_MAXCACHE = 512
def _compile(pattern, flags):
    # internal: compile pattern
    try:
        p, loc = _cache[type(pattern), pattern, flags]
        if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
            return p
    except KeyError:
        pass
    if isinstance(pattern, _pattern_type):
        if flags:
            raise ValueError(
                "cannot process flags argument with a compiled pattern")
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if not (flags & DEBUG):
        if len(_cache) >= _MAXCACHE:
            _cache.clear()
        if p.flags & LOCALE:
            if not _locale:
                return p
            loc = _locale.setlocale(_locale.LC_CTYPE)
        else:
            loc = None
        _cache[type(pattern), pattern, flags] = p, loc
    return p

In addition to the small speed benefit from using re.compile, people also like the readability that comes from naming potentially complex pattern specifications and separating them from the business logic where there are applied:

#### Patterns ############################################################
number_pattern = re.compile(r'\d+(\.\d*)?')    # Integer or decimal number
assign_pattern = re.compile(r':=')             # Assignment operator
identifier_pattern = re.compile(r'[A-Za-z]+')  # Identifiers
whitespace_pattern = re.compile(r'[\t ]+')     # Spaces and tabs

#### Applications ########################################################

if whitespace_pattern.match(s): business_logic_rule_1()
if assign_pattern.match(s): business_logic_rule_2()

Note, one other respondent incorrectly believed that pyc files stored compiled patterns directly; however, in reality they are rebuilt each time when the PYC is loaded:

>>> from dis import dis
>>> with open('tmp.pyc', 'rb') as f:
        f.read(8)
        dis(marshal.load(f))

  1           0 LOAD_CONST               0 (-1)
              3 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (re)
              9 STORE_NAME               0 (re)

  3          12 LOAD_NAME                0 (re)
             15 LOAD_ATTR                1 (compile)
             18 LOAD_CONST               2 ('[aeiou]{2,5}')
             21 CALL_FUNCTION            1
             24 STORE_NAME               2 (lc_vowels)
             27 LOAD_CONST               1 (None)
             30 RETURN_VALUE

The above disassembly comes from the PYC file for a tmp.py containing:

import re
lc_vowels = re.compile(r'[aeiou]{2,5}')

My understanding is that those two examples are effectively equivalent. The only difference is that in the first, you can reuse the compiled regular expression elsewhere without causing it to be compiled again.

Here's a reference for you: http://diveintopython3.ep.io/refactoring.html

Calling the compiled pattern object's search function with the string 'M' accomplishes the same thing as calling re.search with both the regular expression and the string 'M'. Only much, much faster. (In fact, the re.search function simply compiles the regular expression and calls the resulting pattern object's search method for you.)


I really respect all the above answers. From my opinion Yes! For sure it is worth to use re.compile instead of compiling the regex, again and again, every time.

Using re.compile makes your code more dynamic, as you can call the already compiled regex, instead of compiling again and aagain. This thing benefits you in cases:

  1. Processor Efforts
  2. Time Complexity.
  3. Makes regex Universal.(can be used in findall, search, match)
  4. And makes your program looks cool.

Example :

  example_string = "The room number of her room is 26A7B."
  find_alpha_numeric_string = re.compile(r"\b\w+\b")

Using in Findall

 find_alpha_numeric_string.findall(example_string)

Using in search

  find_alpha_numeric_string.search(example_string)

Similarly you can use it for: Match and Substitute


My understanding is that those two examples are effectively equivalent. The only difference is that in the first, you can reuse the compiled regular expression elsewhere without causing it to be compiled again.

Here's a reference for you: http://diveintopython3.ep.io/refactoring.html

Calling the compiled pattern object's search function with the string 'M' accomplishes the same thing as calling re.search with both the regular expression and the string 'M'. Only much, much faster. (In fact, the re.search function simply compiles the regular expression and calls the resulting pattern object's search method for you.)


I really respect all the above answers. From my opinion Yes! For sure it is worth to use re.compile instead of compiling the regex, again and again, every time.

Using re.compile makes your code more dynamic, as you can call the already compiled regex, instead of compiling again and aagain. This thing benefits you in cases:

  1. Processor Efforts
  2. Time Complexity.
  3. Makes regex Universal.(can be used in findall, search, match)
  4. And makes your program looks cool.

Example :

  example_string = "The room number of her room is 26A7B."
  find_alpha_numeric_string = re.compile(r"\b\w+\b")

Using in Findall

 find_alpha_numeric_string.findall(example_string)

Using in search

  find_alpha_numeric_string.search(example_string)

Similarly you can use it for: Match and Substitute


Here is an example where using re.compile is over 50 times faster, as requested.

The point is just the same as what I made in the comment above, namely, using re.compile can be a significant advantage when your usage is such as to not benefit much from the compilation cache. This happens at least in one particular case (that I ran into in practice), namely when all of the following are true:

  • You have a lot of regex patterns (more than re._MAXCACHE, whose default is currently 512), and
  • you use these regexes a lot of times, and
  • you consecutive usages of the same pattern are separated by more than re._MAXCACHE other regexes in between, so that each one gets flushed from the cache between consecutive usages.
import re
import time

def setup(N=1000):
    # Patterns 'a.*a', 'a.*b', ..., 'z.*z'
    patterns = [chr(i) + '.*' + chr(j)
                    for i in range(ord('a'), ord('z') + 1)
                    for j in range(ord('a'), ord('z') + 1)]
    # If this assertion below fails, just add more (distinct) patterns.
    # assert(re._MAXCACHE < len(patterns))
    # N strings. Increase N for larger effect.
    strings = ['abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'] * N
    return (patterns, strings)

def without_compile():
    print('Without re.compile:')
    patterns, strings = setup()
    print('searching')
    count = 0
    for s in strings:
        for pat in patterns:
            count += bool(re.search(pat, s))
    return count

def without_compile_cache_friendly():
    print('Without re.compile, cache-friendly order:')
    patterns, strings = setup()
    print('searching')
    count = 0
    for pat in patterns:
        for s in strings:
            count += bool(re.search(pat, s))
    return count

def with_compile():
    print('With re.compile:')
    patterns, strings = setup()
    print('compiling')
    compiled = [re.compile(pattern) for pattern in patterns]
    print('searching')
    count = 0
    for s in strings:
        for regex in compiled:
            count += bool(regex.search(s))
    return count

start = time.time()
print(with_compile())
d1 = time.time() - start
print(f'-- That took {d1:.2f} seconds.\n')

start = time.time()
print(without_compile_cache_friendly())
d2 = time.time() - start
print(f'-- That took {d2:.2f} seconds.\n')

start = time.time()
print(without_compile())
d3 = time.time() - start
print(f'-- That took {d3:.2f} seconds.\n')

print(f'Ratio: {d3/d1:.2f}')

Example output I get on my laptop (Python 3.7.7):

With re.compile:
compiling
searching
676000
-- That took 0.33 seconds.

Without re.compile, cache-friendly order:
searching
676000
-- That took 0.67 seconds.

Without re.compile:
searching
676000
-- That took 23.54 seconds.

Ratio: 70.89

I didn't bother with timeit as the difference is so stark, but I get qualitatively similar numbers each time. Note that even without re.compile, using the same regex multiple times and moving on to the next one wasn't so bad (only about 2 times as slow as with re.compile), but in the other order (looping through many regexes), it is significantly worse, as expected. Also, increasing the cache size works too: simply setting re._MAXCACHE = len(patterns) in setup() above (of course I don't recommend doing such things in production as names with underscores are conventionally “private”) drops the ~23 seconds back down to ~0.7 seconds, which also matches our understanding.


Interestingly, compiling does prove more efficient for me (Python 2.5.2 on Win XP):

import re
import time

rgx = re.compile('(\w+)\s+[0-9_]?\s+\w*')
str = "average    2 never"
a = 0

t = time.time()

for i in xrange(1000000):
    if re.match('(\w+)\s+[0-9_]?\s+\w*', str):
    #~ if rgx.match(str):
        a += 1

print time.time() - t

Running the above code once as is, and once with the two if lines commented the other way around, the compiled regex is twice as fast


In general, I find it is easier to use flags (at least easier to remember how), like re.I when compiling patterns than to use flags inline.

>>> foo_pat = re.compile('foo',re.I)
>>> foo_pat.findall('some string FoO bar')
['FoO']

vs

>>> re.findall('(?i)foo','some string FoO bar')
['FoO']

I just tried this myself. For the simple case of parsing a number out of a string and summing it, using a compiled regular expression object is about twice as fast as using the re methods.

As others have pointed out, the re methods (including re.compile) look up the regular expression string in a cache of previously compiled expressions. Therefore, in the normal case, the extra cost of using the re methods is simply the cost of the cache lookup.

However, examination of the code, shows the cache is limited to 100 expressions. This begs the question, how painful is it to overflow the cache? The code contains an internal interface to the regular expression compiler, re.sre_compile.compile. If we call it, we bypass the cache. It turns out to be about two orders of magnitude slower for a basic regular expression, such as r'\w+\s+([0-9_]+)\s+\w*'.

Here's my test:

#!/usr/bin/env python
import re
import time

def timed(func):
    def wrapper(*args):
        t = time.time()
        result = func(*args)
        t = time.time() - t
        print '%s took %.3f seconds.' % (func.func_name, t)
        return result
    return wrapper

regularExpression = r'\w+\s+([0-9_]+)\s+\w*'
testString = "average    2 never"

@timed
def noncompiled():
    a = 0
    for x in xrange(1000000):
        m = re.match(regularExpression, testString)
        a += int(m.group(1))
    return a

@timed
def compiled():
    a = 0
    rgx = re.compile(regularExpression)
    for x in xrange(1000000):
        m = rgx.match(testString)
        a += int(m.group(1))
    return a

@timed
def reallyCompiled():
    a = 0
    rgx = re.sre_compile.compile(regularExpression)
    for x in xrange(1000000):
        m = rgx.match(testString)
        a += int(m.group(1))
    return a


@timed
def compiledInLoop():
    a = 0
    for x in xrange(1000000):
        rgx = re.compile(regularExpression)
        m = rgx.match(testString)
        a += int(m.group(1))
    return a

@timed
def reallyCompiledInLoop():
    a = 0
    for x in xrange(10000):
        rgx = re.sre_compile.compile(regularExpression)
        m = rgx.match(testString)
        a += int(m.group(1))
    return a

r1 = noncompiled()
r2 = compiled()
r3 = reallyCompiled()
r4 = compiledInLoop()
r5 = reallyCompiledInLoop()
print "r1 = ", r1
print "r2 = ", r2
print "r3 = ", r3
print "r4 = ", r4
print "r5 = ", r5
</pre>
And here is the output on my machine:
<pre>
$ regexTest.py 
noncompiled took 4.555 seconds.
compiled took 2.323 seconds.
reallyCompiled took 2.325 seconds.
compiledInLoop took 4.620 seconds.
reallyCompiledInLoop took 4.074 seconds.
r1 =  2000000
r2 =  2000000
r3 =  2000000
r4 =  2000000
r5 =  20000

The 'reallyCompiled' methods use the internal interface, which bypasses the cache. Note the one that compiles on each loop iteration is only iterated 10,000 times, not one million.


I ran this test before stumbling upon the discussion here. However, having run it I thought I'd at least post my results.

I stole and bastardized the example in Jeff Friedl's "Mastering Regular Expressions". This is on a macbook running OSX 10.6 (2Ghz intel core 2 duo, 4GB ram). Python version is 2.6.1.

Run 1 - using re.compile

import re 
import time 
import fpformat
Regex1 = re.compile('^(a|b|c|d|e|f|g)+$') 
Regex2 = re.compile('^[a-g]+$')
TimesToDo = 1000
TestString = "" 
for i in range(1000):
    TestString += "abababdedfg"
StartTime = time.time() 
for i in range(TimesToDo):
    Regex1.search(TestString) 
Seconds = time.time() - StartTime 
print "Alternation takes " + fpformat.fix(Seconds,3) + " seconds"

StartTime = time.time() 
for i in range(TimesToDo):
    Regex2.search(TestString) 
Seconds = time.time() - StartTime 
print "Character Class takes " + fpformat.fix(Seconds,3) + " seconds"

Alternation takes 2.299 seconds
Character Class takes 0.107 seconds

Run 2 - Not using re.compile

import re 
import time 
import fpformat

TimesToDo = 1000
TestString = "" 
for i in range(1000):
    TestString += "abababdedfg"
StartTime = time.time() 
for i in range(TimesToDo):
    re.search('^(a|b|c|d|e|f|g)+$',TestString) 
Seconds = time.time() - StartTime 
print "Alternation takes " + fpformat.fix(Seconds,3) + " seconds"

StartTime = time.time() 
for i in range(TimesToDo):
    re.search('^[a-g]+$',TestString) 
Seconds = time.time() - StartTime 
print "Character Class takes " + fpformat.fix(Seconds,3) + " seconds"

Alternation takes 2.508 seconds
Character Class takes 0.109 seconds