[regex] What special characters must be escaped in regular expressions?

I am tired of always trying to guess, if I should escape special characters like '()[]{}|' etc. when using many implementations of regexps.

It is different with, for example, Python, sed, grep, awk, Perl, rename, Apache, find and so on. Is there any rule set which tells when I should, and when I should not, escape special characters? Does it depend on the regexp type, like PCRE, POSIX or extended regexps?

This question is related to regex

The answer is


Unfortunately there really isn't a set set of escape codes since it varies based on the language you are using.

However, keeping a page like the Regular Expression Tools Page or this Regular Expression Cheatsheet can go a long way to help you quickly filter things out.


To know when and what to escape without attempts is necessary to understand precisely the chain of contexts the string pass through. You will specify the string from the farthest side to its final destination which is the memory handled by the regexp parsing code.

Be aware how the string in memory is processed: if can be a plain string inside the code, or a string entered to the command line, but a could be either an interactive command line or a command line stated inside a shell script file, or inside a variable in memory mentioned by the code, or an (string)argument through further evaluation, or a string containing code generated dynamically with any sort of encapsulation...

Each of this context assigned some characters with special functionality.

When you want to pass the character literally without using its special function (local to the context), than that's the case you have to escape it, for the next context... which might need some other escape characters which might additionally need to be escaped in the preceding context(s). Furthermore there can be things like character encoding (the most insidious is utf-8 because it look like ASCII for common characters, but might be optionally interpreted even by the terminal depending on its settings so it might behave differently, then the encoding attribute of HTML/XML, it's necessary to understand the process precisely right.

E.g. A regexp in the command line starting with perl -npe, needs to be transferred to a set of exec system calls connecting as pipe the file handles, each of this exec system calls just has a list of arguments that were separated by (non escaped)spaces, and possibly pipes(|) and redirection (> N> N>&M), parenthesis, interactive expansion of * and ?, $(()) ... (all this are special characters used by the *sh which might appear to interfere with the character of the regular expression in the next context, but they are evaluated in order: before the command line. The command line is read by a program as bash/sh/csh/tcsh/zsh, essentially inside double quote or single quote the escape is simpler but it is not necessary to quote a string in the command line because mostly the space has to be prefixed with backslash and the quote are not necessary leaving available the expand functionality for characters * and ?, but this parse as different context as within quote. Then when the command line is evaluated the regexp obtained in memory (not as written in the command line) receives the same treatment as it would be in a source file. For regexp there is character-set context within square brackets [ ], perl regular expression can be quoted by a large set of non alfa-numeric characters (E.g. m// or m:/better/for/path: ...).

You have more details about characters in other answer, which are very specific to the final regexp context. As I noted you mention that you find the regexp escape with attempts, that's probably because different context has different set of character that confused your memory of attempts (often backslash is the character used in those different context to escape a literal character instead of its function).


Unfortunately there really isn't a set set of escape codes since it varies based on the language you are using.

However, keeping a page like the Regular Expression Tools Page or this Regular Expression Cheatsheet can go a long way to help you quickly filter things out.


Unfortunately, the meaning of things like ( and \( are swapped between Emacs style regular expressions and most other styles. So if you try to escape these you may be doing the opposite of what you want.

So you really have to know what style you are trying to quote.


POSIX recognizes multiple variations on regular expressions - basic regular expressions (BRE) and extended regular expressions (ERE). And even then, there are quirks because of the historical implementations of the utilities standardized by POSIX.

There isn't a simple rule for when to use which notation, or even which notation a given command uses.

Check out Jeff Friedl's Mastering Regular Expressions book.


Really, there isn't. there are about a half-zillion different regex syntaxes; they seem to come down to Perl, EMACS/GNU, and AT&T in general, but I'm always getting surprised too.


To know when and what to escape without attempts is necessary to understand precisely the chain of contexts the string pass through. You will specify the string from the farthest side to its final destination which is the memory handled by the regexp parsing code.

Be aware how the string in memory is processed: if can be a plain string inside the code, or a string entered to the command line, but a could be either an interactive command line or a command line stated inside a shell script file, or inside a variable in memory mentioned by the code, or an (string)argument through further evaluation, or a string containing code generated dynamically with any sort of encapsulation...

Each of this context assigned some characters with special functionality.

When you want to pass the character literally without using its special function (local to the context), than that's the case you have to escape it, for the next context... which might need some other escape characters which might additionally need to be escaped in the preceding context(s). Furthermore there can be things like character encoding (the most insidious is utf-8 because it look like ASCII for common characters, but might be optionally interpreted even by the terminal depending on its settings so it might behave differently, then the encoding attribute of HTML/XML, it's necessary to understand the process precisely right.

E.g. A regexp in the command line starting with perl -npe, needs to be transferred to a set of exec system calls connecting as pipe the file handles, each of this exec system calls just has a list of arguments that were separated by (non escaped)spaces, and possibly pipes(|) and redirection (> N> N>&M), parenthesis, interactive expansion of * and ?, $(()) ... (all this are special characters used by the *sh which might appear to interfere with the character of the regular expression in the next context, but they are evaluated in order: before the command line. The command line is read by a program as bash/sh/csh/tcsh/zsh, essentially inside double quote or single quote the escape is simpler but it is not necessary to quote a string in the command line because mostly the space has to be prefixed with backslash and the quote are not necessary leaving available the expand functionality for characters * and ?, but this parse as different context as within quote. Then when the command line is evaluated the regexp obtained in memory (not as written in the command line) receives the same treatment as it would be in a source file. For regexp there is character-set context within square brackets [ ], perl regular expression can be quoted by a large set of non alfa-numeric characters (E.g. m// or m:/better/for/path: ...).

You have more details about characters in other answer, which are very specific to the final regexp context. As I noted you mention that you find the regexp escape with attempts, that's probably because different context has different set of character that confused your memory of attempts (often backslash is the character used in those different context to escape a literal character instead of its function).


POSIX recognizes multiple variations on regular expressions - basic regular expressions (BRE) and extended regular expressions (ERE). And even then, there are quirks because of the historical implementations of the utilities standardized by POSIX.

There isn't a simple rule for when to use which notation, or even which notation a given command uses.

Check out Jeff Friedl's Mastering Regular Expressions book.


Unfortunately, the meaning of things like ( and \( are swapped between Emacs style regular expressions and most other styles. So if you try to escape these you may be doing the opposite of what you want.

So you really have to know what style you are trying to quote.


For Ionic (Typescript) you have to double slash in order to scape the characters. For example (this is to match some special characters):

"^(?=.*[\\]\\[!¡\'=ªº\\-\\_ç@#$%^&*(),;\\.?\":{}|<>\+\\/])"

Pay attention to this ] [ - _ . / characters. They have to be double slashed. If you don't do that, you are going to have a type error in your code.


https://perldoc.perl.org/perlre.html#Quoting-metacharacters and https://perldoc.perl.org/functions/quotemeta.html

In the official documentation, such characters are called metacharacters. Example of quoting:

my $regex = quotemeta($string)
s/$regex/something/

Unfortunately, the meaning of things like ( and \( are swapped between Emacs style regular expressions and most other styles. So if you try to escape these you may be doing the opposite of what you want.

So you really have to know what style you are trying to quote.


For PHP, "it is always safe to precede a non-alphanumeric with "\" to specify that it stands for itself." - http://php.net/manual/en/regexp.reference.escape.php.

Except if it's a " or '. :/

To escape regex pattern variables (or partial variables) in PHP use preg_quote()


Maybe an old thread, but this code might be useful to visitors who want to create without regex

def listToString(s):  
    
    # initialize an empty string 
    str1 = "" 
    
    # return string   
    return (str1.join(s))


r = "Hello! How are you? *Smiling_Face* *Heart* erwer"
r1 = list(r)
i = 0
r2 = list()
start = True

for string in r1:
    if string == "*":
        if(start):
            start = False
        else:
            start = True
    else:
        if(start):
            r2.append(string)
        else:
            print("skipped" + string)
            
 
print(listToString(r2))

Sometimes simple escaping is not possible with the characters you've listed. For example, using a backslash to escape a bracket isn't going to work in the left hand side of a substitution string in sed, namely

sed -e 's/foo\(bar/something_else/'

I tend to just use a simple character class definition instead, so the above expression becomes

sed -e 's/foo[(]bar/something_else/'

which I find works for most regexp implementations.

BTW Character classes are pretty vanilla regexp components so they tend to work in most situations where you need escaped characters in regexps.

Edit: After the comment below, just thought I'd mention the fact that you also have to consider the difference between finite state automata and non-finite state automata when looking at the behaviour of regexp evaluation.

You might like to look at "the shiny ball book" aka Effective Perl (sanitised Amazon link), specifically the chapter on regular expressions, to get a feel for then difference in regexp engine evaluation types.

Not all the world's a PCRE!

Anyway, regexp's are so clunky compared to SNOBOL! Now that was an interesting programming course! Along with the one on Simula.

Ah the joys of studying at UNSW in the late '70's! (-:


Maybe an old thread, but this code might be useful to visitors who want to create without regex

def listToString(s):  
    
    # initialize an empty string 
    str1 = "" 
    
    # return string   
    return (str1.join(s))


r = "Hello! How are you? *Smiling_Face* *Heart* erwer"
r1 = list(r)
i = 0
r2 = list()
start = True

for string in r1:
    if string == "*":
        if(start):
            start = False
        else:
            start = True
    else:
        if(start):
            r2.append(string)
        else:
            print("skipped" + string)
            
 
print(listToString(r2))

Really, there isn't. there are about a half-zillion different regex syntaxes; they seem to come down to Perl, EMACS/GNU, and AT&T in general, but I'm always getting surprised too.


Modern RegEx Flavors (PCRE)

Includes C, C++, Delphi, EditPad, Java, JavaScript, Perl, PHP (preg), PostgreSQL, PowerGREP, PowerShell, Python, REALbasic, Real Studio, Ruby, TCL, VB.Net, VBScript, wxWidgets, XML Schema, Xojo, XRegExp.
PCRE compatibility may vary

    Anywhere: . ^ $ * + - ? ( ) [ ] { } \ |


Legacy RegEx Flavors (BRE/ERE)

Includes awk, ed, egrep, emacs, GNUlib, grep, PHP (ereg), MySQL, Oracle, R, sed.
PCRE support may be enabled in later versions or by using extensions

ERE/awk/egrep/emacs

    Outside a character class: . ^ $ * + ? ( ) [ { } \ |
    Inside a character class: ^ - [ ]

BRE/ed/grep/sed

    Outside a character class: . ^ $ * [ \
    Inside a character class: ^ - [ ]
    For literals, don't escape: + ? ( ) { } |
    For standard regex behavior, escape: \+ \? \( \) \{ \} \|


Notes

  • If unsure about a specific character, it can be escaped like \xFF
  • Alphanumeric characters cannot be escaped with a backslash
  • Arbitrary symbols can be escaped with a backslash in PCRE, but not BRE/ERE (they must only be escaped when required). For PCRE ] - only need escaping within a character class, but I kept them in a single list for simplicity
  • Quoted expression strings must also have the surrounding quote characters escaped, and often with backslashes doubled-up (like "(\")(/)(\\.)" versus /(")(\/)(\.)/ in JavaScript)
  • Aside from escapes, different regex implementations may support different modifiers, character classes, anchors, quantifiers, and other features. For more details, check out regular-expressions.info, or use regex101.com to test your expressions live

Really, there isn't. there are about a half-zillion different regex syntaxes; they seem to come down to Perl, EMACS/GNU, and AT&T in general, but I'm always getting surprised too.


https://perldoc.perl.org/perlre.html#Quoting-metacharacters and https://perldoc.perl.org/functions/quotemeta.html

In the official documentation, such characters are called metacharacters. Example of quoting:

my $regex = quotemeta($string)
s/$regex/something/

POSIX recognizes multiple variations on regular expressions - basic regular expressions (BRE) and extended regular expressions (ERE). And even then, there are quirks because of the historical implementations of the utilities standardized by POSIX.

There isn't a simple rule for when to use which notation, or even which notation a given command uses.

Check out Jeff Friedl's Mastering Regular Expressions book.


Unfortunately there really isn't a set set of escape codes since it varies based on the language you are using.

However, keeping a page like the Regular Expression Tools Page or this Regular Expression Cheatsheet can go a long way to help you quickly filter things out.


POSIX recognizes multiple variations on regular expressions - basic regular expressions (BRE) and extended regular expressions (ERE). And even then, there are quirks because of the historical implementations of the utilities standardized by POSIX.

There isn't a simple rule for when to use which notation, or even which notation a given command uses.

Check out Jeff Friedl's Mastering Regular Expressions book.


Sometimes simple escaping is not possible with the characters you've listed. For example, using a backslash to escape a bracket isn't going to work in the left hand side of a substitution string in sed, namely

sed -e 's/foo\(bar/something_else/'

I tend to just use a simple character class definition instead, so the above expression becomes

sed -e 's/foo[(]bar/something_else/'

which I find works for most regexp implementations.

BTW Character classes are pretty vanilla regexp components so they tend to work in most situations where you need escaped characters in regexps.

Edit: After the comment below, just thought I'd mention the fact that you also have to consider the difference between finite state automata and non-finite state automata when looking at the behaviour of regexp evaluation.

You might like to look at "the shiny ball book" aka Effective Perl (sanitised Amazon link), specifically the chapter on regular expressions, to get a feel for then difference in regexp engine evaluation types.

Not all the world's a PCRE!

Anyway, regexp's are so clunky compared to SNOBOL! Now that was an interesting programming course! Along with the one on Simula.

Ah the joys of studying at UNSW in the late '70's! (-:


Unfortunately, the meaning of things like ( and \( are swapped between Emacs style regular expressions and most other styles. So if you try to escape these you may be doing the opposite of what you want.

So you really have to know what style you are trying to quote.


Sometimes simple escaping is not possible with the characters you've listed. For example, using a backslash to escape a bracket isn't going to work in the left hand side of a substitution string in sed, namely

sed -e 's/foo\(bar/something_else/'

I tend to just use a simple character class definition instead, so the above expression becomes

sed -e 's/foo[(]bar/something_else/'

which I find works for most regexp implementations.

BTW Character classes are pretty vanilla regexp components so they tend to work in most situations where you need escaped characters in regexps.

Edit: After the comment below, just thought I'd mention the fact that you also have to consider the difference between finite state automata and non-finite state automata when looking at the behaviour of regexp evaluation.

You might like to look at "the shiny ball book" aka Effective Perl (sanitised Amazon link), specifically the chapter on regular expressions, to get a feel for then difference in regexp engine evaluation types.

Not all the world's a PCRE!

Anyway, regexp's are so clunky compared to SNOBOL! Now that was an interesting programming course! Along with the one on Simula.

Ah the joys of studying at UNSW in the late '70's! (-:


Modern RegEx Flavors (PCRE)

Includes C, C++, Delphi, EditPad, Java, JavaScript, Perl, PHP (preg), PostgreSQL, PowerGREP, PowerShell, Python, REALbasic, Real Studio, Ruby, TCL, VB.Net, VBScript, wxWidgets, XML Schema, Xojo, XRegExp.
PCRE compatibility may vary

    Anywhere: . ^ $ * + - ? ( ) [ ] { } \ |


Legacy RegEx Flavors (BRE/ERE)

Includes awk, ed, egrep, emacs, GNUlib, grep, PHP (ereg), MySQL, Oracle, R, sed.
PCRE support may be enabled in later versions or by using extensions

ERE/awk/egrep/emacs

    Outside a character class: . ^ $ * + ? ( ) [ { } \ |
    Inside a character class: ^ - [ ]

BRE/ed/grep/sed

    Outside a character class: . ^ $ * [ \
    Inside a character class: ^ - [ ]
    For literals, don't escape: + ? ( ) { } |
    For standard regex behavior, escape: \+ \? \( \) \{ \} \|


Notes

  • If unsure about a specific character, it can be escaped like \xFF
  • Alphanumeric characters cannot be escaped with a backslash
  • Arbitrary symbols can be escaped with a backslash in PCRE, but not BRE/ERE (they must only be escaped when required). For PCRE ] - only need escaping within a character class, but I kept them in a single list for simplicity
  • Quoted expression strings must also have the surrounding quote characters escaped, and often with backslashes doubled-up (like "(\")(/)(\\.)" versus /(")(\/)(\.)/ in JavaScript)
  • Aside from escapes, different regex implementations may support different modifiers, character classes, anchors, quantifiers, and other features. For more details, check out regular-expressions.info, or use regex101.com to test your expressions live

Really, there isn't. there are about a half-zillion different regex syntaxes; they seem to come down to Perl, EMACS/GNU, and AT&T in general, but I'm always getting surprised too.


For Ionic (Typescript) you have to double slash in order to scape the characters. For example (this is to match some special characters):

"^(?=.*[\\]\\[!¡\'=ªº\\-\\_ç@#$%^&*(),;\\.?\":{}|<>\+\\/])"

Pay attention to this ] [ - _ . / characters. They have to be double slashed. If you don't do that, you are going to have a type error in your code.


Sometimes simple escaping is not possible with the characters you've listed. For example, using a backslash to escape a bracket isn't going to work in the left hand side of a substitution string in sed, namely

sed -e 's/foo\(bar/something_else/'

I tend to just use a simple character class definition instead, so the above expression becomes

sed -e 's/foo[(]bar/something_else/'

which I find works for most regexp implementations.

BTW Character classes are pretty vanilla regexp components so they tend to work in most situations where you need escaped characters in regexps.

Edit: After the comment below, just thought I'd mention the fact that you also have to consider the difference between finite state automata and non-finite state automata when looking at the behaviour of regexp evaluation.

You might like to look at "the shiny ball book" aka Effective Perl (sanitised Amazon link), specifically the chapter on regular expressions, to get a feel for then difference in regexp engine evaluation types.

Not all the world's a PCRE!

Anyway, regexp's are so clunky compared to SNOBOL! Now that was an interesting programming course! Along with the one on Simula.

Ah the joys of studying at UNSW in the late '70's! (-:


For PHP, "it is always safe to precede a non-alphanumeric with "\" to specify that it stands for itself." - http://php.net/manual/en/regexp.reference.escape.php.

Except if it's a " or '. :/

To escape regex pattern variables (or partial variables) in PHP use preg_quote()