[python] Process escape sequences in a string in Python

Sometimes when I get input from a file or the user, I get a string with escape sequences in it. I would like to process the escape sequences in the same way that Python processes escape sequences in string literals.

For example, let's say myString is defined as:

>>> myString = "spam\\neggs"
>>> print(myString)
spam\neggs

I want a function (I'll call it process) that does this:

>>> print(process(myString))
spam
eggs

It's important that the function can process all of the escape sequences in Python (listed in a table in the link above).

Does Python have a function to do this?

This question is related to python string escaping

The answer is


The correct thing to do is use the 'string-escape' code to decode the string.

>>> myString = "spam\\neggs"
>>> decoded_string = bytes(myString, "utf-8").decode("unicode_escape") # python3 
>>> decoded_string = myString.decode('string_escape') # python2
>>> print(decoded_string)
spam
eggs

Don't use the AST or eval. Using the string codecs is much safer.


The actually correct and convenient answer for python 3:

>>> import codecs
>>> myString = "spam\\neggs"
>>> print(codecs.escape_decode(bytes(myString, "utf-8"))[0].decode("utf-8"))
spam
eggs
>>> myString = "naïve \\t test"
>>> print(codecs.escape_decode(bytes(myString, "utf-8"))[0].decode("utf-8"))
naïve    test

Details regarding codecs.escape_decode:

  • codecs.escape_decode is a bytes-to-bytes decoder
  • codecs.escape_decode decodes ascii escape sequences, such as: b"\\n" -> b"\n", b"\\xce" -> b"\xce".
  • codecs.escape_decode does not care or need to know about the byte object's encoding, but the encoding of the escaped bytes should match the encoding of the rest of the object.

Background:

  • @rspeer is correct: unicode_escape is the incorrect solution for python3. This is because unicode_escape decodes escaped bytes, then decodes bytes to unicode string, but receives no information regarding which codec to use for the second operation.
  • @Jerub is correct: avoid the AST or eval.
  • I first discovered codecs.escape_decode from this answer to "how do I .decode('string-escape') in Python3?". As that answer states, that function is currently not documented for python 3.

unicode_escape doesn't work in general

It turns out that the string_escape or unicode_escape solution does not work in general -- particularly, it doesn't work in the presence of actual Unicode.

If you can be sure that every non-ASCII character will be escaped (and remember, anything beyond the first 128 characters is non-ASCII), unicode_escape will do the right thing for you. But if there are any literal non-ASCII characters already in your string, things will go wrong.

unicode_escape is fundamentally designed to convert bytes into Unicode text. But in many places -- for example, Python source code -- the source data is already Unicode text.

The only way this can work correctly is if you encode the text into bytes first. UTF-8 is the sensible encoding for all text, so that should work, right?

The following examples are in Python 3, so that the string literals are cleaner, but the same problem exists with slightly different manifestations on both Python 2 and 3.

>>> s = 'naïve \\t test'
>>> print(s.encode('utf-8').decode('unicode_escape'))
naïve   test

Well, that's wrong.

The new recommended way to use codecs that decode text into text is to call codecs.decode directly. Does that help?

>>> import codecs
>>> print(codecs.decode(s, 'unicode_escape'))
naïve   test

Not at all. (Also, the above is a UnicodeError on Python 2.)

The unicode_escape codec, despite its name, turns out to assume that all non-ASCII bytes are in the Latin-1 (ISO-8859-1) encoding. So you would have to do it like this:

>>> print(s.encode('latin-1').decode('unicode_escape'))
naïve    test

But that's terrible. This limits you to the 256 Latin-1 characters, as if Unicode had never been invented at all!

>>> print('Erno \\t Rubik'.encode('latin-1').decode('unicode_escape'))
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0151'
in position 3: ordinal not in range(256)

Adding a regular expression to solve the problem

(Surprisingly, we do not now have two problems.)

What we need to do is only apply the unicode_escape decoder to things that we are certain to be ASCII text. In particular, we can make sure only to apply it to valid Python escape sequences, which are guaranteed to be ASCII text.

The plan is, we'll find escape sequences using a regular expression, and use a function as the argument to re.sub to replace them with their unescaped value.

import re
import codecs

ESCAPE_SEQUENCE_RE = re.compile(r'''
    ( \\U........      # 8-digit hex escapes
    | \\u....          # 4-digit hex escapes
    | \\x..            # 2-digit hex escapes
    | \\[0-7]{1,3}     # Octal escapes
    | \\N\{[^}]+\}     # Unicode characters by name
    | \\[\\'"abfnrtv]  # Single-character escapes
    )''', re.UNICODE | re.VERBOSE)

def decode_escapes(s):
    def decode_match(match):
        return codecs.decode(match.group(0), 'unicode-escape')

    return ESCAPE_SEQUENCE_RE.sub(decode_match, s)

And with that:

>>> print(decode_escapes('Erno \\t Rubik'))
Erno     Rubik

The (currently) accepted answer by Jerub is correct for python2, but incorrect and may produce garbled results (as Apalala points out in a comment to that solution), for python3. That's because the unicode_escape codec requires its source to be coded in latin-1, not utf-8, as per the official python docs. Hence, in python3 use:

>>> myString="špåm\\nëðþ\\x73"
>>> print(myString)
špåm\nëðþ\x73
>>> decoded_string = myString.encode('latin-1','backslashreplace').decode('unicode_escape')
>>> print(decoded_string)
špåm
ëðþs

This method also avoids the extra unnecessary roundtrip between strings and bytes in metatoaster's comments to Jerub's solution (but hats off to metatoaster for recognizing the bug in that solution).


This is a bad way of doing it, but it worked for me when trying to interpret escaped octals passed in a string argument.

input_string = eval('b"' + sys.argv[1] + '"')

It's worth mentioning that there is a difference between eval and ast.literal_eval (eval being way more unsafe). See Using python's eval() vs. ast.literal_eval()?


Below code should work for \n is required to be displayed on the string.

import string

our_str = 'The String is \\n, \\n and \\n!'
new_str = string.replace(our_str, '/\\n', '/\n', 1)
print(new_str)

The ast.literal_eval function comes close, but it will expect the string to be properly quoted first.

Of course Python's interpretation of backslash escapes depends on how the string is quoted ("" vs r"" vs u"", triple quotes, etc) so you may want to wrap the user input in suitable quotes and pass to literal_eval. Wrapping it in quotes will also prevent literal_eval from returning a number, tuple, dictionary, etc.

Things still might get tricky if the user types unquoted quotes of the type you intend to wrap around the string.


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to string

How to split a string in two and store it in a field String method cannot be found in a main class method Kotlin - How to correctly concatenate a String Replacing a character from a certain index Remove quotes from String in Python Detect whether a Python string is a number or a letter How does String substring work in Swift How does String.Index work in Swift swift 3.0 Data to String? How to parse JSON string in Typescript

Examples related to escaping

Uses for the '"' entity in HTML Javascript - How to show escape characters in a string? How to print a single backslash? How to escape special characters of a string with single backslashes Saving utf-8 texts with json.dumps as UTF8, not as \u escape sequence Properly escape a double quote in CSV How to Git stash pop specific stash in 1.8.3? In Java, should I escape a single quotation mark (') in String (double quoted)? How do I escape a single quote ( ' ) in JavaScript? Which characters need to be escaped when using Bash?