What exactly do "u" and "r" string flags do, and what are raw string literals?

Question

While asking this question, I realized I didn't know much about raw strings. For somebody claiming to be a Django trainer, this sucks.

I know what an encoding is, and I know what u'' alone does, since I get what Unicode is.

But what does r'' do exactly? What kind of string does it result in?

And above all, what the heck does ur'' do?

Finally, is there any reliable way to go back from a Unicode string to a simple raw string?

Ah, and by the way, if your system and your text editor charset are set to UTF-8, does u'' actually do anything?

User · Answer

Unicode string literals

Unicode string literals (string literals prefixed by u) are no longer used in Python 3. They are still valid but just for compatibility purposes with Python 2.
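A quick way to check this in Python 3 is the minimal sketch below. The variable names are only illustrative.

```python
# Python 3: the u prefix is accepted but changes nothing.
plain = 'hello'
prefixed = u'hello'

print(type(plain) is type(prefixed))  # True: both are str
print(plain == prefixed)              # True: same value
```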

Raw string literals

If you want to create a string literal consisting only of easily typable characters such as English letters or digits, you can simply type them: 'hello world'. But if you also want to include more exotic characters, you have to use a workaround. One such workaround is escape sequences. With them you can, for example, represent a new line in your string by adding the two easily typable characters \n to your string literal. So when you print the 'hello\nworld' string, the words will be printed on separate lines. That's very handy!

On the other hand, there are some situations when you want to create a string literal that contains escape sequences but you don't want them to be interpreted by Python. You want them to be raw. Look at these examples:

'New updates are ready in c:\windows\updates\new'
'In this lesson we will learn what the \n escape sequence does.'

In such situations you can just prefix the string literal with the r character like this: r'hello\nworld' and no escape sequences will be interpreted by Python. The string will be printed exactly as you created it.
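The difference can be seen by comparing the two spellings side by side. This is a small sketch in Python 3, and the names normal and raw are just for illustration.

```python
normal = 'hello\nworld'   # \n is interpreted as one newline character
raw = r'hello\nworld'     # \n stays as two characters: a backslash and an n

print(normal)             # printed on two lines
print(raw)                # printed on one line, backslash included
print(len(normal), len(raw))  # 11 12
```

The raw literal is one character longer because the backslash and the n are both kept.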

Raw string literals are not completely "raw"?

Many people expect raw string literals to be raw in the sense that "anything placed between the quotes is ignored by Python". That is not true. Python still recognizes all the escape sequences; it just does not interpret them and leaves them unchanged instead. This means that raw string literals still have to be valid string literals.

From the lexical definition of a string literal:

string     ::=  "'" stringitem* "'"
stringitem ::=  stringchar | escapeseq
stringchar ::=  <any source character except "\" or newline or the quote>
escapeseq  ::=  "\" <any source character>

It is clear that string literals (raw or not) containing a bare quote character: 'hello'world' or ending with a backslash: 'hello world\' are not valid.
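One way to check these rules without breaking your own source file is to hand the literal to compile() as text. The sketch below assumes Python 3, and is_valid_literal is a hypothetical helper written only for this illustration.

```python
def is_valid_literal(source):
    """Return True if `source` compiles as a Python expression.

    Hypothetical helper for this example only.
    """
    try:
        compile(source, '<example>', 'eval')
        return True
    except SyntaxError:
        return False

# Each argument holds the *source text* of a literal. Backslashes are doubled
# here so that this example file itself stays valid.
print(is_valid_literal("r'hello world\\'"))    # False: source is r'hello world\'
print(is_valid_literal("r'hello world\\\\'"))  # True:  source is r'hello world\\'
print(is_valid_literal("'hello'world'"))       # False: bare quote inside

# A backslash before a quote is allowed in a raw literal,
# but the backslash stays in the resulting string.
escaped_quote = r'hello\'world'
print(len(escaped_quote))                      # 12
```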

User · Answer

There are two types of string in Python 2: the traditional str type and the newer unicode type. If you type a string literal without the u in front you get the old str type, which stores 8-bit characters, and with the u in front you get the newer unicode type, which can store any Unicode character.

The r doesn't change the type at all, it just changes how the string literal is interpreted. Without the r, backslashes are treated as escape characters. With the r, backslashes are treated as literal. Either way, the type is the same.

ur is of course a Unicode string where backslashes are literal backslashes, not part of escape codes.

You can try to convert a Unicode string to an old string using the str() function, but if there are any Unicode characters that cannot be represented in the old string, you will get an exception. You could replace them with question marks first if you wish, but of course this would cause those characters to be unreadable. It is not recommended to use the str type if you want to correctly handle Unicode characters.
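The answer above describes Python 2. In Python 3 the closest analogue is converting a str (Unicode) to bytes with encode(), and the same failure and the same question-mark fallback can be sketched as follows. The sample text is an assumption chosen for illustration.

```python
text = 'caf\u00e9'  # 'café', written with an escape so this file stays ASCII

# Strict encoding fails when a character has no ASCII representation.
try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    print('cannot encode:', exc.reason)

# errors='replace' substitutes a question mark, losing the original character.
lossy = text.encode('ascii', errors='replace')
print(lossy)  # b'caf?'
```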

User · Answer

"Raw string" means it is stored as it appears. For example, \ is just a backslash instead of an escape character.

User · Answer

A "u" prefix denotes the value has type unicode rather than str.

Raw string literals, with an "r" prefix, escape any escape sequences within them, so len(r"\n") is 2. Because they escape escape sequences, you cannot end a string literal with a single backslash: that's not a valid escape sequence (e.g. r"\").

"Raw" is not part of the type, it's merely one way to represent the value. For example, "\\n" and r"\n" are identical values, just like 32, 0x20, and 0b100000 are identical.

You can have unicode raw string literals:

>>> u = ur"\n"
>>> print type(u), len(u)
<type 'unicode'> 2

The source file encoding just determines how to interpret the source file, it doesn't affect expressions or types otherwise. However, it's recommended to avoid code where an encoding other than ASCII would change the meaning:

"Files using ASCII (or UTF-8, for Python 3.0) should not have a coding cookie. Latin-1 (or UTF-8) should only be used when a comment or docstring needs to mention an author name that requires Latin-1; otherwise, using \x, \u or \U escapes is the preferred way to include non-ASCII data in string literals."
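The point that "raw" is only a spelling, not a property of the value, can be checked directly. The following is a small Python 3 sketch with illustrative variable names.

```python
escaped = "\\n"   # normal literal with a doubled backslash
raw = r"\n"       # raw literal

# Two spellings, one value: nothing about "raw" is stored in the object.
print(escaped == raw)              # True
print(type(escaped) is type(raw))  # True
print(len(raw))                    # 2

# The same idea with integer literals.
print(32 == 0x20 == 0b100000)      # True

# \x, \u and \U escapes produce non-ASCII data from ASCII-only source.
e_acute = "\u00e9"
print("\xe9" == e_acute == "\U000000e9")  # True
```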

User · Answer

Let me explain it simply: in Python 2, you can store a string in 2 different types.

The first one is the str type, a byte string: it uses 1 byte of memory per character (256 possible values, enough for mostly English letters and simple symbols).

The 2nd type is the unicode type. Unicode stores all types of languages.

By default, Python 2 will prefer the str type, but if you want to store a string in the unicode type you can put u in front of the text like u'text', or you can do this by calling unicode('text').

So u works like a shorthand for converting str to unicode. That's it!

Now the r part: you put it in front of the text to tell the computer that the text is raw text, so a backslash should not be an escaping character. r'\n' will not create a new line character. It's just plain text containing 2 characters.

If you want to convert str to unicode and also put raw text in there, use ur, because ru will raise an error.

NOW, the important part:

You cannot store one backslash by using r; it's the only exception. More precisely, a raw literal cannot end in an odd number of backslashes. So this code will produce an error: r'\'

To store a backslash (only one) you need to use '\\'

If you want to store more than one backslash you can still use r: r'\\' will produce 2 backslashes, as you expected.

This is not a bug: even in a raw literal, a backslash right before a quote keeps that quote from terminating the literal, so r'\' never ends.
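For readers on Python 3, the prefix rules differ from the Python 2 behaviour described above: ur is no longer accepted at all, while the bytes variants rb and br are both valid. A minimal sketch to check this follows. prefix_is_accepted is a hypothetical helper for this example only.

```python
single = '\\'    # one backslash
double = r'\\'   # two backslashes
print(len(single), len(double))  # 1 2

def prefix_is_accepted(prefix):
    """Return True if prefix + 'text' compiles as a literal (hypothetical helper)."""
    try:
        compile(prefix + "'text'", '<example>', 'eval')
        return True
    except SyntaxError:
        return False

# Python 3 results; Python 2 accepted ur and rejected ru instead.
accepted = {p: prefix_is_accepted(p) for p in ('u', 'r', 'ur', 'ru', 'rb', 'br')}
print(accepted)
```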

User · Answer

There's not really any "raw string"; there are raw string literals, which are exactly the string literals marked by an 'r' before the opening quote.

A "raw string literal" is a slightly different syntax for a string literal, in which a backslash, \, is taken as meaning "just a backslash" (except when it comes right before a quote that would otherwise terminate the literal): no "escape sequences" to represent newlines, tabs, backspaces, form-feeds, and so on. In normal string literals, each backslash must be doubled up to avoid being taken as the start of an escape sequence.

This syntax variant exists mostly because the syntax of regular expression patterns is heavy with backslashes (but never at the end, so the "except" clause above doesn't matter) and it looks a bit better when you avoid doubling up each of them; that's all. It also gained some popularity to express native Windows file paths (with backslashes instead of regular slashes like on other platforms), but that's very rarely needed (since normal slashes mostly work fine on Windows too) and imperfect (due to the "except" clause above).

r'...' is a byte string (in Python 2.*), ur'...' is a Unicode string (again, in Python 2.*), and any of the other three kinds of quoting also produces exactly the same types of strings (so for example r'...', r'''...''', r"...", r"""...""" are all byte strings, and so on).

Not sure what you mean by "going back": there are no intrinsic back and forward directions, because there's no raw string type; it's just an alternative syntax to express perfectly normal string objects, byte or unicode as they may be.

And yes, in Python 2.*, u'...' is of course always distinct from just '...': the former is a unicode string, the latter is a byte string. What encoding the literal might be expressed in is a completely orthogonal issue.

E.g., consider (Python 2.6):

>>> import sys
>>> sys.getsizeof('ciao')
28
>>> sys.getsizeof(u'ciao')
34

The Unicode object of course takes more memory space (very small difference for a very short string, obviously ;-).
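The two use cases mentioned in this answer, regular expressions and Windows paths, can be sketched like this in Python 3. The pattern and the path are made-up examples, and PureWindowsPath is used so the comparison behaves the same on any operating system.

```python
import re
from pathlib import PureWindowsPath

# The same regular expression, with and without doubled backslashes.
doubled = '\\d+\\.\\d+'
raw = r'\d+\.\d+'
print(doubled == raw)                  # True: identical pattern strings

match = re.fullmatch(raw, '3.14')
print(match is not None)               # True

# Forward slashes describe the same Windows path, no raw literal needed.
with_slashes = PureWindowsPath('C:/Windows/System32')
with_backslashes = PureWindowsPath(r'C:\Windows\System32')
print(with_slashes == with_backslashes)  # True
```

The raw spelling only saves typing; the compiled pattern is the same either way.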

User · Answer

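Maybe this is obvious, maybe not, but you can make the single-backslash string by calling x = chr(92) (Python 2):

x = chr(92)
print type(x), len(x)   # <type 'str'> 1
y = '\\'
print type(y), len(y)   # <type 'str'> 1
x == y   # True
x is y   # False for the original poster; object identity is implementation-dependent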
