Python str vs unicode types

Question

Working with Python 2.7, I'm wondering what real advantage there is in using the type unicode instead of str, as both of them seem to be able to hold Unicode strings. Is there any special reason apart from being able to set Unicode codes in unicode strings using the escape char \?

Executing a module with:

# -*- coding: utf-8 -*-

a = 'á'
ua = u'á'
print a, ua

Results in:

á á

EDIT:

More testing using the Python shell:

>>> a = 'á'
>>> a
'\xc3\xa1'
>>> ua = u'á'
>>> ua
u'\xe1'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> ua
u'\xe1'

So, the unicode string seems to be encoded using latin1 instead of utf-8, and the raw string is encoded using utf-8? I'm even more confused now!

User · Answer

Unicode and encodings are completely different, unrelated things.

Unicode

Assigns a numeric ID to each character:

0x41 → A
0xE1 → á
0x414 → Д

So, Unicode assigns the number 0x41 to A, 0xE1 to á, and 0x414 to Д.

Even the little arrow → I used has its own Unicode number: 0x2192. And even emojis have their Unicode numbers: 😂 is 0x1F602.

You can look up the Unicode numbers of all characters in this table. In particular, you can find the first three characters above here, the arrow here, and the emoji here.

These numbers assigned to all characters by Unicode are called code points.

The purpose of all this is to provide a means to refer unambiguously to each character. For example, if I'm talking about 😂, instead of saying "you know, this laughing emoji with tears", I can just say "Unicode code point 0x1F602". Easier, right?

Note that Unicode code points are usually formatted with a leading U+, then the hexadecimal numeric value padded to at least 4 digits. So, the above examples would be U+0041, U+00E1, U+0414, U+2192, U+1F602.
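The U+ notation is easy to reproduce in code. Here is a minimal sketch; format_code_point is a hypothetical helper name chosen for illustration, and the characters are written as escapes so the snippet does not depend on the encoding of the source file:

```python
def format_code_point(cp):
    """Format an integer code point as U+XXXX, padded to at least 4 hex digits."""
    return 'U+%04X' % cp

# U+0041 (A), U+00E1 (a with acute), U+0414 (Cyrillic De), U+2192 (arrow)
labels = [format_code_point(ord(ch)) for ch in u'A\xe1\u0414\u2192']

# Code points above U+FFFF simply use more than four digits
emoji_label = format_code_point(0x1F602)

print(labels)
print(emoji_label)
```

The emoji is passed as a plain integer rather than through ord(), because ord() on a character above U+FFFF can fail on some Python 2 builds.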

Unicode code points range from U+0000 to U+10FFFF. That is 1,114,112 numbers. 2048 of these numbers are used for surrogates, thus, there remain 1,112,064. This means, Unicode can assign a unique ID (code point) to 1,112,064 distinct characters. Not all of these code points are assigned to a character yet, and Unicode is extended continuously (for example, when new emojis are introduced).
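The arithmetic behind these figures can be checked in a few lines. This sketch assumes the surrogate range is U+D800 to U+DFFF, which is where the 2048 comes from:

```python
highest = 0x10FFFF
total = highest + 1                # U+0000 .. U+10FFFF inclusive
surrogates = 0xDFFF - 0xD800 + 1   # surrogate range U+D800 .. U+DFFF
usable = total - surrogates

print(total)       # 1114112
print(surrogates)  # 2048
print(usable)      # 1112064
```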

The important thing to remember is that all Unicode does is to assign a numerical ID, called code point, to each character for easy and unambiguous reference.

Encodings

Map characters to bit patterns.

These bit patterns are used to represent the characters in computer memory or on disk.

There are many different encodings that cover different subsets of characters. In the English-speaking world, the most common encodings are the following:

ASCII

Maps 128 characters (code points U+0000 to U+007F) to bit patterns of length 7.

Example:

a → 1100001 (0x61)

You can see all the mappings in this table.

ISO 8859-1 (aka Latin-1)

Maps 191 characters (code points U+0020 to U+007E and U+00A0 to U+00FF) to bit patterns of length 8.

Example:

a → 01100001 (0x61)
á → 11100001 (0xE1)

You can see all the mappings in this table.

UTF-8

Maps 1,112,064 characters (all existing Unicode code points) to bit patterns of either length 8, 16, 24, or 32 bits (that is, 1, 2, 3, or 4 bytes).

Example:

a → 01100001 (0x61)
á → 11000011 10100001 (0xC3 0xA1)
≠ → 11100010 10001001 10100000 (0xE2 0x89 0xA0)
😂 → 11110000 10011111 10011000 10000010 (0xF0 0x9F 0x98 0x82)

The way UTF-8 encodes characters to bit strings is very well described here.
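The bit patterns listed for ASCII, Latin-1, and UTF-8 can be reproduced with a small helper. This is only a sketch: bits is a hypothetical function name, and it always prints whole bytes, so the 7-bit ASCII pattern appears with a leading zero.

```python
def bits(text, encoding):
    """Return the encoded bytes of `text` as space-separated 8-bit groups."""
    encoded = text.encode(encoding)
    # bytearray yields integers on both Python 2 and Python 3
    return ' '.join(format(byte, '08b') for byte in bytearray(encoded))

print(bits(u'a', 'ascii'))            # 01100001
print(bits(u'\xe1', 'latin-1'))       # 11100001
print(bits(u'\xe1', 'utf-8'))         # 11000011 10100001
print(bits(u'\u2260', 'utf-8'))       # 11100010 10001001 10100000
print(bits(u'\U0001F602', 'utf-8'))   # 11110000 10011111 10011000 10000010
```

The same character (U+00E1) produces one byte in Latin-1 and two bytes in UTF-8, which is the distinction the rest of this answer builds on.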

Unicode and Encodings

Looking at the above examples, it becomes clear how Unicode is useful.

For example, if I'm Latin-1 and I want to explain my encoding of á, I don't need to say:

"I encode that a with an aigu (or however you call that rising bar) as 11100001"

But I can just say:

"I encode U+00E1 as 11100001"

And if I'm UTF-8, I can say:

"Me, in turn, I encode U+00E1 as 11000011 10100001"

And it's unambiguously clear to everybody which character we mean.

Now to the often arising confusion

It's true that sometimes the bit pattern of an encoding, if you interpret it as a binary number, is the same as the Unicode code point of this character.

For example:

ASCII encodes a as 1100001, which you can interpret as the hexadecimal number 0x61, and the Unicode code point of a is U+0061.
Latin-1 encodes á as 11100001, which you can interpret as the hexadecimal number 0xE1, and the Unicode code point of á is U+00E1.

Of course, this has been arranged like this on purpose for convenience. But you should look at it as a pure coincidence. The bit pattern used to represent a character in memory is not tied in any way to the Unicode code point of this character.

Nobody even says that you have to interpret a bit string like 11100001 as a binary number. Just look at it as the sequence of bits that Latin-1 uses to encode the character á.
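A quick way to see that the match is not a general rule is to compare the code point with the encoded bytes for more than one character. The sketch below adds the euro sign and the Windows-1252 encoding, neither of which appears in the question; they are my own choice for illustration. encoded_hex is a hypothetical helper name.

```python
def encoded_hex(ch, encoding):
    """Return the encoded bytes of `ch` as space-separated hex pairs."""
    return ' '.join('%02X' % b for b in bytearray(ch.encode(encoding)))

a_acute = u'\xe1'   # U+00E1
euro = u'\u20ac'    # U+20AC

# Latin-1 happens to store U+00E1 as the byte E1; UTF-8 does not.
print('U+%04X' % ord(a_acute), encoded_hex(a_acute, 'latin-1'), encoded_hex(a_acute, 'utf-8'))

# For U+20AC, no encoding shown here produces bytes that look like 20AC.
print('U+%04X' % ord(euro), encoded_hex(euro, 'cp1252'), encoded_hex(euro, 'utf-8'))
```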

Back to your question

The encoding in play in your examples is UTF-8: your terminal is configured for it, and your module declares it in its coding line.

Here's what's going on in your examples:

Example 1

The following encodes the character á in UTF-8. This results in the bit string 11000011 10100001, which is saved in the variable a.

>>> a = 'á'

When you look at the value of a, its content 11000011 10100001 is formatted as the hex number 0xC3 0xA1 and output as '\xc3\xa1':

>>> a
'\xc3\xa1'

Example 2

The following saves the Unicode code point of á, which is U+00E1, in the variable ua (we don't know which data format Python uses internally to represent the code point U+00E1 in memory, and it's unimportant to us):

>>> ua = u'á'

When you look at the value of ua, Python tells you that it contains the code point U+00E1:

>>> ua
u'\xe1'

Example 3

The following encodes Unicode code point U+00E1 (representing character á) with UTF-8, which results in the bit pattern 11000011 10100001. Again, for output this bit pattern is represented as the hex number 0xC3 0xA1:

>>> ua.encode('utf-8')
'\xc3\xa1'

Example 4

The following encodes Unicode code point U+00E1 (representing character á) with Latin-1, which results in the bit pattern 11100001. For output, this bit pattern is represented as the hex number 0xE1, which by coincidence is the same as the initial code point U+00E1:

>>> ua.encode('latin1')
'\xe1'

There's no relation between the Unicode object ua and the Latin-1 encoding. That the code point of á is U+00E1 and the Latin-1 encoding of á is 0xE1 (if you interpret the bit pattern of the encoding as a binary number) is a pure coincidence.
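The four examples can be tied together in one round trip. This is a sketch rather than a transcript of the original session: the byte string is written as a b'...' literal with escapes, so it should behave the same on Python 2.7 (where b'...' is a str) and on Python 3 (where it is bytes and u'...' is a str).

```python
a = b'\xc3\xa1'   # the two UTF-8 bytes from Example 1
ua = u'\xe1'      # the single code point U+00E1 from Example 2

decoded = a.decode('utf-8')        # bytes -> code points
utf8_bytes = ua.encode('utf-8')    # Example 3
latin1_bytes = ua.encode('latin-1')  # Example 4
lengths = (len(a), len(ua))

print(decoded == ua)       # True
print(utf8_bytes == a)     # True
print(lengths)             # (2, 1)
```

Decoding a with UTF-8 gives back exactly ua, which shows that a is one particular encoding of the text, while ua is the text itself.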

User · Answer

Your terminal happens to be configured to UTF-8.

The fact that printing a works is a coincidence; you are writing raw UTF-8 bytes to the terminal. a is a value of length two, containing two bytes, hex values C3 and A1, while ua is a unicode value of length one, containing a codepoint U+00E1.

This difference in length is one major reason to use Unicode values; you cannot easily measure the number of text characters in a byte string; the len() of a byte string tells you how many bytes were used, not how many characters were encoded.

You can see the difference when you encode the unicode value to different output encodings:

>>> a = 'á'
>>> ua = u'á'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> a
'\xc3\xa1'

Note that the first 256 codepoints of the Unicode standard match the Latin 1 standard, so the U+00E1 codepoint is encoded to Latin 1 as a byte with hex value E1.

Furthermore, Python uses escape codes in representations of unicode and byte strings alike, and low code points that are not printable ASCII are represented using \x.. escape values as well. This is why a Unicode string with a code point between 128 and 255 looks just like the Latin 1 encoding. If you have a unicode string with codepoints beyond U+00FF, a different escape sequence, \u...., is used instead, with a four-digit hex value.

It looks like you don't yet fully understand what the difference is between Unicode and an encoding. Please do read the following articles before you continue:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
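The escape rules described in this answer can be observed without relying on how a particular Python version prints a string. The sketch below uses the unicode_escape codec as a stand-in for the repr() output; that substitution is my assumption for illustration, and the third sample (a code point above U+FFFF) goes beyond what the answer covers.

```python
samples = [u'\xe1', u'\u0414', u'\U0001F602']

# unicode_escape turns each non-ASCII code point into an ASCII escape sequence
escaped = [s.encode('unicode_escape') for s in samples]

for raw in escaped:
    print(raw)
# U+00E1  -> \xe1        (two hex digits, looks like a Latin-1 byte)
# U+0414  -> \u0414      (four hex digits)
# U+1F602 -> \U0001f602  (eight hex digits)
```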

User · Answer

When you define the string as unicode, a and á each count as one character. Otherwise á counts as two chars. Try len(a) and len(ua). In addition to that, you may need to know the encoding when you work with other environments. For example, if you use md5, you get different values for a and ua.
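The md5 point follows from the fact that hash functions operate on bytes, so the digest depends on which encoding produced those bytes. A minimal sketch using the standard hashlib module (the variable names are my own):

```python
import hashlib

ua = u'\xe1'  # U+00E1 (a with acute)

# The same text, hashed after encoding it two different ways
digest_utf8 = hashlib.md5(ua.encode('utf-8')).hexdigest()
digest_latin1 = hashlib.md5(ua.encode('latin-1')).hexdigest()

print(digest_utf8)
print(digest_latin1)
print(digest_utf8 == digest_latin1)  # False: different bytes, different hash
```

Both sides of an exchange therefore need to agree on an encoding before comparing hashes of text.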

User · Answer

unicode is meant to handle text  Text is a sequence of code points which may be bigger than a single byte  Text can be encoded in a specific encoding to represent the text as raw bytes e g  utf-8  latin-1       Note that unicode is not encoded  The internal representation used by python is an implementation detail  and you shouldn t care about it as long as it is able to represent the code points you want   On the contrary str in Python 2 is a plain sequence of bytes  It does not represent text   You can think of unicode as a general representation of some text  which can be encoded in many different ways into a sequence of binary data represented via str   Note  In Python 3  unicode was renamed to str and there is a new bytes type for a plain sequence of bytes   Some differences that you can see    gt  gt  gt  len u         a single code point 1  gt  gt  gt  len           by default utf-8 - gt  takes two bytes 2  gt  gt  gt  len u     encode  utf-8    2  gt  gt  gt  len u     encode  latin1       in latin1 it takes one byte 1  gt  gt  gt  print u     encode  utf-8      terminal encoding is utf-8     gt  gt  gt  print u     encode  latin1     it cannot understand the latin1 byte     Note that using str you have a lower-level control on the single bytes of a specific encoding representation  while using unicode you can only control at the code-point level  For example you can do    gt  gt  gt                 xc3 xa0 xc3 xa8 xc3 xac xc3 xb2 xc3 xb9   gt  gt  gt  print              replace   xa8                   What before was valid UTF-8  isn t anymore  Using a unicode string you cannot operate in such a way that the resulting string isn t valid unicode text  You can remove a code point  replace a code point with a different code point etc  but you cannot mess with the internal representation
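The byte-level replace from this answer can be reproduced in a way that also shows the damage explicitly. This is a sketch with the characters written as escapes; the variable names are my own.

```python
text = u'\xe0\xe8\xec\xf2\xf9'   # the five accented vowels, as code points
data = text.encode('utf-8')      # ten bytes: each vowel takes two

# Removing a single byte (the second half of the second vowel) is allowed on bytes...
broken = data.replace(b'\xa8', b'')

# ...but the result is no longer valid UTF-8.
try:
    broken.decode('utf-8')
    still_valid = True
except UnicodeDecodeError:
    still_valid = False

# Decoding with errors='replace' marks the damaged spot with U+FFFD.
repaired = broken.decode('utf-8', 'replace')

print(len(data), len(broken))   # 10 9
print(still_valid)              # False
```

Working on the unicode value instead, for example text.replace(u'\xe8', u''), removes the whole character and cannot leave a dangling byte behind.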
