I've never been sure that I understand the difference between str/unicode decode and encode.
I know that str().decode()
is for when you have a string of bytes that you know has a certain character encoding, given that encoding name it will return a unicode string.
I know that unicode().encode()
converts unicode chars into a string of bytes according to a given encoding name.
But I don't understand what str().encode()
and unicode().decode()
are for. Can anyone explain, and possibly also correct anything else I've gotten wrong above?
EDIT:
Several answers give info on what .encode
does on a string, but no-one seems to know what .decode
does for unicode.
This question is related to
python
string
unicode
character-encoding
python-2.x
There are a few encodings that can be used to de-/encode from str to str or from unicode to unicode. For example base64, hex or even rot13. They are listed in the codecs module.
Edit:
The decode message on a unicode string can undo the corresponding encode operation:
In [1]: u'0a'.decode('hex')
Out[1]: '\n'
The returned type is str instead of unicode which is unfortunate in my opinion. But when you are not doing a proper en-/decode between str and unicode this looks like a mess anyway.
mybytestring.encode(somecodec) is meaningful for these values of somecodec
:
I am not sure what decoding an already decoded unicode text is good for. Trying that with any encoding seems to always try to encode with the system's default encoding first.
anUnicode.encode('encoding') results in a string object and can be called on a unicode object
aString.decode('encoding') results in an unicode object and can be called on a string, encoded in given encoding.
Some more explanations:
You can create some unicode object, which doesn't have any encoding set. The way it is stored by Python in memory is none of your concern. You can search it, split it and call any string manipulating function you like.
But there comes a time, when you'd like to print your unicode object to console or into some text file. So you have to encode it (for example - in UTF-8), you call encode('utf-8') and you get a string with '\u<someNumber>' inside, which is perfectly printable.
Then, again - you'd like to do the opposite - read string encoded in UTF-8 and treat it as an Unicode, so the \u360 would be one character, not 5. Then you decode a string (with selected encoding) and get brand new object of the unicode type.
Just as a side note - you can select some pervert encoding, like 'zip', 'base64', 'rot' and some of them will convert from string to string, but I believe the most common case is one that involves UTF-8/UTF-16 and string.
There are a few encodings that can be used to de-/encode from str to str or from unicode to unicode. For example base64, hex or even rot13. They are listed in the codecs module.
Edit:
The decode message on a unicode string can undo the corresponding encode operation:
In [1]: u'0a'.decode('hex')
Out[1]: '\n'
The returned type is str instead of unicode which is unfortunate in my opinion. But when you are not doing a proper en-/decode between str and unicode this looks like a mess anyway.
To represent a unicode string as a string of bytes is known as encoding. Use u'...'.encode(encoding)
.
Example:
>>> u'æøå'.encode('utf8') '\xc3\x83\xc2\xa6\xc3\x83\xc2\xb8\xc3\x83\xc2\xa5' >>> u'æøå'.encode('latin1') '\xc3\xa6\xc3\xb8\xc3\xa5' >>> u'æøå'.encode('ascii') UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
You typically encode a unicode string whenever you need to use it for IO, for instance transfer it over the network, or save it to a disk file.
To convert a string of bytes to a unicode string is known as decoding. Use unicode('...', encoding)
or '...'.decode(encoding).
Example:
>>> u'æøå' u'\xc3\xa6\xc3\xb8\xc3\xa5' # the interpreter prints the unicode object like so >>> unicode('\xc3\xa6\xc3\xb8\xc3\xa5', 'latin1') u'\xc3\xa6\xc3\xb8\xc3\xa5' >>> '\xc3\xa6\xc3\xb8\xc3\xa5'.decode('latin1') u'\xc3\xa6\xc3\xb8\xc3\xa5'
You typically decode a string of bytes whenever you receive string data from the network or from a disk file.
I believe there are some changes in unicode handling in python 3, so the above is probably not correct for python 3.
Some good links:
mybytestring.encode(somecodec) is meaningful for these values of somecodec
:
I am not sure what decoding an already decoded unicode text is good for. Trying that with any encoding seems to always try to encode with the system's default encoding first.
There are a few encodings that can be used to de-/encode from str to str or from unicode to unicode. For example base64, hex or even rot13. They are listed in the codecs module.
Edit:
The decode message on a unicode string can undo the corresponding encode operation:
In [1]: u'0a'.decode('hex')
Out[1]: '\n'
The returned type is str instead of unicode which is unfortunate in my opinion. But when you are not doing a proper en-/decode between str and unicode this looks like a mess anyway.
mybytestring.encode(somecodec) is meaningful for these values of somecodec
:
I am not sure what decoding an already decoded unicode text is good for. Trying that with any encoding seems to always try to encode with the system's default encoding first.
anUnicode.encode('encoding') results in a string object and can be called on a unicode object
aString.decode('encoding') results in an unicode object and can be called on a string, encoded in given encoding.
Some more explanations:
You can create some unicode object, which doesn't have any encoding set. The way it is stored by Python in memory is none of your concern. You can search it, split it and call any string manipulating function you like.
But there comes a time, when you'd like to print your unicode object to console or into some text file. So you have to encode it (for example - in UTF-8), you call encode('utf-8') and you get a string with '\u<someNumber>' inside, which is perfectly printable.
Then, again - you'd like to do the opposite - read string encoded in UTF-8 and treat it as an Unicode, so the \u360 would be one character, not 5. Then you decode a string (with selected encoding) and get brand new object of the unicode type.
Just as a side note - you can select some pervert encoding, like 'zip', 'base64', 'rot' and some of them will convert from string to string, but I believe the most common case is one that involves UTF-8/UTF-16 and string.
There are a few encodings that can be used to de-/encode from str to str or from unicode to unicode. For example base64, hex or even rot13. They are listed in the codecs module.
Edit:
The decode message on a unicode string can undo the corresponding encode operation:
In [1]: u'0a'.decode('hex')
Out[1]: '\n'
The returned type is str instead of unicode which is unfortunate in my opinion. But when you are not doing a proper en-/decode between str and unicode this looks like a mess anyway.
To represent a unicode string as a string of bytes is known as encoding. Use u'...'.encode(encoding)
.
Example:
>>> u'æøå'.encode('utf8') '\xc3\x83\xc2\xa6\xc3\x83\xc2\xb8\xc3\x83\xc2\xa5' >>> u'æøå'.encode('latin1') '\xc3\xa6\xc3\xb8\xc3\xa5' >>> u'æøå'.encode('ascii') UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
You typically encode a unicode string whenever you need to use it for IO, for instance transfer it over the network, or save it to a disk file.
To convert a string of bytes to a unicode string is known as decoding. Use unicode('...', encoding)
or '...'.decode(encoding).
Example:
>>> u'æøå' u'\xc3\xa6\xc3\xb8\xc3\xa5' # the interpreter prints the unicode object like so >>> unicode('\xc3\xa6\xc3\xb8\xc3\xa5', 'latin1') u'\xc3\xa6\xc3\xb8\xc3\xa5' >>> '\xc3\xa6\xc3\xb8\xc3\xa5'.decode('latin1') u'\xc3\xa6\xc3\xb8\xc3\xa5'
You typically decode a string of bytes whenever you receive string data from the network or from a disk file.
I believe there are some changes in unicode handling in python 3, so the above is probably not correct for python 3.
Some good links:
The simple answer is that they are the exact opposite of each other.
The computer uses the very basic unit of byte to store and process information; it is meaningless for human eyes.
For example,'\xe4\xb8\xad\xe6\x96\x87' is the representation of two Chinese characters, but the computer only knows (meaning print or store) it is Chinese Characters when they are given a dictionary to look for that Chinese word, in this case, it is a "utf-8" dictionary, and it would fail to correctly show the intended Chinese word if you look into a different or wrong dictionary (using a different decoding method).
In the above case, the process for a computer to look for Chinese word is decode()
.
And the process of computer writing the Chinese into computer memory is encode()
.
So the encoded information is the raw bytes, and the decoded information is the raw bytes and the name of the dictionary to reference (but not the dictionary itself).
anUnicode.encode('encoding') results in a string object and can be called on a unicode object
aString.decode('encoding') results in an unicode object and can be called on a string, encoded in given encoding.
Some more explanations:
You can create some unicode object, which doesn't have any encoding set. The way it is stored by Python in memory is none of your concern. You can search it, split it and call any string manipulating function you like.
But there comes a time, when you'd like to print your unicode object to console or into some text file. So you have to encode it (for example - in UTF-8), you call encode('utf-8') and you get a string with '\u<someNumber>' inside, which is perfectly printable.
Then, again - you'd like to do the opposite - read string encoded in UTF-8 and treat it as an Unicode, so the \u360 would be one character, not 5. Then you decode a string (with selected encoding) and get brand new object of the unicode type.
Just as a side note - you can select some pervert encoding, like 'zip', 'base64', 'rot' and some of them will convert from string to string, but I believe the most common case is one that involves UTF-8/UTF-16 and string.
The simple answer is that they are the exact opposite of each other.
The computer uses the very basic unit of byte to store and process information; it is meaningless for human eyes.
For example,'\xe4\xb8\xad\xe6\x96\x87' is the representation of two Chinese characters, but the computer only knows (meaning print or store) it is Chinese Characters when they are given a dictionary to look for that Chinese word, in this case, it is a "utf-8" dictionary, and it would fail to correctly show the intended Chinese word if you look into a different or wrong dictionary (using a different decoding method).
In the above case, the process for a computer to look for Chinese word is decode()
.
And the process of computer writing the Chinese into computer memory is encode()
.
So the encoded information is the raw bytes, and the decoded information is the raw bytes and the name of the dictionary to reference (but not the dictionary itself).
To represent a unicode string as a string of bytes is known as encoding. Use u'...'.encode(encoding)
.
Example:
>>> u'æøå'.encode('utf8') '\xc3\x83\xc2\xa6\xc3\x83\xc2\xb8\xc3\x83\xc2\xa5' >>> u'æøå'.encode('latin1') '\xc3\xa6\xc3\xb8\xc3\xa5' >>> u'æøå'.encode('ascii') UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
You typically encode a unicode string whenever you need to use it for IO, for instance transfer it over the network, or save it to a disk file.
To convert a string of bytes to a unicode string is known as decoding. Use unicode('...', encoding)
or '...'.decode(encoding).
Example:
>>> u'æøå' u'\xc3\xa6\xc3\xb8\xc3\xa5' # the interpreter prints the unicode object like so >>> unicode('\xc3\xa6\xc3\xb8\xc3\xa5', 'latin1') u'\xc3\xa6\xc3\xb8\xc3\xa5' >>> '\xc3\xa6\xc3\xb8\xc3\xa5'.decode('latin1') u'\xc3\xa6\xc3\xb8\xc3\xa5'
You typically decode a string of bytes whenever you receive string data from the network or from a disk file.
I believe there are some changes in unicode handling in python 3, so the above is probably not correct for python 3.
Some good links:
Source: Stackoverflow.com