Always encode from unicode to bytes.
In this direction, you get to choose the encoding.
>>> u"??".encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print _
你好
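Because you pick the codec, the same two characters can become entirely different bytes. For example, encoding to GBK instead (a sketch; the bytes shown are what the gbk codec should produce for these two characters):

>>> u"你好".encode("gbk")
'\xc4\xe3\xba\xc3'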
The other way is to decode from bytes to unicode.
In this direction, you have to know what the encoding is.
>>> bytes = '\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print bytes
你好
>>> bytes.decode('utf-8')
u'\u4f60\u597d'
>>> print _
你好
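If you guess the wrong encoding, the decode may still "succeed" and quietly hand you garbage. Decoding the same UTF-8 bytes as Latin-1, for instance:

>>> bytes.decode('latin-1')
u'\xe4\xbd\xa0\xe5\xa5\xbd'

Six Latin-1 characters come back instead of the two characters you actually wanted, and nothing warns you.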
This point can't be stressed enough. If you want to avoid playing unicode "whack-a-mole", it's important to understand what's happening at the data level. Here it is explained another way:
To get unicode out of a byte string, you call .decode on it; to get bytes out of a unicode string, you call .encode on it. Now, on seeing .encode on a byte string, Python 2 first tries to implicitly convert it to text (a unicode object). Similarly, on seeing .decode on a unicode string, Python 2 implicitly tries to convert it to bytes (a str object).
These implicit conversions are why you can get a UnicodeDecodeError when you've called encode. It's because encode accepts a parameter of type unicode; when it receives a str parameter instead, there's an implicit decoding into an object of type unicode before re-encoding it with the encoding you asked for. This implicit conversion uses the default 'ascii' decoder†, giving you a decoding error inside an encoder.
In fact, in Python 3 the methods str.decode and bytes.encode don't even exist. Their removal was a [controversial] attempt to avoid this common confusion.
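In a Python 3 shell the same mistakes surface immediately as attribute errors instead (a sketch; the exact messages vary a little between 3.x releases):

>>> b'\xe4\xbd\xa0\xe5\xa5\xbd'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'encode'
>>> "你好".decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'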
† ...or whatever encoding sys.getdefaultencoding() reports; usually this is 'ascii'.
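You can check which default your interpreter is using (on a stock Python 2 this prints 'ascii'):

>>> import sys
>>> sys.getdefaultencoding()
'ascii'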