string encoding and decoding

Question

Here are my attempts with error messages. What am I doing wrong?

    string.decode("ascii", "ignore")

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 37: ordinal not in range(128)

    string.encode('utf-8', "ignore")

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 37: ordinal not in range(128)

User · Accepted Answer

You can't decode a unicode, and you can't encode a str. Try doing it the other way around.
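
A minimal sketch of which direction each operation runs. It is written for Python 3 as an assumption (the question concerns Python 2, where the byte type is str and the text type is unicode), and the sample value is hypothetical:

```python
# Python 3 sketch: bytes is the byte type, str is the text type.
# (In Python 2 the corresponding types are str and unicode.)

raw = b"caf\xc3\xa9"          # hypothetical UTF-8 encoded bytes

text = raw.decode("utf-8")    # decode: bytes -> text
back = text.encode("utf-8")   # encode: text -> bytes

print(type(text).__name__, type(back).__name__)
```

Decoding always starts from bytes and encoding always starts from text, which is the "other way around" from the calls in the question.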

User · Answer

Guessing at all the things omitted from the original question, but, assuming Python 2.x, the key is to read the error messages carefully: in particular where you call 'encode' but the message says 'decode' and vice versa, but also the types of the values included in the messages.

In the first example string is of type unicode and you attempted to decode it, which is an operation converting a byte string to unicode. Python helpfully attempted to convert the unicode value to str using the default 'ascii' encoding, but since your string contained a non-ascii character you got the error which says that Python was unable to encode a unicode value. Here's an example which shows the type of the input string:

    >>> u"\xa0".decode("ascii", "ignore")

    Traceback (most recent call last):
      File "<pyshell#7>", line 1, in <module>
        u"\xa0".decode("ascii", "ignore")
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)

In the second case you do the reverse, attempting to encode a byte string. Encoding is an operation that converts unicode to a byte string, so Python helpfully attempts to convert your byte string to unicode first and, since you didn't give it an ascii string, the default ascii decoder fails:

    >>> "\xc2".encode("ascii", "ignore")

    Traceback (most recent call last):
      File "<pyshell#6>", line 1, in <module>
        "\xc2".encode("ascii", "ignore")
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
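
The hidden conversion step can be spelled out explicitly. The following is only a sketch in Python 3 (an assumption; the tracebacks above come from Python 2), performing by hand the ascii conversion that Python 2 attempts implicitly in each case:

```python
# Python 3 sketch of the implicit step behind each Python 2 error.

first_error = None
second_error = None

text = "\xa0"                     # text value (Python 2: unicode)
try:
    # Python 2's unicode.decode() first encodes the text with 'ascii'
    text.encode("ascii")
except UnicodeEncodeError as exc:
    first_error = type(exc).__name__

data = b"\xc2"                    # byte value (Python 2: str)
try:
    # Python 2's str.encode() first decodes the bytes with 'ascii'
    data.decode("ascii")
except UnicodeDecodeError as exc:
    second_error = type(exc).__name__

print(first_error, second_error)
```

This is why a call to decode can produce an encode error and a call to encode can produce a decode error.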

User · Answer

Aside from getting decode and encode backwards, I think part of the answer here is actually: don't use the ascii encoding. It's probably not what you want.

To begin with, think of str like you would a plain text file. It's just a bunch of bytes with no encoding actually attached to it. How it's interpreted is up to whatever piece of code is reading it. If you don't know what this paragraph is talking about, go read Joel's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets" right now before you go any further.

Naturally, we're all aware of the mess that created. The answer is to, at least within memory, have a standard encoding for all strings. That's where unicode comes in. I'm having trouble tracking down exactly what encoding Python uses internally for sure, but it doesn't really matter just for this. The point is that you know it's a sequence of bytes that are interpreted a certain way. So you only need to think about the characters themselves, and not the bytes.

The problem is that in practice, you run into both. Some libraries give you a str, and some expect a str. Certainly that makes sense whenever you're streaming a series of bytes (such as to or from disk or over a web request). So you need to be able to translate back and forth.

Enter codecs: it's the translation library between these two data types. You use encode to generate a sequence of bytes (str) from a text string (unicode), and you use decode to get a text string (unicode) from a sequence of bytes (str).

For example:

    >>> import codecs
    >>> s = "I look like a string, but I'm actually a sequence of bytes. \xe2\x9d\xa4"
    >>> codecs.decode(s, 'utf-8')
    u"I look like a string, but I'm actually a sequence of bytes. \u2764"

What happened here? I gave Python a sequence of bytes, and then I told it, "Give me the unicode version of this, given that this sequence of bytes is in 'utf-8'." It did as I asked, and those bytes (a heart character) are now treated as a whole, represented by their Unicode codepoint.

Let's go the other way around:

    >>> u = u"I'm a string! Really! \u2764"
    >>> codecs.encode(u, 'utf-8')
    "I'm a string! Really! \xe2\x9d\xa4"

I gave Python a Unicode string, and I asked it to translate the string into a sequence of bytes using the 'utf-8' encoding. So it did, and now the heart is just a bunch of bytes it can't print as ASCII, so it shows me the hexadecimal instead.

We can work with other encodings, too, of course:

    >>> s = "I have a section \xa7"
    >>> codecs.decode(s, 'latin1')
    u'I have a section \xa7'
    >>> codecs.decode(s, 'latin1')[-1] == u'\u00A7'
    True

    >>> u = u"I have a section \u00a7"
    >>> u
    u'I have a section \xa7'
    >>> codecs.encode(u, 'latin1')
    'I have a section \xa7'

('\xa7' is the section character, in both Unicode and Latin-1.)

So for your question, you first need to figure out what encoding your str is in.

Did it come from a file? From a web request? From your database? Then the source determines the encoding. Find out the encoding of the source and use that to translate it into a unicode.

    s = [get from external source]
    u = codecs.decode(s, 'utf-8')  # Replace utf-8 with the actual input encoding

Or maybe you're trying to write it out somewhere. What encoding does the destination expect? Use that to translate it into a str. UTF-8 is a good choice for plain text documents; most things can read it.

    u = u'My string'
    s = codecs.encode(u, 'utf-8')  # Replace utf-8 with the actual output encoding
    [Write s out somewhere]

Are you just translating back and forth in memory for interoperability or something? Then just pick an encoding and stick with it; 'utf-8' is probably the best choice for that:

    u = u'My string'
    s = codecs.encode(u, 'utf-8')
    newu = codecs.decode(s, 'utf-8')

In modern programming, you probably never want to use the 'ascii' encoding for any of this. It's an extremely small subset of all possible characters, and no system I know of uses it by default or anything.

Python 3 does its best to make this immensely clearer simply by changing the names. In Python 3, str was replaced with bytes, and unicode was replaced with str.
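
To illustrate why the encoding of the source matters, here is a small sketch using the Python 3 names (bytes and str), which is an assumption since the rest of this answer uses Python 2. The variable names are hypothetical:

```python
import codecs

section = "\u00a7"                                 # the section character

# The same character becomes different bytes under different encodings.
latin1_bytes = codecs.encode(section, "latin1")    # one byte
utf8_bytes = codecs.encode(section, "utf-8")       # two bytes

# Decoding with the encoding that produced the bytes round-trips cleanly.
round_trip = codecs.decode(utf8_bytes, "utf-8")

# Decoding with a different encoding raises no error here,
# but silently yields the wrong characters.
wrong = codecs.decode(utf8_bytes, "latin1")

print(latin1_bytes, utf8_bytes, round_trip == section, len(wrong))
```

The bytes alone do not say which interpretation is right; only knowledge of where they came from does.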

User · Answer

That's because your input string can't be converted according to the encoding rules (strict by default).

I'm not sure, but I have always done the conversion directly with the unicode() constructor; at least that is the way shown in the official documentation:

    unicode(your_str, errors="ignore")
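
A sketch of how the error-handling modes differ, written for Python 3 as an assumption (there, str(bytes_value, encoding, errors) plays the role of Python 2's unicode() constructor). The sample bytes are hypothetical:

```python
# Hypothetical UTF-8 bytes containing a non-breaking space (0xc2 0xa0).
raw = b"price:\xc2\xa0100"

# 'strict' is the default and raises on bytes the codec cannot handle.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' drops the offending bytes; 'replace' substitutes U+FFFD for each.
ignored = raw.decode("ascii", errors="ignore")
replaced = raw.decode("ascii", errors="replace")

# Naming the real encoding keeps the character instead of discarding it.
proper = str(raw, "utf-8")

print(strict_failed, repr(ignored), repr(replaced), repr(proper))
```

Note that "ignore" hides the error by losing data, so it is usually a last resort compared with decoding using the correct encoding.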
