[python] Python: Converting from ISO-8859-1/latin1 to UTF-8

I have a string that has been decoded from Quoted-printable to ISO-8859-1 with the email module. This gives me strings like "\xC4pple", which corresponds to "Äpple" ("apple" in Swedish). However, I can't convert those strings to UTF-8.

>>> apple = "\xC4pple"
>>> apple
'\xc4pple'
>>> apple.encode("UTF-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

What should I do?


The answers are:


Try decoding it first, then encoding:

apple.decode('iso-8859-1').encode('utf8')

Decode to Unicode, encode the results to UTF8.

apple.decode('latin1').encode('utf8')
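
For instance, in a Python 2 interpreter session (the printed output assumes a UTF-8 terminal), the round trip looks like this:

>>> apple = "\xC4pple"
>>> apple.decode('latin1')
u'\xc4pple'
>>> apple.decode('latin1').encode('utf8')
'\xc3\x84pple'
>>> print apple.decode('latin1').encode('utf8')
Äpple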

import MySQLdb

# assumes concept is a unicode object: drop characters outside ASCII,
# then escape the trimmed value for use in a MySQL statement
concept = concept.encode('ascii', 'ignore')
concept = MySQLdb.escape_string(concept.decode('latin1').encode('utf8').rstrip())

I do this; I am not sure if it is a good approach, but it works every time!
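
For context, here is a rough sketch of how such a value might be stored with MySQLdb; the connection details and table/column names are purely illustrative, and passing the value as a parameter to cursor.execute lets the driver do the escaping itself:

import MySQLdb

# connection parameters and table/column names here are placeholders
db = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                     db="testdb", charset="utf8")
cur = db.cursor()

# the driver escapes the parameter itself, so no manual escape_string is needed
cur.execute("INSERT INTO fruits (name) VALUES (%s)",
            (apple.decode('latin1').encode('utf8'),))
db.commit()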


This is a common problem, so here's a relatively thorough illustration.

For non-unicode strings (i.e. those without the u prefix, such as '\xc4pple' as opposed to u'\xc4pple'), one must decode from the encoding the bytes are actually in (here iso8859-1/latin1; implicit conversions otherwise use ASCII, unless the default has been changed with the discouraged sys.setdefaultencoding function) to unicode, then encode to a character set that can represent the characters you want; in this case I'd recommend UTF-8.

First, here is a handy utility function that'll help illuminate the patterns of Python 2.7 string and unicode:

>>> def tell_me_about(s): return (type(s), s)

A plain string

>>> v = "\xC4pple" # iso-8859-1 aka latin1 encoded string

>>> tell_me_about(v)
(<type 'str'>, '\xc4pple')

>>> v
'\xc4pple'        # representation in memory

>>> print v
?pple             # the raw bytes go straight to the terminal; '\xc4' on
                  # its own is not valid UTF-8 (the terminal's encoding
                  # here), so it shows up as "?".

Decoding a iso8859-1 string - convert plain string to unicode

>>> uv = v.decode("iso-8859-1")
>>> uv
u'\xc4pple'       # decoding iso-8859-1 becomes unicode, in memory

>>> tell_me_about(uv)
(<type 'unicode'>, u'\xc4pple')

>>> print v.decode("iso-8859-1")
Äpple             # convert unicode to the default character set
                  # (utf-8, based on sys.stdout.encoding)

>>> v.decode('iso-8859-1') == u'\xc4pple'
True              # one could have just used a unicode representation 
                  # from the start

A little more illustration — with “Ä”

>>> u"Ä" == u"\xc4"
True              # the native unicode char and escaped versions are the same

>>> "Ä" == u"\xc4"  
False             # the native unicode char is '\xc3\x84' in latin1

>>> "Ä".decode('utf8') == u"\xc4"
True              # one can decode the string to get unicode

>>> "Ä" == "\xc4"
False             # the native character and the escaped string are
                  # of course not equal ('\xc3\x84' != '\xc4').

Encoding to UTF

>>> u8 = v.decode("iso-8859-1").encode("utf-8")
>>> u8
'\xc3\x84pple'    # convert iso-8859-1 to unicode to utf-8

>>> tell_me_about(u8)
(<type 'str'>, '\xc3\x84pple')

>>> u16 = v.decode('iso-8859-1').encode('utf-16')
>>> tell_me_about(u16)
(<type 'str'>, '\xff\xfe\xc4\x00p\x00p\x00l\x00e\x00')

>>> tell_me_about(u8.decode('utf8'))
(<type 'unicode'>, u'\xc4pple')

>>> tell_me_about(u16.decode('utf16'))
(<type 'unicode'>, u'\xc4pple')

Relationship between unicode and UTF and latin1

>>> print u8
Äpple             # printing the utf-8 bytes; the terminal expects utf-8,
                  # so the characters display correctly

>>> print u8.decode('utf-8') # printing unicode
Äpple

>>> print u16     # printing 'bytes' of u16
???pple

>>> print u16.decode('utf16')
Äpple             # printing unicode

>>> v == u8
False             # v is an iso8859-1 string; u8 is a utf-8 string

>>> v.decode('iso8859-1') == u8
False             # v.decode(...) returns unicode

>>> u8.decode('utf-8') == v.decode('latin1') == u16.decode('utf-16')
True              # all decode to the same unicode memory representation
                  # (latin1 is iso-8859-1)

Unicode Exceptions

>>> u8.encode('iso8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
  ordinal not in range(128)

>>> u16.encode('iso8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0:
  ordinal not in range(128)

>>> v.encode('iso8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0:
  ordinal not in range(128)

One gets around these by first converting from the specific encoding (latin-1, utf8, utf16) to unicode, e.g. u8.decode('utf8').encode('latin1').
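
For example, continuing the same session, the round trip through unicode succeeds where the direct encode fails:

>>> u8.decode('utf8').encode('latin1')
'\xc4pple'
>>> u16.decode('utf16').encode('latin1')
'\xc4pple'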

So perhaps one could draw the following principles and generalizations:

  • a type str is a sequence of bytes, which may hold text in any of a number of encodings such as Latin-1, UTF-8, and UTF-16
  • a type unicode is a sequence of code points; it can be encoded to any number of byte encodings, most commonly UTF-8 and latin-1 (iso8859-1)
  • the print command encodes unicode using sys.stdout.encoding, which depends on the terminal (UTF-8 in the examples above; see the check below)
  • One must decode a str to unicode before converting it to another encoding.
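
One can check what print will use; the value shown below is simply what a UTF-8 terminal reports and will differ in other environments:

>>> import sys
>>> sys.stdout.encoding
'UTF-8'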

Of course, all of this changes in Python 3.x.
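
As a rough sketch of the Python 3 picture: str is always Unicode there, raw bytes get the separate bytes type, and the same round trip looks like this:

>>> v = b"\xC4pple"                 # a bytes literal, not str
>>> v.decode("iso-8859-1")
'Äpple'
>>> v.decode("iso-8859-1").encode("utf-8")
b'\xc3\x84pple'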

Hope that is illuminating.

Further reading

See also the very illustrative rants by Armin Ronacher on Unicode in Python.


For Python 3:

bytes(apple, 'iso-8859-1').decode('utf-8')

I used this for text that had been incorrectly decoded as iso-8859-1 instead of utf-8 (showing words like VeÅ\x99ejné). It produces the correct version, Veřejné.
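
A minimal sketch of that repair (the sample word is just an illustration of the same kind of mojibake):

>>> mojibake = 'VeÅ\x99ejnÃ©'       # UTF-8 bytes of 'Veřejné' mistakenly decoded as latin1
>>> bytes(mojibake, 'iso-8859-1').decode('utf-8')
'Veřejné'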