Convert Unicode to ASCII without errors in Python

Question

My code just scrapes a web page  then converts it to Unicode   html   urllib urlopen link  read   html encode  utf8   ignore   self response out write html    But I get a UnicodeDecodeError     Traceback  most recent call last     File   Applications GoogleAppEngineLauncher app Contents Resources GoogleAppEngine-default bundle Contents Resources google appengine google appengine ext webapp   init   py   line 507  in   call       handler get  groups    File   Users greg clounce main py   line 55  in get     html encode  utf8   ignore   UnicodeDecodeError   ascii  codec can t decode byte 0xa0 in position 2818  ordinal not in range 128    I assume that means the HTML contains some wrongly-formed attempt at Unicode somewhere  Can I just drop whatever code bytes are causing the problem instead of getting an error

User · Answer

For broken consoles like cmd exe and HTML output you can always use   my unicode string encode  ascii   xmlcharrefreplace     This will preserve all the non-ascii chars while making them printable in pure ASCII and in HTML   WARNING  If you use this in production code to avoid errors then most likely there is something wrong in your code  The only valid use case for this is printing to a non-unicode console or easy conversion to HTML entities in an HTML context    And finally  if you are on windows and use cmd exe then you can type chcp 65001 to enable utf-8 output  works with Lucida Console font   You might need to add myUnicodeString encode  utf8

User · Answer

Looks like you are using python 2 x   Python 2 x defaults to ascii and it doesn   t know about Unicode  Hence the exception   Just paste the below line after shebang  it will work    - - coding  utf-8 - -

User · Answer

gt  gt  gt  u a     encode  ascii    ignore    a    Decode the string you get back  using either the charset in the the appropriate meta tag in the response or in the Content-Type header  then encode   The method encode encoding  errors  accepts custom handlers for errors  The default values  besides ignore  are    gt  gt  gt  u a     encode  ascii    replace   b a     gt  gt  gt  u a     encode  ascii    xmlcharrefreplace   b a amp  12354  amp  228    gt  gt  gt  u a     encode  ascii    backslashreplace   b a  u3042  xe4    See https   docs python org 3 library stdtypes html str encode

User · Answer

I use this helper function throughout all of my projects  If it can t convert the unicode  it ignores it  This ties into a django library  but with a little research you could bypass it   from django utils import encoding  def convert unicode to string x                gt  gt  gt  convert unicode to string u ni xf1era        niera              return encoding smart str x  encoding  ascii   errors  ignore     I no longer get any unicode errors after using this

User · Answer

unicodestring     xa0   decoded str   unicodestring decode  windows-1252   encoded str   decoded str encode  ascii    ignore     Works for me

User · Answer

You wrote    I assume that means the HTML contains some wrongly-formed attempt at unicode somewhere      The HTML is NOT expected to contain any kind of  attempt at unicode   well-formed or not  It must of necessity contain Unicode characters encoded in some encoding  which is usually supplied up front     look for  charset     You appear to be assuming that the charset is UTF-8     on what grounds  The   xA0  byte that is shown in your error message indicates that you may have a single-byte charset e g  cp1252   If you can t get any sense out of the declaration at the start of the HTML  try using chardet to find out what the likely encoding is   Why have you tagged your question with  regex    Update after you replaced your whole question with a non-question   html   urllib urlopen link  read     html refers to a str object  To get unicode  you need to find out   how it is encoded  and decode it   html encode  utf8   ignore     problem 1  will fail because html is a str object    encode works on unicode objects so Python tries to decode it using     ascii  and fails   problem 2  even if it worked  the result will be ignored  it doesn t    update html in situ  it returns a function result    problem 3   ignore  with UTF-n  any valid unicode object    should be encodable in UTF-n  error implies end of the world    don t try to ignore it  Don t just whack in  ignore  willy-nilly    put it in only with a comment explaining your very cogent reasons for doing so     ignore  with most other encodings  error implies that you are mistaken   in your choice of encoding -- same advice as for UTF-n  -     ignore  with decode latin1 aka iso-8859-1  error implies end of the world    Irrespective of error or not  you are probably mistaken    needing e g  cp1252 or even cp850 instead   -

User · Answer

As an extension to Ignacio Vazquez-Abrams  answer   gt  gt  gt  u a     encode  ascii    ignore    a    It is sometimes desirable to remove accents from characters and print the base form  This can be accomplished with   gt  gt  gt  import unicodedata  gt  gt  gt  unicodedata normalize  NFKD   u a      encode  ascii    ignore    aa    You may also want to translate other characters  such as punctuation  to their nearest equivalents  for instance the RIGHT SINGLE QUOTATION MARK unicode character does not get converted to an ascii APOSTROPHE when encoding    gt  gt  gt  print u  u2019       gt  gt  gt  unicodedata name u  u2019    RIGHT SINGLE QUOTATION MARK   gt  gt  gt  u  u2019  encode  ascii    ignore        Note we get an empty string back  gt  gt  gt  u  u2019  replace u  u2019   u      encode  ascii    ignore         Although there are more efficient ways to accomplish this  See this question for more details Where is Python  39 s  quot best ASCII for this Unicode quot  database

User · Answer

2018 Update   As of February 2018  using compressions like gzip has become quite popular  around 73  of all websites use it  including large sites like Google  YouTube  Yahoo  Wikipedia  Reddit  Stack Overflow and Stack Exchange Network sites   If you do a simple decode like in the original answer with a gzipped response  you ll get an error like or similar to this      UnicodeDecodeError   utf8  codec can t decode byte 0x8b in position 1  unexpected code byte   In order to decode a gzpipped response you need to add the following modules  in Python 3    import gzip import io   Note  In Python 2 you d use StringIO instead of io  Then you can parse the content out like this   response   urlopen  https   example com gzipped-ressource   buffer   io BytesIO response read      Use StringIO StringIO response read    in Python 2 gzipped file   gzip GzipFile fileobj buffer  decoded   gzipped file read   content   decoded decode  utf-8     Replace utf-8 with the source encoding of your requested resource   This code reads the response  and places the bytes in a buffer  The gzip module then reads the buffer using the GZipFile function  After that  the gzipped file can be read into bytes again and decoded to normally readable text in the end   Original Answer from 2010   Can we get the actual value used for link   In addition  we usually encounter this problem here when we are trying to  encode   an already encoded byte string  So you might try to decode it first as in  html   urllib urlopen link  read   unicode str   html decode  lt source encoding gt   encoded str   unicode str encode  utf8     As an example   html     xa0  encoded str   html encode  utf8     Fails with  UnicodeDecodeError   ascii  codec can t decode byte 0xa0 in position 0  ordinal not in range 128    While   html     xa0  decoded str   html decode  windows-1252   encoded str   decoded str encode  utf8     Succeeds without error  Do note that  windows-1252  is something I used as an example  I got this from chardet and it had 0 5 confidence that it is right   well  as given with a 1-character-length string  what do you expect  You should change that to the encoding of the byte string returned from  urlopen   read   to what applies to the content you retrieved   Another problem I see there is that the  encode   string method returns the modified string and does not modify the source in place  So it s kind of useless to have self response out write html  as html is not the encoded string from html encode  if that is what you were originally aiming for    As Ignacio suggested  check the source webpage for the actual encoding of the returned string from read    It s either in one of the Meta tags or in the ContentType header in the response  Use that then as the parameter for  decode     Do note however that it should not be assumed that other developers are responsible enough to make sure the header and or meta character set declarations match the actual content   Which is a PITA  yeah  I should know  I was one of those before

User · Answer

I think the answer is there but only in bits and pieces  which makes it difficult to quickly fix the problem such as   UnicodeDecodeError   ascii  codec can t decode byte 0xa0 in position 2818  ordinal not in range 128    Let s take an example  Suppose I have file which has some data in the following form   containing ascii and non-ascii chars    1 10 17  21 36 - Land   Welcome               and we want to ignore and preserve only ascii characters    This code will do   import unicodedata fp    open  lt FILENAME gt   for line in fp      rline   line strip       rline   unicode rline   utf-8       rline   unicodedata normalize  NFKD   rline  encode  ascii   ignore       if len rline     0          print rline   and type rline  will give you    gt type rline    lt type  str  gt

User · Answer

Use unidecode - it even converts weird characters to ascii instantly  and even converts Chinese to phonetic ascii     pip install unidecode   then    gt  gt  gt  from unidecode import unidecode  gt  gt  gt  unidecode u       Bei Jing   gt  gt  gt  unidecode u   koda    Skoda

User · Answer

If you have a string line  you can use the  encode  encoding    errors  strict    method for strings to convert encoding types   line    my big string   line encode  ascii    ignore    For more information about handling ASCII and unicode in Python  this is a really useful site  https   docs python org 2 howto unicode html

[python] Convert Unicode to ASCII without errors in Python

Examples related to python

Examples related to unicode

Examples related to utf-8

Examples related to character-encoding

Examples related to ascii