UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

Question

I'm having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup.

The problem is that the error is not always reproducible; it sometimes works with some pages, and sometimes it barfs by throwing a UnicodeEncodeError. I have tried just about everything I can think of, and yet I have not found anything that works consistently without throwing some kind of Unicode-related error.

One of the sections of code that is causing problems is shown below:

    agent_telno = agent.find('div', 'agent_contact_number')
    agent_telno = '' if agent_telno is None else agent_telno.contents[0]
    p.agent_info = str(agent_contact + ' ' + agent_telno).strip()

Here is a stack trace produced on SOME strings when the snippet above is run:

    Traceback (most recent call last):
      File "foobar.py", line 792, in <module>
        p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

I suspect that this is because some pages (or more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption, so there are no issues relating to internationalization or dealing with text written in anything other than English.

Does anyone have any ideas as to how to solve this so that I can CONSISTENTLY fix this problem?
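To see which character trips the codec on a given page, the exception itself carries the position. A minimal sketch in Python 3 syntax; `find_unencodable` is a hypothetical helper name and the sample string is made up to mirror the traceback above:

```python
def find_unencodable(text, codec="ascii"):
    """Return (index, character) of the first character that `codec`
    cannot encode, or None if the whole string encodes cleanly."""
    try:
        text.encode(codec)
    except UnicodeEncodeError as exc:
        # exc.start is the index of the first offending character
        return exc.start, text[exc.start]
    return None


# Hypothetical sample: 20 ASCII characters followed by a no-break space,
# mirroring "character u'\xa0' in position 20" from the traceback.
sample = u"x" * 20 + u"\xa0"
print(find_unencodable(sample))
```

Running this over the text of a failing page should show whether the culprit is always the no-break space (U+00A0, what `&nbsp;` becomes after parsing) or something else.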

User · Answer

In the general case of writing this unsupported encoding string (let's say data_that_causes_this_error) to some file (e.g. results.txt), this works:

    f = open("results.txt", "w")
    f.write(data_that_causes_this_error.encode('utf-8'))
    f.close()
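On Python 3 (or on Python 2 via `io.open`), the encoding can instead be given when the file is opened, so the write call takes text directly. A sketch under those assumptions; the sample text and the temporary path are placeholders:

```python
import io
import os
import tempfile

# Placeholder text containing a no-break space (U+00A0)
data_that_causes_this_error = u"Tel:\xa0020 7946 0000"

# A temporary directory keeps the sketch from touching real files
path = os.path.join(tempfile.mkdtemp(), "results.txt")

# io.open encodes on write, so no explicit .encode() call is needed
with io.open(path, "w", encoding="utf-8") as f:
    f.write(data_that_causes_this_error)

# Reading back with the same encoding returns the original text
with io.open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()
```

The `with` block also closes the file even if the write raises, which the open/write/close sequence does not.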

User · Answer

Simple helper functions found here:

    def safe_unicode(obj, *args):
        """ return the unicode representation of obj """
        try:
            return unicode(obj, *args)
        except UnicodeDecodeError:
            # obj is byte string
            ascii_text = str(obj).encode('string_escape')
            return unicode(ascii_text)

    def safe_str(obj):
        """ return the byte string representation of obj """
        try:
            return str(obj)
        except UnicodeEncodeError:
            # obj is unicode
            return unicode(obj).encode('unicode_escape')

User · Answer

We struck this error when running manage.py migrate in Django with localized fixtures.

Our source contained the # -*- coding: utf-8 -*- declaration, MySQL was correctly configured for utf8, and Ubuntu had the appropriate language pack and values in /etc/default/locale.

The issue was simply that the Django container (we use Docker) was missing the LANG env var.

Setting LANG to en_US.UTF-8 and restarting the container before re-running migrations fixed the problem.

User · Answer

Update for Python 3.0 and later. Try the following in the shell (these are shell commands, not Python):

    locale-gen en_US.UTF-8
    export LANG=en_US.UTF-8 LANGUAGE=en_US.en LC_ALL=en_US.UTF-8

This sets the system's default locale encoding to UTF-8.

More can be read at PEP 538 -- Coercing the legacy C locale to a UTF-8 based locale.

User · Answer

Try to avoid converting the variable with str(variable). Sometimes it may cause the issue.

Simple tip to avoid it:

    try:
        data = str(data)
    except:
        data = data  # Don't convert to String

The above example will solve the encode error as well.

User · Answer

The problem is that you're trying to print a unicode character, but your terminal doesn't support it.

You can try installing the language-pack-en package to fix that:

    sudo apt-get install language-pack-en

which provides English translation data updates for all supported packages (including Python). Install a different language package if necessary (depending on which characters you're trying to print).

On some Linux distributions it's required in order to make sure that the default English locales are set up properly (so unicode characters can be handled by the shell/terminal). Sometimes it's easier to install it than to configure it manually.

Then, when writing the code, make sure you use the right encoding in your code. For example:

    open(foo, encoding='utf-8')

If you still have a problem, double check your system configuration, such as:

Your locale file (/etc/default/locale), which should have e.g.

    LANG="en_US.UTF-8"
    LC_ALL="en_US.UTF-8"

or:

    LC_ALL=C.UTF-8
    LANG=C.UTF-8

The value of LANG/LC_CTYPE in the shell.

Which locales your shell supports:

    locale -a | grep "UTF-8"

Demonstrating the problem and solution in a fresh VM:

1. Initialize and provision the VM (e.g. using vagrant):

    vagrant init ubuntu/trusty64; vagrant up; vagrant ssh

See: available Ubuntu boxes.

2. Print a unicode character (such as the trade mark sign, ™):

    $ python -c 'print(u"\u2122");'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 0: ordinal not in range(128)

3. Now install language-pack-en:

    $ sudo apt-get -y install language-pack-en
    The following extra packages will be installed:
      language-pack-en-base
    Generating locales...
      en_GB.UTF-8... /usr/sbin/locale-gen: done
    Generation complete.

4. Now the problem should be solved:

    $ python -c 'print(u"\u2122");'
    ™

5. Otherwise, try the following command:

    $ LC_ALL=C.UTF-8 python -c 'print(u"\u2122");'
    ™

User · Answer

Please open a terminal and run the command below:

    export LC_ALL="en_US.UTF-8"

User · Answer

I always put the code below in the first two lines of my Python files:

    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals

User · Answer

This is a classic Python unicode pain point! Consider the following:

    a = u'bats\u00E0'
    print a
     => batsà

All good so far, but if we call str(a), let's see what happens:

    str(a)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

Oh dip, that's not gonna do anyone any good! To fix the error, encode the unicode string to bytes explicitly with .encode and tell Python what codec to use:

    a.encode('utf-8')
     => 'bats\xc3\xa0'
    print a.encode('utf-8')
     => batsà

Voil\u00E0!

The issue is that when you call str(), Python 2 uses the default character encoding (ASCII) to try to encode the unicode object you gave it, and in your case the text sometimes contains non-ASCII characters. To fix the problem, you have to tell Python how to deal with the string you give it by using .encode('whatever_unicode'). Most of the time, you should be fine using utf-8.

For an excellent exposition on this topic, see Ned Batchelder's PyCon talk here: http://nedbatchelder.com/text/unipain.html
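The same distinction can be checked in Python 3, where `str` is always text and `bytes` is always encoded data. A small sketch of explicit encoding versus an ASCII attempt (variable names are illustrative only):

```python
a = u"bats\u00e0"

# Explicit encoding: the codec is stated, so the result is predictable
utf8_bytes = a.encode("utf-8")

# Encoding with ASCII reproduces the failure that str(a) hits in Python 2
try:
    a.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Decoding with the same codec recovers the original text
restored = utf8_bytes.decode("utf-8")
```

The point is that the codec is chosen by the code rather than left to a default that varies between environments.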

User · Answer

This problem often happens when a Django project is deployed using Apache, because Apache sets the environment variable LANG=C in /etc/sysconfig/httpd. Just open the file and comment out this setting (or change it to your flavor). Alternatively, use the lang option of the WSGIDaemonProcess directive; in that case you will be able to set a different LANG environment variable for different virtualhosts.

User · Answer

This will work:

    >>> print(unicodedata.normalize('NFD', re.sub(..., ..., "bats\xc3\xa0")).encode('ascii', 'ignore'))

Output:

    >>> bats

User · Answer

If you have something like packet_data = "This is data", then do this on the next line, right after initializing packet_data:

    unic = u''
    packet_data = unic

User · Answer

The solution below worked for me. I just added u in front of the string literal (u"String"), which marks the string as unicode:

    result_html = result.to_html(col_space=1, index=False, justify='right')

    text = u"""
    <html>
    <body>
    <p>
    Hello all, <br>
    <br>
    Here's weekly summary report.  Let me know if you have any questions. <br>
    <br>
    Data Summary <br>
    <br>
    <br>
    {0}
    </p>
    <p>Thanks,</p>
    <p>Data Team</p>
    </body></html>
    """.format(result_html)

User · Answer

A subtle problem causing even print to fail is having your environment variables set wrong, e.g. here LC_ALL is set to "C". In Debian they discourage setting it (see the Debian wiki on Locale).

    $ echo $LANG
    en_US.utf8
    $ echo $LC_ALL
    C
    $ python -c "print (u'voil\u00e0')"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
    $ export LC_ALL='en_US.utf8'
    $ python -c "print (u'voil\u00e0')"
    voilà
    $ unset LC_ALL
    $ python -c "print (u'voil\u00e0')"
    voilà
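To check from inside Python which of these settings are actually in effect, something like the following can help. This is only a sketch; `describe_io_encoding` is a made-up helper name, and the values it reports depend entirely on the machine it runs on:

```python
import locale
import os
import sys


def describe_io_encoding():
    """Collect the settings that decide how print() encodes text."""
    return {
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        "preferred_encoding": locale.getpreferredencoding(False),
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
    }


for name, value in sorted(describe_io_encoding().items()):
    print("{0}: {1}".format(name, value))
```

If the stdout encoding comes back as ASCII (often shown as ANSI_X3.4-1968) while LANG looks right, an overriding LC_ALL is a likely suspect.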

User · Answer

I've actually found that in most of my cases, just stripping out those characters is much simpler:

    s = mystring.decode('ascii', 'ignore')

User · Answer

I had this issue trying to output Unicode characters to stdout, but with sys.stdout.write rather than print (so that I could support output to a different file as well).

From BeautifulSoup's own documentation, I solved this with the codecs library:

    import sys
    import codecs

    from bs4 import BeautifulSoup  # assumed import; the original snippet omitted it

    def main(fIn, fOut):
        soup = BeautifulSoup(fIn)
        # Do processing, with data including non-ASCII characters
        fOut.write(unicode(soup))

    if __name__ == '__main__':
        with (sys.stdin) as fIn:  # Don't think we need codecs.getreader here
            with codecs.getwriter('utf-8')(sys.stdout) as fOut:
                main(fIn, fOut)
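What the codecs writer does can be seen in isolation by pointing it at an in-memory byte stream. A minimal sketch, with `io.BytesIO` standing in for a real binary stream and a made-up sample string:

```python
import codecs
import io

# Stand-in for a binary output stream such as sys.stdout.buffer
raw = io.BytesIO()

# The writer encodes every text string to UTF-8 before it reaches `raw`
writer = codecs.getwriter("utf-8")(raw)
writer.write(u"Tel:\xa0020 7946 0000\n")

encoded = raw.getvalue()
```

Because the encoding happens inside the writer, the calling code can pass unicode text everywhere and never call str() on it.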

User · Answer

Here's a rehashing of some other so-called "cop-out" answers. There are situations in which simply throwing away the troublesome characters/strings is a good solution, despite the protests voiced here.

    def safeStr(obj):
        try: return str(obj)
        except UnicodeEncodeError:
            return obj.encode('ascii', 'ignore').decode('ascii')
        except: return ""

Testing it:

    if __name__ == '__main__':
        print safeStr( 1 )
        print safeStr( "test" )
        print u'98\xb0'
        print safeStr( u'98\xb0' )

Results:

    1
    test
    98°
    98

UPDATE: My original answer was written for Python 2. For Python 3:

    def safeStr(obj):
        try: return str(obj).encode('ascii', 'ignore').decode('ascii')
        except: return ""

Note: if you'd prefer to leave a ? indicator where the "unsafe" unicode characters are, specify replace instead of ignore as the error handler in the call to encode.

Suggestion: you might want to name this function toAscii instead. That's a matter of preference.

Finally, here's a more robust PY2/3 version using six, where I opted to use replace, and peppered in some character swaps to replace fancy unicode quotes and apostrophes, which curl left or right, with the simple vertical ones that are part of the ASCII set. You might expand on such swaps yourself:

    from six import PY2, iteritems

    CHAR_SWAP = { u'\u201c': u'"'
                , u'\u201D': u'"'
                , u'\u2018': u"'"
                , u'\u2019': u"'"
    }

    def toAscii( text ) :
        try:
            for k,v in iteritems( CHAR_SWAP ):
                text = text.replace(k,v)
        except: pass
        # bytes() needs the codec name before the error handler
        try: return str( text ) if PY2 else bytes( text, 'ascii', 'replace' ).decode('ascii')
        except UnicodeEncodeError:
            return text.encode('ascii', 'replace').decode('ascii')
        except: return ""

    if __name__ == '__main__':
        print( toAscii( u'testin\u2019' ) )

User · Answer

In the shell:

Find a supported UTF-8 locale with the following command:

    locale -a | grep "UTF-8"

Export it before running the script, e.g.:

    export LC_ALL=$(locale -a | grep UTF-8)

or manually, like:

    export LC_ALL=C.UTF-8

Test it by printing a special character, e.g. ™:

    python -c 'print(u"\u2122");'

The above was tested on Ubuntu.

User · Answer

Many answers here (@agf and @Andbdrew, for example) have already addressed the most immediate aspects of the OP's question.

However, I think there is one subtle but important aspect that has been largely ignored and that matters dearly for everyone who, like me, ended up here while trying to make sense of encodings in Python: Python 2 and Python 3 manage character representation in wildly different ways. I feel like a big chunk of the confusion out there has to do with people reading about encodings in Python without being version aware.

I suggest that anyone interested in understanding the root cause of the OP's problem begin by reading Spolsky's introduction to character representations and Unicode, and then move on to Batchelder on Unicode in Python 2 and Python 3.

User · Answer

I just used the following:

    import unicodedata
    message = unicodedata.normalize("NFKD", message)

Check what the documentation says about it:

> unicodedata.normalize(form, unistr)
>
> Return the normal form form for the Unicode string unistr. Valid values for form are 'NFC', 'NFKC', 'NFD', and 'NFKD'.
>
> The Unicode standard defines various normalization forms of a Unicode string, based on the definition of canonical equivalence and compatibility equivalence. In Unicode, several characters can be expressed in various ways. For example, the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
>
> For each character, there are two normal forms: normal form C and normal form D. Normal form D (NFD) is also known as canonical decomposition, and translates each character into its decomposed form. Normal form C (NFC) first applies a canonical decomposition, then composes pre-combined characters again.
>
> In addition to these two forms, there are two additional normal forms based on compatibility equivalence. In Unicode, certain characters are supported which normally would be unified with other characters. For example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I). However, it is supported in Unicode for compatibility with existing character sets (e.g. gb2312).
>
> The normal form KD (NFKD) will apply the compatibility decomposition, i.e. replace all compatibility characters with their equivalents. The normal form KC (NFKC) first applies the compatibility decomposition, followed by the canonical composition.
>
> Even if two unicode strings are normalized and look the same to a human reader, if one has combining characters and the other doesn't, they may not compare equal.

Solves it for me. Simple and easy.
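One plausible reason this helps with the exact error in the question: the no-break space U+00A0 has a compatibility decomposition to an ordinary space, so NFKD removes it. A short sketch with a made-up sample string:

```python
import unicodedata

# U+00A0 NO-BREAK SPACE has a compatibility decomposition to a plain space
message = u"Tel:\xa0020 7946 0000"
normalized = unicodedata.normalize("NFKD", message)

# Canonical decomposition splits U+00C7 into C plus a combining cedilla
decomposed = unicodedata.normalize("NFD", u"\u00c7")

# After NFKD the sample is pure ASCII; accented letters would still need
# their combining marks handled separately
ascii_bytes = normalized.encode("ascii")
```

Note that normalization alone does not make arbitrary text ASCII-safe: a decomposed "Ç" still contains the non-ASCII combining mark U+0327.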

User · Answer

You need to read the Python Unicode HOWTO. This error is the very first example.

Basically, stop using str to convert from unicode to encoded text / bytes.

Instead, properly use .encode() to encode the string:

    p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()

or work entirely in unicode.
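Working entirely in unicode could look roughly like this. It is a sketch only: `build_agent_info` and the sample values are hypothetical, and the point is that encoding happens once, where bytes are actually needed:

```python
def build_agent_info(agent_contact, agent_telno):
    """Join two text values while staying in unicode."""
    return u" ".join((agent_contact, agent_telno)).strip()


# Hypothetical values; the phone number carries a no-break space (U+00A0)
agent_contact = u"Jane Doe"
agent_telno = u"020\xa07946 0000"

agent_info = build_agent_info(agent_contact, agent_telno)

# Encode once, at the point where bytes are actually required
agent_info_bytes = agent_info.encode("utf-8")
```

Keeping the text as unicode until the output boundary means no intermediate step has to guess a codec.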

User · Answer

Well, I tried everything but it did not help. After googling around I figured out the following, and it helped (Python 2.7 is in use):

    # encoding=utf8
    import sys
    reload(sys)
    sys.setdefaultencoding('utf8')

User · Answer

I just had this problem, and Google led me here, so just to add to the general solutions here, this is what worked for me:

    # 'value' contains the problematic data
    unic = u''
    unic += value
    value = unic

I had this idea after reading Ned's presentation.

I don't claim to fully understand why this works, though. So if anyone can edit this answer or put in a comment to explain, I'll appreciate it.

User · Answer

Add the line below at the beginning of your script (or as the second line):

    # -*- coding: utf-8 -*-

That's the declaration of the Python source code encoding. More info in PEP 263.

User · Answer

The recommended solution did not work for me, and I could live with dumping all non-ASCII characters, so:

    s = s.encode('ascii', errors='ignore')

which left me with something stripped that doesn't throw errors.

User · Answer

I found an elegant workaround to remove the symbols and keep the string as a string, as follows:

    yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')

It's important to note that using the ignore option is dangerous, because it silently drops any unicode (and internationalization) support from the code that uses it, as seen here (convert unicode):

    >>> u'City: Malmö'.encode('ascii', 'ignore').decode('ascii')
    'City: Malm'
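For comparison, the other standard error handlers keep some trace of what was lost. A small sketch in Python 3 syntax (the `results` name is illustrative):

```python
text = u"City: Malm\u00f6"

# Each error handler trades the exception for a different kind of loss
results = {
    handler: text.encode("ascii", handler).decode("ascii")
    for handler in ("ignore", "replace", "backslashreplace", "xmlcharrefreplace")
}

for handler, value in sorted(results.items()):
    print("{0:>18}: {1}".format(handler, value))
```

Which one fits depends on where the text is going: `xmlcharrefreplace`, for instance, only makes sense if the output is HTML or XML.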

User · Answer

Just add .encode('utf-8') to the variable:

    agent_contact.encode('utf-8')

User · Answer

For me, what worked was:

    BeautifulSoup(html_text, from_encoding="utf-8")

Hope this helps someone.

User · Answer

Alas, this works in Python 3 at least...

    # Python 3
    # Sometimes the error is in the environment variables and encoding, so:
    import os
    import locale
    os.environ["PYTHONIOENCODING"] = "utf-8"
    myLocale = locale.setlocale(category=locale.LC_ALL, locale="en_GB.UTF-8")
    ...
    print(myText.encode('utf-8', errors='ignore'))

where errors are ignored in encoding.
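One caveat worth checking: PYTHONIOENCODING is read when the interpreter starts, so setting it inside a running script mainly affects child processes. A sketch that passes it to a child interpreter instead (the variable names are illustrative):

```python
import os
import subprocess
import sys

# Copy the current environment and request UTF-8 for the child's streams
child_env = dict(os.environ, PYTHONIOENCODING="utf-8")

# The child reports the encoding its own sys.stdout ended up with
child_stdout_encoding = subprocess.check_output(
    [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
    env=child_env,
).decode("ascii").strip()
```

In practice the variable is usually exported in the shell or service configuration before Python is launched.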
