[python] How to make the python interpreter correctly handle non-ASCII characters in string operations?

I have a string that looks like so:

6 918 417 712

The clear-cut way to trim this string (as I understand Python) is simple: say the string is in a variable called s, then we get:

s.replace('Â ', '')

That should do the trick. But of course Python complains that the non-ASCII character '\xc2' in file blabla.py is not encoded.

I never quite could understand how to switch between different encodings.
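For reference, the basic model in Python 2 is: byte strings are decoded to unicode, and unicode objects are encoded back to byte strings. A minimal sketch, with illustrative values:

raw = '6\xc2\xa0918'           # a byte string holding UTF-8 data
text = raw.decode('utf-8')     # bytes -> unicode: u'6\xa0918'
back = text.encode('utf-8')    # unicode -> bytes again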

Here's the code; it really is just the same as above, but now it's in context. The file is saved as UTF-8 in Notepad and has the following header:

#!/usr/bin/python2.4
# -*- coding: utf-8 -*-

The code:

import urllib
from BeautifulSoup import BeautifulSoup

f = urllib.urlopen(url)
soup = BeautifulSoup(f)
s = soup.find('div', {'id': 'main_count'})
# printing s here works fine; it shows 6Â 918Â 417Â 712
s.replace('Â ', '')
save_main_count(s)

It gets no further than s.replace...

Tags: python, unicode

The answers:


For what it's worth, my character set was UTF-8, and I had included the classic "# -*- coding: utf-8 -*-" line.

However, I discovered that universal newlines were not being applied when reading this data from a webpage.

My text had two words separated by "\r\n". I was splitting only on "\n", so the stray "\r" was left behind.

Once I looped through the string and saw the actual character codes, I realized the mistake.

So the offending character can also be within the ASCII character set; it's just one you didn't expect.
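A small sketch of the pitfall described above, with illustrative strings:

data = "first\r\nsecond"     # web data often arrives with \r\n line endings
print(data.split('\n'))      # ['first\r', 'second'] -- a stray \r remains
print(data.splitlines())     # ['first', 'second'] -- handles \r\n as well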


>>> unicode_string = u"hello aåbäcö"
>>> unicode_string.encode("ascii", "ignore")
'hello abc'

def removeNonAscii(s): return "".join(filter(lambda x: ord(x)<128, s))

edit: my first impulse is always to use a filter, but the generator expression is more memory efficient (and shorter)...

def removeNonAscii(s): return "".join(i for i in s if ord(i)<128)

Keep in mind that this is guaranteed to work with UTF-8-encoded byte strings, because in UTF-8 every byte of a multi-byte character has its highest bit set to 1.
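For example, on the question's raw UTF-8 bytes (using the removeNonAscii defined above; note that the separators vanish entirely, because both bytes of each no-break space are stripped):

raw = '6\xc2\xa0918\xc2\xa0417\xc2\xa0712'
print(removeNonAscii(raw))   # 6918417712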


Using Regex:

import re

# strip every run of characters outside the ASCII range (0x00-0x7f)
strip_unicode = re.compile(u'[^\x00-\x7f]+')
print strip_unicode.sub('', u'6Â 918Â 417Â 712')   # prints 6 918 417 712

The following code will replace all non-ASCII characters with question marks.

"".join([x if ord(x) < 128 else '?' for x in s])

#!/usr/bin/env python
# -*- coding: utf-8 -*-

s = u"6Â 918Â 417Â 712"
s = s.replace(u"Â", "") 
print s

This will print out 6 918 417 712


s = s.replace(u'Â ', '')          # the u prefix before the string is important

and save your .py file as UTF-8.


I know it's an old thread, but I felt compelled to mention the translate method, which is always a good way to replace all character codes above 128 (or others if necessary).

Usage: str.translate(table[, deletechars])

>>> trans_table = ''.join([chr(i) for i in range(128)] + [' '] * 128)
>>> 'Résultat'.translate(trans_table)
'R  sultat'
>>> '6Â 918Â 417Â 712'.translate(trans_table)
'6  918  417  712'

Starting with Python 2.6, you can also set the table to None, and use deletechars to delete the characters you don't want as in the examples shown in the standard docs at http://docs.python.org/library/stdtypes.html.
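For example, a short sketch of that 2.6+ form on the question's raw bytes:

raw = '6\xc2\xa0918\xc2\xa0417\xc2\xa0712'
print(raw.translate(None, '\xc2\xa0'))   # 6918417712 -- both bytes deleted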

With unicode strings, the translation table is not a 256-character string but a dict with the ord() values of the relevant characters as keys. In any case, getting a proper ASCII string from a unicode string is simple enough, using the method mentioned by truppo above, namely: unicode_string.encode("ascii", "ignore")
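For example, with a unicode string you can delete the no-break space through a dict table (a minimal sketch; None deletes the character):

u = u'6\xa0918\xa0417\xa0712'
print(u.translate({0xa0: None}))   # 6918417712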

As a summary, if for some reason you absolutely need an ASCII string (for instance, when you raise a standard exception with raise Exception, ascii_message), you can use the following function:

# map every byte below 128 to itself, everything above to '?'
trans_table = ''.join([chr(i) for i in range(128)] + ['?'] * 128)

def ascii(s):
    if isinstance(s, unicode):
        return s.encode('ascii', 'replace')   # non-ASCII -> '?'
    else:
        return s.translate(trans_table)
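A short usage sketch of the ascii() helper just defined ('Résultat' encoded as UTF-8 contains two non-ASCII bytes, hence two question marks):

print(ascii(u'Résultat'))         # R?sultat
print(ascii('R\xc3\xa9sultat'))   # R??sultat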

The good thing about translate is that you can actually convert accented characters to the relevant non-accented ASCII characters instead of simply deleting them or replacing them with '?'. This is often useful, for instance for indexing purposes.
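For instance, with a unicode string, translate can map accented characters to their unaccented equivalents through a dict table. A minimal sketch with a deliberately tiny accent map (a real one would cover far more characters):

# assumes the file is saved as UTF-8 with the usual coding declaration
accent_map = {ord(u'é'): u'e', ord(u'è'): u'e', ord(u'à'): u'a'}
print(u'Résultat'.translate(accent_map))   # Resultat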


This is a dirty hack, but may work.

s2 = ""
for i in s:
    if ord(i) < 128:
        s2 += i

Way too late for an answer, but the original string was in UTF-8, and '\xc2\xa0' is the UTF-8 encoding of NO-BREAK SPACE. Simply decode the original string with s.decode('utf-8'). (\xa0 displays as a space when the bytes are incorrectly decoded as Windows-1252 or Latin-1:)

Example (Python 3)

s = b'6\xc2\xa0918\xc2\xa0417\xc2\xa0712'
print(s.decode('latin-1')) # incorrectly decoded
u = s.decode('utf8') # correctly decoded
print(u)
print(u.replace('\N{NO-BREAK SPACE}','_'))
print(u.replace('\xa0','-')) # \xa0 is Unicode for NO-BREAK SPACE

Output

6Â 918Â 417Â 712
6 918 417 712
6_918_417_712
6-918-417-712