How can I remove non-ASCII characters but leave periods and spaces using Python?

Question

I'm working with a .txt file. I want a string of the text from the file with no non-ASCII characters. However, I want to leave spaces and periods. At present, I'm stripping those too. Here's the code:

    def onlyascii(char):
        if ord(char) < 48 or ord(char) > 127: return ''
        else: return char

    def get_my_string(file_path):
        f = open(file_path, 'r')
        data = f.read()
        f.close()
        filtered_data = filter(onlyascii, data)
        filtered_data = filtered_data.lower()
        return filtered_data

How should I modify onlyascii() to leave spaces and periods? I imagine it's not too complicated, but I can't figure it out.

User · Answer

An easy way to convert to a different codec is to use encode() or decode(). In your case, you want to convert to ASCII and ignore all symbols that are not supported. For example, the Swedish letter å is not an ASCII character:

    >>> s = u'Good bye in Swedish is Hej d\xe5'
    >>> s = s.encode('ascii', errors='ignore')
    >>> print s
    Good bye in Swedish is Hej d

Edit:

Python3: str -> bytes -> str

    >>> "hej då".encode("ascii", errors="ignore").decode()
    'hej d'

Python2: unicode -> str -> unicode

    >>> u"hej då".encode("ascii", errors="ignore").decode()
    u'hej d'

Python2: str -> unicode -> str (decode and encode in reverse order)

    >>> "hej d\xe5".decode("ascii", errors="ignore").encode()
    'hej d'
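The Python 3 round trip above can be wrapped in a small helper. This is only a sketch: the function name `strip_non_ascii` and the sample sentence are made up for illustration and do not come from the answer.

```python
def strip_non_ascii(text: str) -> str:
    """Drop every character that cannot be encoded as ASCII (Python 3)."""
    # encode() yields bytes with the unsupported characters skipped;
    # decode() turns those bytes back into a str.
    return text.encode("ascii", errors="ignore").decode("ascii")


if __name__ == "__main__":
    # Hypothetical sample: spaces and periods are ASCII, so they survive.
    print(strip_non_ascii("Hej då. Vi ses!"))
```

Because spaces and periods are themselves ASCII, nothing extra is needed to keep them with this approach.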

User · Answer

If you want printable ASCII characters, you should probably correct your code to:

    if ord(char) < 32 or ord(char) > 126: return ''

This is equivalent to string.printable (see the answer from @jterrace), except that it leaves out the tab, newline and other whitespace control characters ('\t', '\n', '\x0b', '\x0c' and '\r'). Note that it does not correspond to the range in your question.
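The relationship between the 32 to 126 range and string.printable can be checked directly. The variable names below are illustrative only:

```python
import string

# Characters whose code points fall in the inclusive range 32..126.
printable_range = {chr(code) for code in range(32, 127)}

# The whitespace control characters that string.printable also contains.
control_whitespace = set("\t\n\r\x0b\x0c")

# The range matches string.printable once those five characters are removed.
range_matches = printable_range == set(string.printable) - control_whitespace
print(range_matches)
```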

User · Answer

Your question is ambiguous; the first two sentences taken together imply that you believe that space and "period" are non-ASCII characters. This is incorrect. All chars such that ord(char) <= 127 are ASCII characters. For example, your function excludes these characters !"#$%&'()*+,-./ but includes several others, e.g. []{}.

Please step back, think a bit, and edit your question to tell us what you are trying to do, without mentioning the word ASCII, and why you think that chars such that ord(char) >= 128 are ignorable. Also: which version of Python? What is the encoding of your input data?

Please note that your code reads the whole input file as a single string, and your comment ("great solution") to another answer implies that you don't care about newlines in your data. If your file contains two lines like this:

    this is line 1
    this is line 2

the result would be 'this is line 1this is line 2'. Is that what you really want?

A greater solution would include:

1. a better name for the filter function than onlyascii
2. recognition that a filter function merely needs to return a truthy value if the argument is to be retained:

    def filter_func(char):
        return char == '\n' or 32 <= ord(char) <= 126

    # and later:
    filtered_data = filter(filter_func, data).lower()
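A possible Python 3 version of this idea is sketched below. The names `keep_char` and `filter_text`, the `encoding` parameter and its UTF-8 default are assumptions made for the example; the answer itself only specifies the character test.

```python
def keep_char(char: str) -> bool:
    """Return True for newlines and for printable ASCII (codes 32..126)."""
    return char == "\n" or 32 <= ord(char) <= 126


def filter_text(data: str) -> str:
    # In Python 3, filter() returns an iterator, so join it back into a str.
    return "".join(filter(keep_char, data)).lower()


def get_my_string(file_path: str, encoding: str = "utf-8") -> str:
    # Assumption: the input is UTF-8 unless the caller says otherwise.
    with open(file_path, encoding=encoding) as f:
        return filter_text(f.read())
```

Keeping '\n' in the test means the two lines from the example stay on separate lines instead of being fused together.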

User · Answer

You may use the following code to remove non-English letters:

    import re

    text = "123456790 ABC"
    result = re.sub(r'[^\x00-\x7f]', r'', text)
    print(result)

This will return:

    123456790 ABC
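To see the pattern actually remove something, here is a small sketch with a made-up sample string. The name `NON_ASCII` and the sample text are illustrative and not part of the answer.

```python
import re

# Matches any single character outside the ASCII range 0x00..0x7f.
NON_ASCII = re.compile(r"[^\x00-\x7f]")

# Hypothetical input containing three non-ASCII characters.
sample = "Smörgåsbord costs 12.50 €."
cleaned = NON_ASCII.sub("", sample)
print(cleaned)
```

Spaces, digits and periods are all inside the ASCII range, so the pattern leaves them alone.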

User · Answer

You can filter out all characters from the string that are not printable using string.printable, like this:

    >>> s = "some\x00string. with\x15 funny characters"
    >>> import string
    >>> printable = set(string.printable)
    >>> filter(lambda x: x in printable, s)
    'somestring. with funny characters'

string.printable on my machine contains:

    0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
    !"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c

EDIT: On Python 3, filter will return an iterable. The correct way to obtain a string back would be:

    ''.join(filter(lambda x: x in printable, s))

User · Answer

According to @artfulrobot, this should be faster than filter and lambda:

    import re
    re.sub(r'[^\x00-\x7f]', r'', your_non_ascii_string)

See more examples in the question "Replace non-ASCII characters with a single space".
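One way to check the speed claim on your own machine is a quick timeit comparison. This is a rough sketch: the function names, the sample text and the repeat count are arbitrary choices for illustration, and the timings will vary by Python version and input.

```python
import re
import timeit

NON_ASCII = re.compile(r"[^\x00-\x7f]")

# Hypothetical test data: a short phrase repeated many times.
sample = "Hej d\xe5. " * 1000


def with_regex(text: str) -> str:
    return NON_ASCII.sub("", text)


def with_filter(text: str) -> str:
    return "".join(filter(lambda ch: ord(ch) < 128, text))


if __name__ == "__main__":
    # Both approaches should agree before their speed is compared.
    assert with_regex(sample) == with_filter(sample)
    print("regex :", timeit.timeit(lambda: with_regex(sample), number=100))
    print("filter:", timeit.timeit(lambda: with_filter(sample), number=100))
```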

User · Answer

Working my way through Fluent Python (Ramalho), highly recommended. List comprehension one-ish-liners inspired by Chapter 2:

    onlyascii = ''.join([s for s in data if ord(s) < 127])
    onlymatch = ''.join([s for s in data if s in
                  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'])
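The whitelist variant can be adapted to the original question by adding the space and the period to the allowed set. The following is a sketch under that assumption; `ALLOWED`, `keep_allowed` and the choice to also keep digits are illustrative, not taken from the answer.

```python
import string

# Assumed whitelist: ASCII letters and digits, plus the space and the period.
ALLOWED = frozenset(string.ascii_letters + string.digits + " .")


def keep_allowed(data: str) -> str:
    """Keep only whitelisted characters, using a list comprehension."""
    return "".join([s for s in data if s in ALLOWED])


if __name__ == "__main__":
    print(keep_allowed("Hej då, world. 42!"))
```

A frozenset makes each membership test a constant-time lookup, which matters more as the whitelist or the input grows.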
