I used to run
$s =~ s/[^[:print:]]//g;
in Perl to get rid of non-printable characters.
In Python there are no POSIX regex classes, and I can't write [:print:] and have it mean what I want. I know of no way in Python to detect whether a character is printable or not.
What would you do?
EDIT: It has to support Unicode characters as well. The string.printable approach will happily strip them out of the output. curses.ascii.isprint will return False for any Unicode character.
This question is related to: python, string, non-printable
Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See Unicode Character Database for descriptions of the categories.
import unicodedata, re, itertools, sys
all_chars = (chr(i) for i in range(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(chr, itertools.chain(range(0x00,0x20), range(0x7f,0xa0))))
control_char_re = re.compile('[%s]' % re.escape(control_chars))
def remove_control_chars(s):
    return control_char_re.sub('', s)
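For example, a quick sanity check of the function above (illustrative input; non-control Unicode characters such as accented letters pass through untouched):
assert remove_control_chars('Café\x00\x1f!') == 'Café!'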
For Python 2:
import unicodedata, re, sys
all_chars = (unichr(i) for i in xrange(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(unichr, range(0x00,0x20) + range(0x7f,0xa0)))
control_char_re = re.compile('[%s]' % re.escape(control_chars))
def remove_control_chars(s):
    return control_char_re.sub('', s)
For some use cases, additional categories (e.g. everything from the control group) might be preferable, although this can slow down processing and increase memory usage significantly; a sketch with an extended category set follows the counts below. Number of characters per category:
Cc (control): 65
Cf (format): 161
Cs (surrogate): 2048
Co (private-use): 137468
Cn (unassigned): 836601
Edit: Adding suggestions from the comments.
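As an illustrative sketch of that idea (the choice of categories and the variable names here are my own, not from the answer above), adding 'Cf' to the set strips format characters as well as control characters:
import re, sys, unicodedata

categories = {'Cc', 'Cf'}  # strip control and format characters (illustrative choice)
chars_to_strip = ''.join(
    c for c in map(chr, range(sys.maxunicode + 1))
    if unicodedata.category(c) in categories
)
strip_re = re.compile('[%s]' % re.escape(chars_to_strip))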
As far as I know, the most pythonic/efficient method would be:
import string
filtered_string = filter(lambda x: x in string.printable, myStr)
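Note that in Python 3, filter returns an iterator rather than a string, so you need to join the result back into a string; also, as the question's edit points out, string.printable is ASCII-only, so non-ASCII characters get dropped. A minimal sketch (myStr here is just a sample input):
import string

myStr = 'Hello\x00 World\x1f'
filtered_string = ''.join(filter(lambda x: x in string.printable, myStr))
# filtered_string == 'Hello World'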
You could try setting up a filter using the unicodedata.category()
function:
import unicodedata
printable = {'Lu', 'Ll'}
def filter_non_printable(str):
    return ''.join(c for c in str if unicodedata.category(c) in printable)
See Table 4-9 on page 175 in the Unicode database character properties for the available categories.
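For illustration: with only Lu and Ll whitelisted, the function above strips everything that is not a plain letter, including digits, spaces, and punctuation:
filter_non_printable('Hello, World 123!')  # -> 'HelloWorld'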
In Python 3:
def filter_nonprintable(text):
    import itertools
    # Use characters of the control category
    nonprintable = itertools.chain(range(0x00, 0x20), range(0x7f, 0xa0))
    # Use translate to remove all non-printable characters
    return text.translate({character: None for character in nonprintable})
See this StackOverflow post on removing punctuation for how .translate() compares to regex & .replace()
The ranges can also be generated from the Unicode character database categories, as shown by @Ants Aasma:
nonprintable = (ord(c) for c in (chr(i) for i in range(sys.maxunicode)) if unicodedata.category(c) == 'Cc')
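A quick usage check of filter_nonprintable above (illustrative input; accented characters are untouched because only the control ranges are removed):
assert filter_nonprintable('Déjà\x00 vu\x1b') == 'Déjà vu'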
The following will work with Unicode input and is rather fast...
import sys
# build a table mapping all non-printable characters to None
NOPRINT_TRANS_TABLE = {
    i: None for i in range(0, sys.maxunicode + 1) if not chr(i).isprintable()
}
def make_printable(s):
    """Replace non-printable characters in a string."""
    # the translate method on str removes characters
    # that map to None from the string
    return s.translate(NOPRINT_TRANS_TABLE)
assert make_printable('Café') == 'Café'
assert make_printable('\x00\x11Hello') == 'Hello'
assert make_printable('') == ''
My own testing suggests this approach is faster than functions that iterate over the string and return a result using str.join.
Yet another option in Python 3:
re.sub(f'[^{re.escape(string.printable)}]', '', my_string)
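A runnable sketch of the same idea (my_string is just a sample input; note that string.printable covers ASCII only, so non-ASCII characters such as 'é' are removed as well, which is exactly what the question's edit warns about):
import re, string

my_string = 'abc\x00déf\x1b'
cleaned = re.sub(f'[^{re.escape(string.printable)}]', '', my_string)
# cleaned == 'abcdf'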
This function uses a generator expression and str.join, so it runs in linear time instead of O(n^2):
from curses.ascii import isprint
def printable(input):
    return ''.join(char for char in input if isprint(char))
The best I've come up with so far is (thanks to the Pythonizers above):
def filter_non_printable(str):
    return ''.join([c for c in str if ord(c) > 31 or ord(c) == 9])
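For instance (illustrative input; the tab survives because ord('\t') == 9):
filter_non_printable('\x00Hi\tthere\x1b')  # -> 'Hi\tthere'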
This is the only way I've found that works with Unicode characters/strings.
Any better options?
The one below performs faster than the others above; take a look:
import string
''.join([x if x in string.printable else '' for x in Str])
"In Python there's no POSIX regex classes"
There are when using the regex library: https://pypi.org/project/regex/
It is well maintained and supports Unicode regexes, POSIX regexes, and many more. The usage (method signatures) is very similar to Python's re.
From the documentation:
[[:alpha:]]; [[:^alpha:]]
POSIX character classes are supported. These are normally treated as an alternative form of \p{...}.
(I'm not affiliated, just a user.)
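For the question's original use case, a minimal sketch with that package (assuming it is installed via pip install regex; the function name here is just illustrative):
import regex

def remove_nonprintable(s):
    # same idea as the Perl s/[^[:print:]]//g from the question
    return regex.sub(r'[^[:print:]]', '', s)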
Based on @Ber's answer, I suggest removing only the characters in the Unicode "Other" (C*) categories, as defined in the Unicode character database:
import unicodedata
def filter_non_printable(s):
    return ''.join(c for c in s if not unicodedata.category(c).startswith('C'))
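For example, this strips both control (Cc) and format (Cf) characters; the sample input is just an illustration:
filter_non_printable('Hi\x00 there\u200e')  # NUL is Cc, LEFT-TO-RIGHT MARK is Cf -> 'Hi there'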
Adapted from answers by Ants Aasma and shawnrad:
nonprintable = set(map(chr, list(range(0,32)) + list(range(127,160))))
ord_dict = {ord(character):None for character in nonprintable}
def filter_nonprintable(text):
    return text.translate(ord_dict)
# usage
str = "this is my string"
str = filter_nonprintable(str)
print(str)
Tested on Python 3.7.7.
To remove whitespace (tabs and newlines in this example):
import re
t = """
\n\t<p> </p>\n\t<p> </p>\n\t<p> </p>\n\t<p> </p>\n\t<p>
"""
pat = re.compile(r'[\t\n]')
print(pat.sub("", t))