[python] Stripping non printable characters from a string in python

I use to run

$s =~ s/[^[:print:]]//g;

on Perl to get rid of non printable characters.

In Python there's no POSIX regex classes, and I can't write [:print:] having it mean what I want. I know of no way in Python to detect if a character is printable or not.

What would you do?

EDIT: It has to support Unicode characters as well. The string.printable way will happily strip them out of the output. curses.ascii.isprint will return false for any unicode character.

This question is related to python string non-printable

The answer is


The best I've come up with now is (thanks to the python-izers above)

def filter_non_printable(str):
  return ''.join([c for c in str if ord(c) > 31 or ord(c) == 9])

This is the only way I've found out that works with Unicode characters/strings

Any better options?


Adapted from answers by Ants Aasma and shawnrad:

nonprintable = set(map(chr, list(range(0,32)) + list(range(127,160))))
ord_dict = {ord(character):None for character in nonprintable}
def filter_nonprintable(text):
    return text.translate(ord_dict)

#use
str = "this is my string"
str = filter_nonprintable(str)
print(str)

tested on Python 3.7.7


You could try setting up a filter using the unicodedata.category() function:

import unicodedata
printable = {'Lu', 'Ll'}
def filter_non_printable(str):
  return ''.join(c for c in str if unicodedata.category(c) in printable)

See Table 4-9 on page 175 in the Unicode database character properties for the available categories


To remove 'whitespace',

import re
t = """
\n\t<p>&nbsp;</p>\n\t<p>&nbsp;</p>\n\t<p>&nbsp;</p>\n\t<p>&nbsp;</p>\n\t<p>
"""
pat = re.compile(r'[\t\n]')
print(pat.sub("", t))

The following will work with Unicode input and is rather fast...

import sys

# build a table mapping all non-printable characters to None
NOPRINT_TRANS_TABLE = {
    i: None for i in range(0, sys.maxunicode + 1) if not chr(i).isprintable()
}

def make_printable(s):
    """Replace non-printable characters in a string."""

    # the translate method on str removes characters
    # that map to None from the string
    return s.translate(NOPRINT_TRANS_TABLE)


assert make_printable('Café') == 'Café'
assert make_printable('\x00\x11Hello') == 'Hello'
assert make_printable('') == ''

My own testing suggests this approach is faster than functions that iterate over the string and return a result using str.join.


You could try setting up a filter using the unicodedata.category() function:

import unicodedata
printable = {'Lu', 'Ll'}
def filter_non_printable(str):
  return ''.join(c for c in str if unicodedata.category(c) in printable)

See Table 4-9 on page 175 in the Unicode database character properties for the available categories


As far as I know, the most pythonic/efficient method would be:

import string

filtered_string = filter(lambda x: x in string.printable, myStr)

To remove 'whitespace',

import re
t = """
\n\t<p>&nbsp;</p>\n\t<p>&nbsp;</p>\n\t<p>&nbsp;</p>\n\t<p>&nbsp;</p>\n\t<p>
"""
pat = re.compile(r'[\t\n]')
print(pat.sub("", t))

The following will work with Unicode input and is rather fast...

import sys

# build a table mapping all non-printable characters to None
NOPRINT_TRANS_TABLE = {
    i: None for i in range(0, sys.maxunicode + 1) if not chr(i).isprintable()
}

def make_printable(s):
    """Replace non-printable characters in a string."""

    # the translate method on str removes characters
    # that map to None from the string
    return s.translate(NOPRINT_TRANS_TABLE)


assert make_printable('Café') == 'Café'
assert make_printable('\x00\x11Hello') == 'Hello'
assert make_printable('') == ''

My own testing suggests this approach is faster than functions that iterate over the string and return a result using str.join.


In Python there's no POSIX regex classes

There are when using the regex library: https://pypi.org/project/regex/

It is well maintained and supports Unicode regex, Posix regex and many more. The usage (method signatures) is very similar to Python's re.

From the documentation:

[[:alpha:]]; [[:^alpha:]]

POSIX character classes are supported. These are normally treated as an alternative form of \p{...}.

(I'm not affiliated, just a user.)


The best I've come up with now is (thanks to the python-izers above)

def filter_non_printable(str):
  return ''.join([c for c in str if ord(c) > 31 or ord(c) == 9])

This is the only way I've found out that works with Unicode characters/strings

Any better options?


Based on @Ber's answer, I suggest removing only control characters as defined in the Unicode character database categories:

import unicodedata
def filter_non_printable(s):
    return ''.join(c for c in s if not unicodedata.category(c).startswith('C'))

You could try setting up a filter using the unicodedata.category() function:

import unicodedata
printable = {'Lu', 'Ll'}
def filter_non_printable(str):
  return ''.join(c for c in str if unicodedata.category(c) in printable)

See Table 4-9 on page 175 in the Unicode database character properties for the available categories


Yet another option in python 3:

re.sub(f'[^{re.escape(string.printable)}]', '', my_string)

As far as I know, the most pythonic/efficient method would be:

import string

filtered_string = filter(lambda x: x in string.printable, myStr)

The best I've come up with now is (thanks to the python-izers above)

def filter_non_printable(str):
  return ''.join([c for c in str if ord(c) > 31 or ord(c) == 9])

This is the only way I've found out that works with Unicode characters/strings

Any better options?


In Python 3,

def filter_nonprintable(text):
    import itertools
    # Use characters of control category
    nonprintable = itertools.chain(range(0x00,0x20),range(0x7f,0xa0))
    # Use translate to remove all non-printable characters
    return text.translate({character:None for character in nonprintable})

See this StackOverflow post on removing punctuation for how .translate() compares to regex & .replace()

The ranges can be generated via nonprintable = (ord(c) for c in (chr(i) for i in range(sys.maxunicode)) if unicodedata.category(c)=='Cc') using the Unicode character database categories as shown by @Ants Aasma.


The best I've come up with now is (thanks to the python-izers above)

def filter_non_printable(str):
  return ''.join([c for c in str if ord(c) > 31 or ord(c) == 9])

This is the only way I've found out that works with Unicode characters/strings

Any better options?


Yet another option in python 3:

re.sub(f'[^{re.escape(string.printable)}]', '', my_string)

As far as I know, the most pythonic/efficient method would be:

import string

filtered_string = filter(lambda x: x in string.printable, myStr)

The one below performs faster than the others above. Take a look

''.join([x if x in string.printable else '' for x in Str])

In Python there's no POSIX regex classes

There are when using the regex library: https://pypi.org/project/regex/

It is well maintained and supports Unicode regex, Posix regex and many more. The usage (method signatures) is very similar to Python's re.

From the documentation:

[[:alpha:]]; [[:^alpha:]]

POSIX character classes are supported. These are normally treated as an alternative form of \p{...}.

(I'm not affiliated, just a user.)


This function uses list comprehensions and str.join, so it runs in linear time instead of O(n^2):

from curses.ascii import isprint

def printable(input):
    return ''.join(char for char in input if isprint(char))

Based on @Ber's answer, I suggest removing only control characters as defined in the Unicode character database categories:

import unicodedata
def filter_non_printable(s):
    return ''.join(c for c in s if not unicodedata.category(c).startswith('C'))

This function uses list comprehensions and str.join, so it runs in linear time instead of O(n^2):

from curses.ascii import isprint

def printable(input):
    return ''.join(char for char in input if isprint(char))

You could try setting up a filter using the unicodedata.category() function:

import unicodedata
printable = {'Lu', 'Ll'}
def filter_non_printable(str):
  return ''.join(c for c in str if unicodedata.category(c) in printable)

See Table 4-9 on page 175 in the Unicode database character properties for the available categories


This function uses list comprehensions and str.join, so it runs in linear time instead of O(n^2):

from curses.ascii import isprint

def printable(input):
    return ''.join(char for char in input if isprint(char))

As far as I know, the most pythonic/efficient method would be:

import string

filtered_string = filter(lambda x: x in string.printable, myStr)

In Python 3,

def filter_nonprintable(text):
    import itertools
    # Use characters of control category
    nonprintable = itertools.chain(range(0x00,0x20),range(0x7f,0xa0))
    # Use translate to remove all non-printable characters
    return text.translate({character:None for character in nonprintable})

See this StackOverflow post on removing punctuation for how .translate() compares to regex & .replace()

The ranges can be generated via nonprintable = (ord(c) for c in (chr(i) for i in range(sys.maxunicode)) if unicodedata.category(c)=='Cc') using the Unicode character database categories as shown by @Ants Aasma.


Adapted from answers by Ants Aasma and shawnrad:

nonprintable = set(map(chr, list(range(0,32)) + list(range(127,160))))
ord_dict = {ord(character):None for character in nonprintable}
def filter_nonprintable(text):
    return text.translate(ord_dict)

#use
str = "this is my string"
str = filter_nonprintable(str)
print(str)

tested on Python 3.7.7


This function uses list comprehensions and str.join, so it runs in linear time instead of O(n^2):

from curses.ascii import isprint

def printable(input):
    return ''.join(char for char in input if isprint(char))