[python] Stripping non printable characters from a string in python

I use to run

$s =~ s/[^[:print:]]//g;

on Perl to get rid of non printable characters.

In Python there's no POSIX regex classes, and I can't write [:print:] having it mean what I want. I know of no way in Python to detect if a character is printable or not.

What would you do?

EDIT: It has to support Unicode characters as well. The string.printable way will happily strip them out of the output. curses.ascii.isprint will return false for any unicode character.

This question is related to python string non-printable

The answer is

Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See Unicode Character Database for descriptions of the categories.

import unicodedata, re, itertools, sys

all_chars = (chr(i) for i in range(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(chr, itertools.chain(range(0x00,0x20), range(0x7f,0xa0))))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

For Python2

import unicodedata, re, sys

all_chars = (unichr(i) for i in xrange(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(unichr, range(0x00,0x20) + range(0x7f,0xa0)))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

For some use-cases, additional categories (e.g. all from the control group might be preferable, although this might slow down the processing time and increase memory usage significantly. Number of characters per category:

  • Cc (control): 65
  • Cf (format): 161
  • Cs (surrogate): 2048
  • Co (private-use): 137468
  • Cn (unassigned): 836601

Edit Adding suggestions from the comments.

As far as I know, the most pythonic/efficient method would be:

import string

filtered_string = filter(lambda x: x in string.printable, myStr)

You could try setting up a filter using the unicodedata.category() function:

import unicodedata
printable = {'Lu', 'Ll'}
def filter_non_printable(str):
  return ''.join(c for c in str if unicodedata.category(c) in printable)

See Table 4-9 on page 175 in the Unicode database character properties for the available categories

In Python 3,

def filter_nonprintable(text):
    import itertools
    # Use characters of control category
    nonprintable = itertools.chain(range(0x00,0x20),range(0x7f,0xa0))
    # Use translate to remove all non-printable characters
    return text.translate({character:None for character in nonprintable})

See this StackOverflow post on removing punctuation for how .translate() compares to regex & .replace()

The ranges can be generated via nonprintable = (ord(c) for c in (chr(i) for i in range(sys.maxunicode)) if unicodedata.category(c)=='Cc') using the Unicode character database categories as shown by @Ants Aasma.

The following will work with Unicode input and is rather fast...

import sys

# build a table mapping all non-printable characters to None
    i: None for i in range(0, sys.maxunicode + 1) if not chr(i).isprintable()

def make_printable(s):
    """Replace non-printable characters in a string."""

    # the translate method on str removes characters
    # that map to None from the string
    return s.translate(NOPRINT_TRANS_TABLE)

assert make_printable('Café') == 'Café'
assert make_printable('\x00\x11Hello') == 'Hello'
assert make_printable('') == ''

My own testing suggests this approach is faster than functions that iterate over the string and return a result using str.join.

Yet another option in python 3:

re.sub(f'[^{re.escape(string.printable)}]', '', my_string)

This function uses list comprehensions and str.join, so it runs in linear time instead of O(n^2):

from curses.ascii import isprint

def printable(input):
    return ''.join(char for char in input if isprint(char))

The best I've come up with now is (thanks to the python-izers above)

def filter_non_printable(str):
  return ''.join([c for c in str if ord(c) > 31 or ord(c) == 9])

This is the only way I've found out that works with Unicode characters/strings

Any better options?

The one below performs faster than the others above. Take a look

''.join([x if x in string.printable else '' for x in Str])

In Python there's no POSIX regex classes

There are when using the regex library: https://pypi.org/project/regex/

It is well maintained and supports Unicode regex, Posix regex and many more. The usage (method signatures) is very similar to Python's re.

From the documentation:

[[:alpha:]]; [[:^alpha:]]

POSIX character classes are supported. These are normally treated as an alternative form of \p{...}.

(I'm not affiliated, just a user.)

Based on @Ber's answer, I suggest removing only control characters as defined in the Unicode character database categories:

import unicodedata
def filter_non_printable(s):
    return ''.join(c for c in s if not unicodedata.category(c).startswith('C'))

Adapted from answers by Ants Aasma and shawnrad:

nonprintable = set(map(chr, list(range(0,32)) + list(range(127,160))))
ord_dict = {ord(character):None for character in nonprintable}
def filter_nonprintable(text):
    return text.translate(ord_dict)

str = "this is my string"
str = filter_nonprintable(str)

tested on Python 3.7.7

To remove 'whitespace',

import re
t = """
pat = re.compile(r'[\t\n]')
print(pat.sub("", t))

Questions with python tag:

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation Upgrade to python 3.8 using conda Unable to allocate array with shape and data type How to fix error "ERROR: Command errored out with exit status 1: python." when trying to install django-heroku using pip How to prevent Google Colab from disconnecting? "UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure." when plotting figure with pyplot on Pycharm How to fix 'Object arrays cannot be loaded when allow_pickle=False' for imdb.load_data() function? "E: Unable to locate package python-pip" on Ubuntu 18.04 Tensorflow 2.0 - AttributeError: module 'tensorflow' has no attribute 'Session' Jupyter Notebook not saving: '_xsrf' argument missing from post How to Install pip for python 3.7 on Ubuntu 18? Python: 'ModuleNotFoundError' when trying to import module from imported package OpenCV TypeError: Expected cv::UMat for argument 'src' - What is this? Requests (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available.") Error in PyCharm requesting website How to setup virtual environment for Python in VS Code? Pylint "unresolved import" error in Visual Studio Code Pandas Merging 101 Numpy, multiply array with scalar What is the meaning of "Failed building wheel for X" in pip install? Selenium: WebDriverException:Chrome failed to start: crashed as google-chrome is no longer running so ChromeDriver is assuming that Chrome has crashed Could not install packages due to an EnvironmentError: [Errno 13] OpenCV !_src.empty() in function 'cvtColor' error ConvergenceWarning: Liblinear failed to converge, increase the number of iterations How to downgrade python from 3.7 to 3.6 I can't install pyaudio on Windows? How to solve "error: Microsoft Visual C++ 14.0 is required."? Iterating over arrays in Python 3 How do I install opencv using pip? How do I install Python packages in Google's Colab? How do I use TensorFlow GPU? How to upgrade Python version to 3.7? How to resolve TypeError: can only concatenate str (not "int") to str How can I install a previous version of Python 3 in macOS using homebrew? Flask at first run: Do not use the development server in a production environment TypeError: only integer scalar arrays can be converted to a scalar index with 1D numpy indices array What is the difference between Jupyter Notebook and JupyterLab? Pytesseract : "TesseractNotFound Error: tesseract is not installed or it's not in your path", how do I fix this? Could not install packages due to a "Environment error :[error 13]: permission denied : 'usr/local/bin/f2py'" How do I resolve a TesseractNotFoundError? Trying to merge 2 dataframes but get ValueError Authentication plugin 'caching_sha2_password' is not supported Python Pandas User Warning: Sorting because non-concatenation axis is not aligned

Questions with string tag:

How to split a string in two and store it in a field String method cannot be found in a main class method Kotlin - How to correctly concatenate a String Replacing a character from a certain index Remove quotes from String in Python Detect whether a Python string is a number or a letter How does String substring work in Swift How does String.Index work in Swift swift 3.0 Data to String? How to parse JSON string in Typescript How to change the datetime format in pandas How to write to a CSV line by line? convert string to number node.js "error: assignment to expression with array type error" when I assign a struct field (C) Remove 'b' character do in front of a string literal in Python 3 Ruby: How to convert a string to boolean What does ${} (dollar sign and curly braces) mean in a string in Javascript? How do I make a new line in swift converting json to string in python PHP - remove all non-numeric characters from a string C# - How to convert string to char? How can I remove the last character of a string in python? Converting std::__cxx11::string to std::string How to convert string to date to string in Swift iOS? Convert time.Time to string TypeError: a bytes-like object is required, not 'str' when writing to a file in Python3 How can I capitalize the first letter of each word in a string using JavaScript? Best way to verify string is empty or null Hive cast string to date dd-MM-yyyy Check for special characters in string How to convert any Object to String? Print "\n" or newline characters as part of the output on terminal Set the maximum character length of a UITextField in Swift How do I convert a Python 3 byte-string variable into a regular string? What does $ mean before a string? Delete the last two characters of the String Splitting a string into separate variables Figure out size of UILabel based on String in Swift Matching strings with wildcard How do I concatenate strings? Print very long string completely in pandas dataframe Check string for nil & empty Convert float to string with precision & number of decimal digits specified? How do I print my Java object without getting "SomeType@2f92e0f4"? enum to string in modern C++11 / C++14 / C++17 and future C++20 How should I remove all the leading spaces from a string? - swift Convert array to JSON string in swift Swift extract regex matches Convert a file path to Uri in Android How would I get everything before a : in a string Python

Questions with non-printable tag:

Stripping non printable characters from a string in python