A regular expression can be used to offer more control over the whitespace characters that are combined.
To match unicode whitespace:
import re
_RE_COMBINE_WHITESPACE = re.compile(r"\s+")
my_str = _RE_COMBINE_WHITESPACE.sub(" ", my_str).strip()
To match ASCII whitespace only:
import re
_RE_COMBINE_WHITESPACE = re.compile(r"(?a:\s+)")
_RE_STRIP_WHITESPACE = re.compile(r"(?a:^\s+|\s+$)")
my_str = _RE_COMBINE_WHITESPACE.sub(" ", my_str)
my_str = _RE_STRIP_WHITESPACE.sub("", my_str)
Matching only ASCII whitespace is sometimes essential for keeping control characters such as x0b, x0c, x1c, x1d, x1e, x1f.
About \s
:
For Unicode (str) patterns: Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.
About re.ASCII
:
Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. Corresponds to the inline flag (?a).
strip()
will remote any leading and trailing whitespaces.