This will remove all non-alphanumeric characters except spaces.
string = "Special $#! characters spaces 888323"
''.join(e for e in string if (e.isalnum() or e.isspace()))
Special characters spaces 888323
s = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", s)
I think just filter(str.isalnum, string)
works
In [20]: filter(str.isalnum, 'string with special chars like !,#$% etcs.')
Out[20]: 'stringwithspecialcharslikeetcs'
In Python3, filter( )
function would return an itertable object (instead of string unlike in above). One has to join back to get a string from itertable:
''.join(filter(str.isalnum, string))
or to pass list
in join use (not sure but can be fast a bit)
''.join([*filter(str.isalnum, string)])
note: unpacking in [*args]
valid from Python >= 3.5
Shorter way :
import re
cleanString = re.sub('\W+','', string )
If you want spaces between words and numbers substitute '' with ' '
For other languages like German, Spanish, Danish, French etc that contain special characters (like German "Umlaute" as ü
, ä
, ö
) simply add these to the regex search string:
Example for German:
re.sub('[^A-ZÜÖÄa-z0-9]+', '', mystring)
The most generic approach is using the 'categories' of the unicodedata table which classifies every single character. E.g. the following code filters only printable characters based on their category:
import unicodedata
# strip of crap characters (based on the Unicode database
# categorization:
# http://www.sql-und-xml.de/unicode-database/#kategorien
PRINTABLE = set(('Lu', 'Ll', 'Nd', 'Zs'))
def filter_non_printable(s):
result = []
ws_last = False
for c in s:
c = unicodedata.category(c) in PRINTABLE and c or u'#'
result.append(c)
return u''.join(result).replace(u'#', u' ')
Look at the given URL above for all related categories. You also can of course filter by the punctuation categories.
#!/usr/bin/python
import re
strs = "how much for the maple syrup? $20.99? That's ricidulous!!!"
print strs
nstr = re.sub(r'[?|$|.|!]',r'',strs)
print nstr
nestr = re.sub(r'[^a-zA-Z0-9 ]',r'',nstr)
print nestr
you can add more special character and that will be replaced by '' means nothing i.e they will be removed.
import re
my_string = """Strings are amongst the most popular data types in Python. We can create the strings by enclosing characters in quotes. Python treats single quotes the
same as double quotes."""
# if we need to count the word python that ends with or without ',' or '.' at end
count = 0
for i in text:
if i.endswith("."):
text[count] = re.sub("^([a-z]+)(.)?$", r"\1", i)
count += 1
print("The count of Python : ", text.count("python"))
string.punctuation contains following characters:
'!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~'
You can use translate and maketrans functions to map punctuations to empty values (replace)
import string
'This, is. A test!'.translate(str.maketrans('', '', string.punctuation))
Output:
'This is A test'
Differently than everyone else did using regex, I would try to exclude every character that is not what I want, instead of enumerating explicitly what I don't want.
For example, if I want only characters from 'a to z' (upper and lower case) and numbers, I would exclude everything else:
import re
s = re.sub(r"[^a-zA-Z0-9]","",s)
This means "substitute every character that is not a number, or a character in the range 'a to z' or 'A to Z' with an empty string".
In fact, if you insert the special character ^
at the first place of your regex, you will get the negation.
Extra tip: if you also need to lowercase the result, you can make the regex even faster and easier, as long as you won't find any uppercase now.
import re
s = re.sub(r"[^a-z0-9]","",s.lower())
Use translate:
import string
def clean(instr):
return instr.translate(None, string.punctuation + ' ')
Caveat: Only works on ascii strings.
Assuming you want to use a regex and you want/need Unicode-cognisant 2.x code that is 2to3-ready:
>>> import re
>>> rx = re.compile(u'[\W_]+', re.UNICODE)
>>> data = u''.join(unichr(i) for i in range(256))
>>> rx.sub(u'', data)
u'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb2 [snip] \xfe\xff'
>>>
After seeing this, I was interested in expanding on the provided answers by finding out which executes in the least amount of time, so I went through and checked some of the proposed answers with timeit
against two of the example strings:
string1 = 'Special $#! characters spaces 888323'
string2 = 'how much for the maple syrup? $20.99? That s ricidulous!!!'
'.join(e for e in string if e.isalnum())
string1
- Result: 10.7061979771string2
- Result: 7.78372597694import re
re.sub('[^A-Za-z0-9]+', '', string)
string1
- Result: 7.10785102844string2
- Result: 4.12814903259import re
re.sub('\W+','', string)
string1
- Result: 3.11899876595string2
- Result: 2.78014397621The above results are a product of the lowest returned result from an average of: repeat(3, 2000000)
Example 3 can be 3x faster than Example 1.
import re
abc = "askhnl#$%askdjalsdk"
ddd = abc.replace("#$%","")
print (ddd)
and you shall see your result as
'askhnlaskdjalsdk
Here is a regex to match a string of characters that are not a letters or numbers:
[^A-Za-z0-9]+
Here is the Python command to do a regex substitution:
re.sub('[^A-Za-z0-9]+', '', mystring)
Source: Stackoverflow.com