It seems like there should be a simpler way than:
import string
s = "string. With. Punctuation?" # Sample string
out = s.translate(string.maketrans("",""), string.punctuation)
Is there?
This question is related to
python
string
punctuation
Here's one other easy way to do it using RegEx
import re
punct = re.compile(r'(\w+)')
sentence = 'This ! is : a # sample $ sentence.' # Text with punctuation
tokenized = [m.group() for m in punct.finditer(sentence)]
sentence = ' '.join(tokenized)
print(sentence)
'This is a sample sentence'
This might not be the best solution however this is how I did it.
import string
f = lambda x: ''.join([i for i in x if i not in string.punctuation])
>>> s = "string. With. Punctuation?"
>>> s = re.sub(r'[^\w\s]','',s)
>>> re.split(r'\s*', s)
['string', 'With', 'Punctuation']
string.punctuation
misses loads of punctuation marks that are commonly used in the real world. How about a solution that works for non-ASCII punctuation?
import regex
s = u"string. With. Some·Really Weird?Non?ASCII? ?(Punctuation)??"
remove = regex.compile(ur'[\p{C}|\p{M}|\p{P}|\p{S}|\p{Z}]+', regex.UNICODE)
remove.sub(u" ", s).strip()
Personally, I believe this is the best way to remove punctuation from a string in Python because:
\{S}
if you want to remove punctuation, but keep symbols like $
.\{Pd}
will only remove dashes.This uses Unicode character properties, which you can read more about on Wikipedia.
import re
s = "string. With. Punctuation?" # Sample string
out = re.sub(r'[^a-zA-Z0-9\s]', '', s)
Regular expressions are simple enough, if you know them.
import re
s = "string. With. Punctuation?"
s = re.sub(r'[^\w\s]','',s)
string.punctuation
is ASCII only! A more correct (but also much slower) way is to use the unicodedata module:
# -*- coding: utf-8 -*-
from unicodedata import category
s = u'String — with - «punctation »...'
s = ''.join(ch for ch in s if category(ch)[0] != 'P')
print 'stripped', s
You can generalize and strip other types of characters as well:
''.join(ch for ch in s if category(ch)[0] not in 'SP')
It will also strip characters like ~*+§$
which may or may not be "punctuation" depending on one's point of view.
Here is a function I wrote. It's not very efficient, but it is simple and you can add or remove any punctuation that you desire:
def stripPunc(wordList):
"""Strips punctuation from list of words"""
puncList = [".",";",":","!","?","/","\\",",","#","@","$","&",")","(","\""]
for punc in puncList:
for word in wordList:
wordList=[word.replace(punc,'') for word in wordList]
return wordList
Why none of you use this?
''.join(filter(str.isalnum, s))
Too slow?
import re
s = "string. With. Punctuation?" # Sample string
out = re.sub(r'[^a-zA-Z0-9\s]', '', s)
I usually use something like this:
>>> s = "string. With. Punctuation?" # Sample string
>>> import string
>>> for c in string.punctuation:
... s= s.replace(c,"")
...
>>> s
'string With Punctuation'
Why none of you use this?
''.join(filter(str.isalnum, s))
Too slow?
This might not be the best solution however this is how I did it.
import string
f = lambda x: ''.join([i for i in x if i not in string.punctuation])
Try that one :)
regex.sub(r'\p{P}','', s)
myString.translate(None, string.punctuation)
I usually use something like this:
>>> s = "string. With. Punctuation?" # Sample string
>>> import string
>>> for c in string.punctuation:
... s= s.replace(c,"")
...
>>> s
'string With Punctuation'
Here's one other easy way to do it using RegEx
import re
punct = re.compile(r'(\w+)')
sentence = 'This ! is : a # sample $ sentence.' # Text with punctuation
tokenized = [m.group() for m in punct.finditer(sentence)]
sentence = ' '.join(tokenized)
print(sentence)
'This is a sample sentence'
A one-liner might be helpful in not very strict cases:
''.join([c for c in s if c.isalnum() or c.isspace()])
Remove stop words from the text file using Python
print('====THIS IS HOW TO REMOVE STOP WORS====')
with open('one.txt','r')as myFile:
str1=myFile.read()
stop_words ="not", "is", "it", "By","between","This","By","A","when","And","up","Then","was","by","It","If","can","an","he","This","or","And","a","i","it","am","at","on","in","of","to","is","so","too","my","the","and","but","are","very","here","even","from","them","then","than","this","that","though","be","But","these"
myList=[]
myList.extend(str1.split(" "))
for i in myList:
if i not in stop_words:
print ("____________")
print(i,end='\n')
I usually use something like this:
>>> s = "string. With. Punctuation?" # Sample string
>>> import string
>>> for c in string.punctuation:
... s= s.replace(c,"")
...
>>> s
'string With Punctuation'
Remove stop words from the text file using Python
print('====THIS IS HOW TO REMOVE STOP WORS====')
with open('one.txt','r')as myFile:
str1=myFile.read()
stop_words ="not", "is", "it", "By","between","This","By","A","when","And","up","Then","was","by","It","If","can","an","he","This","or","And","a","i","it","am","at","on","in","of","to","is","so","too","my","the","and","but","are","very","here","even","from","them","then","than","this","that","though","be","But","these"
myList=[]
myList.extend(str1.split(" "))
for i in myList:
if i not in stop_words:
print ("____________")
print(i,end='\n')
For the convenience of usage, I sum up the note of striping punctuation from a string in both Python 2 and Python 3. Please refer to other answers for the detailed description.
Python 2
import string
s = "string. With. Punctuation?"
table = string.maketrans("","")
new_s = s.translate(table, string.punctuation) # Output: string without punctuation
Python 3
import string
s = "string. With. Punctuation?"
table = str.maketrans(dict.fromkeys(string.punctuation)) # OR {key: None for key in string.punctuation}
new_s = s.translate(table) # Output: string without punctuation
I like to use a function like this:
def scrub(abc):
while abc[-1] is in list(string.punctuation):
abc=abc[:-1]
while abc[0] is in list(string.punctuation):
abc=abc[1:]
return abc
string.punctuation
misses loads of punctuation marks that are commonly used in the real world. How about a solution that works for non-ASCII punctuation?
import regex
s = u"string. With. Some·Really Weird?Non?ASCII? ?(Punctuation)??"
remove = regex.compile(ur'[\p{C}|\p{M}|\p{P}|\p{S}|\p{Z}]+', regex.UNICODE)
remove.sub(u" ", s).strip()
Personally, I believe this is the best way to remove punctuation from a string in Python because:
\{S}
if you want to remove punctuation, but keep symbols like $
.\{Pd}
will only remove dashes.This uses Unicode character properties, which you can read more about on Wikipedia.
For Python 3 str
or Python 2 unicode
values, str.translate()
only takes a dictionary; codepoints (integers) are looked up in that mapping and anything mapped to None
is removed.
To remove (some?) punctuation then, use:
import string
remove_punct_map = dict.fromkeys(map(ord, string.punctuation))
s.translate(remove_punct_map)
The dict.fromkeys()
class method makes it trivial to create the mapping, setting all values to None
based on the sequence of keys.
To remove all punctuation, not just ASCII punctuation, your table needs to be a little bigger; see J.F. Sebastian's answer (Python 3 version):
import unicodedata
import sys
remove_punct_map = dict.fromkeys(i for i in range(sys.maxunicode)
if unicodedata.category(chr(i)).startswith('P'))
myString.translate(None, string.punctuation)
with open('one.txt','r')as myFile:
str1=myFile.read()
print(str1)
punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
for i in punctuation:
str1 = str1.replace(i," ")
myList=[]
myList.extend(str1.split(" "))
print (str1)
for i in myList:
print(i,end='\n')
print ("____________")
For the convenience of usage, I sum up the note of striping punctuation from a string in both Python 2 and Python 3. Please refer to other answers for the detailed description.
Python 2
import string
s = "string. With. Punctuation?"
table = string.maketrans("","")
new_s = s.translate(table, string.punctuation) # Output: string without punctuation
Python 3
import string
s = "string. With. Punctuation?"
table = str.maketrans(dict.fromkeys(string.punctuation)) # OR {key: None for key in string.punctuation}
new_s = s.translate(table) # Output: string without punctuation
Considering unicode. Code checked in python3.
from unicodedata import category
text = 'hi, how are you?'
text_without_punc = ''.join(ch for ch in text if not category(ch).startswith('P'))
I haven't seen this answer yet. Just use a regex; it removes all characters besides word characters (\w
) and number characters (\d
), followed by a whitespace character (\s
):
import re
s = "string. With. Punctuation?" # Sample string
out = re.sub(ur'[^\w\d\s]+', '', s)
Not necessarily simpler, but a different way, if you are more familiar with the re family.
import re, string
s = "string. With. Punctuation?" # Sample string
out = re.sub('[%s]' % re.escape(string.punctuation), '', s)
I haven't seen this answer yet. Just use a regex; it removes all characters besides word characters (\w
) and number characters (\d
), followed by a whitespace character (\s
):
import re
s = "string. With. Punctuation?" # Sample string
out = re.sub(ur'[^\w\d\s]+', '', s)
#FIRST METHOD
#Storing all punctuations in a variable
punctuation='!?,.:;"\')(_-'
newstring='' #Creating empty string
word=raw_input("Enter string: ")
for i in word:
if(i not in punctuation):
newstring+=i
print "The string without punctuation is",newstring
#SECOND METHOD
word=raw_input("Enter string: ")
punctuation='!?,.:;"\')(_-'
newstring=word.translate(None,punctuation)
print "The string without punctuation is",newstring
#Output for both methods
Enter string: hello! welcome -to_python(programming.language)??,
The string without punctuation is: hello welcome topythonprogramminglanguage
Here's a one-liner for Python 3.5:
import string
"l*ots! o(f. p@u)n[c}t]u[a'ti\"on#$^?/".translate(str.maketrans({a:None for a in string.punctuation}))
I usually use something like this:
>>> s = "string. With. Punctuation?" # Sample string
>>> import string
>>> for c in string.punctuation:
... s= s.replace(c,"")
...
>>> s
'string With Punctuation'
Here's a solution without regex.
import string
input_text = "!where??and!!or$$then:)"
punctuation_replacer = string.maketrans(string.punctuation, ' '*len(string.punctuation))
print ' '.join(input_text.translate(punctuation_replacer).split()).strip()
Output>> where and or then
Considering unicode. Code checked in python3.
from unicodedata import category
text = 'hi, how are you?'
text_without_punc = ''.join(ch for ch in text if not category(ch).startswith('P'))
>>> s = "string. With. Punctuation?"
>>> s = re.sub(r'[^\w\s]','',s)
>>> re.split(r'\s*', s)
['string', 'With', 'Punctuation']
#FIRST METHOD
#Storing all punctuations in a variable
punctuation='!?,.:;"\')(_-'
newstring='' #Creating empty string
word=raw_input("Enter string: ")
for i in word:
if(i not in punctuation):
newstring+=i
print "The string without punctuation is",newstring
#SECOND METHOD
word=raw_input("Enter string: ")
punctuation='!?,.:;"\')(_-'
newstring=word.translate(None,punctuation)
print "The string without punctuation is",newstring
#Output for both methods
Enter string: hello! welcome -to_python(programming.language)??,
The string without punctuation is: hello welcome topythonprogramminglanguage
Just as an update, I rewrote the @Brian example in Python 3 and made changes to it to move regex compile step inside of the function. My thought here was to time every single step needed to make the function work. Perhaps you are using distributed computing and can't have regex object shared between your workers and need to have re.compile
step at each worker. Also, I was curious to time two different implementations of maketrans for Python 3
table = str.maketrans({key: None for key in string.punctuation})
vs
table = str.maketrans('', '', string.punctuation)
Plus I added another method to use set, where I take advantage of intersection function to reduce number of iterations.
This is the complete code:
import re, string, timeit
s = "string. With. Punctuation"
def test_set(s):
exclude = set(string.punctuation)
return ''.join(ch for ch in s if ch not in exclude)
def test_set2(s):
_punctuation = set(string.punctuation)
for punct in set(s).intersection(_punctuation):
s = s.replace(punct, ' ')
return ' '.join(s.split())
def test_re(s): # From Vinko's solution, with fix.
regex = re.compile('[%s]' % re.escape(string.punctuation))
return regex.sub('', s)
def test_trans(s):
table = str.maketrans({key: None for key in string.punctuation})
return s.translate(table)
def test_trans2(s):
table = str.maketrans('', '', string.punctuation)
return(s.translate(table))
def test_repl(s): # From S.Lott's solution
for c in string.punctuation:
s=s.replace(c,"")
return s
print("sets :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000))
print("sets2 :",timeit.Timer('f(s)', 'from __main__ import s,test_set2 as f').timeit(1000000))
print("regex :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000))
print("translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000))
print("translate2 :",timeit.Timer('f(s)', 'from __main__ import s,test_trans2 as f').timeit(1000000))
print("replace :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000))
This is my results:
sets : 3.1830138750374317
sets2 : 2.189873124472797
regex : 7.142953420989215
translate : 4.243278483860195
translate2 : 2.427158243022859
replace : 4.579746678471565
A one-liner might be helpful in not very strict cases:
''.join([c for c in s if c.isalnum() or c.isspace()])
with open('one.txt','r')as myFile:
str1=myFile.read()
print(str1)
punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
for i in punctuation:
str1 = str1.replace(i," ")
myList=[]
myList.extend(str1.split(" "))
print (str1)
for i in myList:
print(i,end='\n')
print ("____________")
string.punctuation
is ASCII only! A more correct (but also much slower) way is to use the unicodedata module:
# -*- coding: utf-8 -*-
from unicodedata import category
s = u'String — with - «punctation »...'
s = ''.join(ch for ch in s if category(ch)[0] != 'P')
print 'stripped', s
You can generalize and strip other types of characters as well:
''.join(ch for ch in s if category(ch)[0] not in 'SP')
It will also strip characters like ~*+§$
which may or may not be "punctuation" depending on one's point of view.
Not necessarily simpler, but a different way, if you are more familiar with the re family.
import re, string
s = "string. With. Punctuation?" # Sample string
out = re.sub('[%s]' % re.escape(string.punctuation), '', s)
Try that one :)
regex.sub(r'\p{P}','', s)
I like to use a function like this:
def scrub(abc):
while abc[-1] is in list(string.punctuation):
abc=abc[:-1]
while abc[0] is in list(string.punctuation):
abc=abc[1:]
return abc
For Python 3 str
or Python 2 unicode
values, str.translate()
only takes a dictionary; codepoints (integers) are looked up in that mapping and anything mapped to None
is removed.
To remove (some?) punctuation then, use:
import string
remove_punct_map = dict.fromkeys(map(ord, string.punctuation))
s.translate(remove_punct_map)
The dict.fromkeys()
class method makes it trivial to create the mapping, setting all values to None
based on the sequence of keys.
To remove all punctuation, not just ASCII punctuation, your table needs to be a little bigger; see J.F. Sebastian's answer (Python 3 version):
import unicodedata
import sys
remove_punct_map = dict.fromkeys(i for i in range(sys.maxunicode)
if unicodedata.category(chr(i)).startswith('P'))
Here is a function I wrote. It's not very efficient, but it is simple and you can add or remove any punctuation that you desire:
def stripPunc(wordList):
"""Strips punctuation from list of words"""
puncList = [".",";",":","!","?","/","\\",",","#","@","$","&",")","(","\""]
for punc in puncList:
for word in wordList:
wordList=[word.replace(punc,'') for word in wordList]
return wordList
Here's a one-liner for Python 3.5:
import string
"l*ots! o(f. p@u)n[c}t]u[a'ti\"on#$^?/".translate(str.maketrans({a:None for a in string.punctuation}))
Not necessarily simpler, but a different way, if you are more familiar with the re family.
import re, string
s = "string. With. Punctuation?" # Sample string
out = re.sub('[%s]' % re.escape(string.punctuation), '', s)
Regular expressions are simple enough, if you know them.
import re
s = "string. With. Punctuation?"
s = re.sub(r'[^\w\s]','',s)
Not necessarily simpler, but a different way, if you are more familiar with the re family.
import re, string
s = "string. With. Punctuation?" # Sample string
out = re.sub('[%s]' % re.escape(string.punctuation), '', s)
Just as an update, I rewrote the @Brian example in Python 3 and made changes to it to move regex compile step inside of the function. My thought here was to time every single step needed to make the function work. Perhaps you are using distributed computing and can't have regex object shared between your workers and need to have re.compile
step at each worker. Also, I was curious to time two different implementations of maketrans for Python 3
table = str.maketrans({key: None for key in string.punctuation})
vs
table = str.maketrans('', '', string.punctuation)
Plus I added another method to use set, where I take advantage of intersection function to reduce number of iterations.
This is the complete code:
import re, string, timeit
s = "string. With. Punctuation"
def test_set(s):
exclude = set(string.punctuation)
return ''.join(ch for ch in s if ch not in exclude)
def test_set2(s):
_punctuation = set(string.punctuation)
for punct in set(s).intersection(_punctuation):
s = s.replace(punct, ' ')
return ' '.join(s.split())
def test_re(s): # From Vinko's solution, with fix.
regex = re.compile('[%s]' % re.escape(string.punctuation))
return regex.sub('', s)
def test_trans(s):
table = str.maketrans({key: None for key in string.punctuation})
return s.translate(table)
def test_trans2(s):
table = str.maketrans('', '', string.punctuation)
return(s.translate(table))
def test_repl(s): # From S.Lott's solution
for c in string.punctuation:
s=s.replace(c,"")
return s
print("sets :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000))
print("sets2 :",timeit.Timer('f(s)', 'from __main__ import s,test_set2 as f').timeit(1000000))
print("regex :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000))
print("translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000))
print("translate2 :",timeit.Timer('f(s)', 'from __main__ import s,test_trans2 as f').timeit(1000000))
print("replace :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000))
This is my results:
sets : 3.1830138750374317
sets2 : 2.189873124472797
regex : 7.142953420989215
translate : 4.243278483860195
translate2 : 2.427158243022859
replace : 4.579746678471565
Here's a solution without regex.
import string
input_text = "!where??and!!or$$then:)"
punctuation_replacer = string.maketrans(string.punctuation, ' '*len(string.punctuation))
print ' '.join(input_text.translate(punctuation_replacer).split()).strip()
Output>> where and or then
Source: Stackoverflow.com