I'm trying to convert a string to a list of words using Python. I want to take something like the following:
string = 'This is a string, with words!'
Then convert it to something like this:
list = ['This', 'is', 'a', 'string', 'with', 'words']
Notice the omission of punctuation and spaces. What would be the fastest way of going about this?
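You can try this (Python 3, where maketrans is a static method of str and both arguments must have the same length):
text = "This is a string, with words!"
table = str.maketrans(",!", "  ")  # map ',' and '!' each to a space
text = text.translate(table)
listOfWords = text.split()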
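The same translation-table idea can be extended from two characters to all ASCII punctuation. The following is only a sketch, assuming Python 3; the helper name `split_words` and the constant `PUNCT_TO_SPACE` are made up for illustration.

```python
import string

# Map every ASCII punctuation character to a space.
# str.maketrans requires both arguments to have the same length.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def split_words(text):
    """Replace punctuation with spaces, then split on runs of whitespace."""
    return text.translate(PUNCT_TO_SPACE).split()

print(split_words("This is a string, with words!"))
# ['This', 'is', 'a', 'string', 'with', 'words']
```

Note that this also breaks contractions apart: "I'm" becomes 'I' and 'm', because the apostrophe is part of string.punctuation.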
words = mystr.split(" ", mystr.count(" "))  # same result as mystr.split(" "); punctuation stays attached
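This is from my attempt at a coding challenge that did not allow regex:
outputList = "".join((c if c.isalnum() or c == "'" else ' ') for c in inputStr).split()
Calling split() with no argument avoids the empty strings that split(' ') produces for consecutive spaces. The role of the apostrophe is interesting: keeping it preserves contractions such as "I'm" as single words.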
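To make the effect of the apostrophe concrete, here is a small sketch that runs the same character filter with and without it. The function name `split_keep` and its `keep` parameter are hypothetical, introduced only for this comparison.

```python
def split_keep(text, keep=""):
    """Split text into words, treating every character that is neither
    alphanumeric nor listed in `keep` as a separator."""
    cleaned = "".join(c if c.isalnum() or c in keep else " " for c in text)
    return cleaned.split()

sentence = "Don't panic, it's fine!"
print(split_keep(sentence))            # ['Don', 't', 'panic', 'it', 's', 'fine']
print(split_keep(sentence, keep="'"))  # ["Don't", 'panic', "it's", 'fine']
```

The trade-off is that single-quoted words keep their quotes when the apostrophe is allowed, so 'hello' in quotes would come back with the quote marks attached.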
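Using string.punctuation, for completeness:
import re
import string
x = re.sub('[' + re.escape(string.punctuation) + ']', '', s).split()
This handles newlines as well, because split() with no arguments splits on any whitespace. The re.escape call makes sure characters such as the backslash and ']' are treated literally inside the character class.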
This way you eliminate every special char outside of the alphabet:
def wordsToList(strn):
    L = strn.split()
    cleanL = []
    abc = 'abcdefghijklmnopqrstuvwxyz'
    ABC = abc.upper()
    letters = abc + ABC
    for e in L:
        word = ''
        for c in e:
            if c in letters:
                word += c
        if word != '':
            cleanL.append(word)
    return cleanL
s = 'She loves you, yea yea yea! '
L = wordsToList(s)
print(L) # ['She', 'loves', 'you', 'yea', 'yea', 'yea']
I'm not sure if this is fast or optimal or even the right way to program.
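One way to find out whether an approach like this is fast is to measure it. The sketch below uses the standard timeit module to compare a letter-filtering loop against a regex that does the same job. The function names, the sample text, and the repeat count are all arbitrary choices for illustration, and timings will vary by machine and Python version, so no numbers are shown.

```python
import re
import timeit

LETTERS = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

def words_loop(text):
    """Keep only ASCII letters in each whitespace-separated token."""
    cleaned = ("".join(c for c in token if c in LETTERS) for token in text.split())
    return [word for word in cleaned if word]

def words_regex(text):
    """Delete every character that is neither a letter nor whitespace, then split."""
    return re.sub(r"[^A-Za-z\s]", "", text).split()

sample = "She loves you, yea yea yea! " * 1000

# Both functions should agree before their timings are compared.
assert words_loop(sample) == words_regex(sample)

for func in (words_loop, words_regex):
    seconds = timeit.timeit(lambda: func(sample), number=20)
    print(f"{func.__name__}: {seconds:.3f} s for 20 runs")
```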
Inspired by @mtrw's answer, but improved to strip out punctuation at word boundaries only:
import re
import string
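def extract_words(s):
    # Strip leading and trailing punctuation from each whitespace-separated token.
    punct = re.escape(string.punctuation)
    return [re.sub('^[{0}]+|[{0}]+$'.format(punct), '', w) for w in s.split()]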
>>> str = 'This is a string, with words!'
>>> extract_words(str)
['This', 'is', 'a', 'string', 'with', 'words']
>>> str = '''I'm a custom-built sentence with "tricky" words like https://stackoverflow.com/.'''
>>> extract_words(str)
["I'm", 'a', 'custom-built', 'sentence', 'with', 'tricky', 'words', 'like', 'https://stackoverflow.com']
I think this is the simplest way for anyone else stumbling on this post, though note that punctuation stays attached to the words:
>>> string = 'This is a string, with words!'
>>> string.split()
['This', 'is', 'a', 'string,', 'with', 'words!']
A regular expression for words would give you the most control. You would want to carefully consider how to deal with words with dashes or apostrophes, like "I'm".
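As a sketch of what such a pattern could look like, the regex below treats an apostrophe or hyphen as part of a word only when it sits between letters or digits. The pattern, the name `WORD_RE`, and the helper `find_words` are assumptions for illustration; the pattern is limited to ASCII letters and digits, so accented characters would need a different character class.

```python
import re

# A word is a run of letters/digits, optionally continued by an apostrophe
# or hyphen that is itself followed by more letters/digits.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

def find_words(text):
    """Return all words in text, keeping inner apostrophes and hyphens."""
    return WORD_RE.findall(text)

print(find_words("I'm a custom-built sentence, with 'quoted' words!"))
# ["I'm", 'a', 'custom-built', 'sentence', 'with', 'quoted', 'words']
```

Because the apostrophe must be followed by a letter or digit, surrounding quote marks are dropped while contractions survive.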
Personally, I think this is slightly cleaner than the other answers provided:
import re
def split_to_words(sentence):
    # Use sentence.lower() if needed
    return list(filter(lambda w: len(w) > 0, re.split(r'\W+', sentence)))
The simplest way:
>>> import re
>>> string = 'This is a string, with words!'
>>> re.findall(r'\w+', string)
['This', 'is', 'a', 'string', 'with', 'words']
To do this properly is quite complex. For your research, it is known as word tokenization. You should look at NLTK if you want to see what others have done, rather than starting from scratch:
>>> import nltk
>>> paragraph = "Hi, this is my first sentence. And this is my second."
>>> sentences = nltk.sent_tokenize(paragraph)
>>> for sentence in sentences:
...     nltk.word_tokenize(sentence)
...
['Hi', ',', 'this', 'is', 'my', 'first', 'sentence', '.']
['And', 'this', 'is', 'my', 'second', '.']
Well, you could use
import re
list = re.sub(r'[.!,;?]', ' ', string).split()
Note that list is the name of a builtin type and string is the name of a standard-library module, so you probably don't want to use either as a variable name.
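To illustrate why reusing such names causes trouble, the sketch below assigns to list inside a function and then tries to call it as the builtin. The function name `demo_shadowing` is hypothetical, and the shadowing is kept local to the function so it does not leak into the rest of a program.

```python
import re

def demo_shadowing():
    text = "This is a string, with words!"
    # Assigning to `list` hides the builtin type for the rest of this scope.
    list = re.sub(r"[.!,;?]", " ", text).split()
    try:
        list("abc")  # now calls a list object, not the builtin type
    except TypeError as exc:
        return str(exc)

print(demo_shadowing())  # 'list' object is not callable
```

Using a neutral name such as words for the result avoids the problem.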
Source: Stackoverflow.com