Converting a String to a List of Words

Question

I'm trying to convert a string to a list of words using Python. I want to take something like the following:

    string = 'This is a string, with words!'

Then convert it to something like this:

    list = ['This', 'is', 'a', 'string', 'with', 'words']

Notice the omission of punctuation and spaces. What would be the fastest way of going about this?

User · Answer

    list = mystr.split(" ", mystr.count(" "))

User · Answer

The simplest way:

    >>> import re
    >>> string = 'This is a string, with words!'
    >>> re.findall(r'\w+', string)
    ['This', 'is', 'a', 'string', 'with', 'words']

User · Answer

Inspired by @mtrw's answer, but improved to strip out punctuation at word boundaries only:

    import re
    import string

    def extract_words(s):
        return [re.sub('^[{0}]+|[{0}]+$'.format(string.punctuation), '', w) for w in s.split()]

    >>> str = 'This is a string, with words!'
    >>> extract_words(str)
    ['This', 'is', 'a', 'string', 'with', 'words']

    >>> str = '''I'm a custom-built sentence with "tricky" words like https://stackoverflow.com/.'''
    >>> extract_words(str)
    ["I'm", 'a', 'custom-built', 'sentence', 'with', 'tricky', 'words', 'like', 'https://stackoverflow.com']

User · Answer

I think this is the simplest way for anyone else stumbling on this post, given the late response:

    >>> string = 'This is a string, with words!'
    >>> string.split()
    ['This', 'is', 'a', 'string,', 'with', 'words!']
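Since a plain split() leaves punctuation attached to the words, here is a minimal sketch of one way to trim it afterwards without regex. The function name split_and_strip is made up for illustration, and it assumes that only punctuation at the start or end of each token should go.

```python
import string

def split_and_strip(text):
    """Split on whitespace, then trim punctuation from both ends of each token."""
    words = (token.strip(string.punctuation) for token in text.split())
    # Drop tokens that consisted only of punctuation and are now empty.
    return [word for word in words if word]

print(split_and_strip("This is a string, with words!"))
```

Punctuation inside a token, such as the apostrophe in "I'm", is left alone because str.strip only touches the ends.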

User · Answer

Personally, I think this is slightly cleaner than the answers provided:

    import re

    def split_to_words(sentence):
        # Use sentence.lower() if needed
        return list(filter(lambda w: len(w) > 0, re.split(r'\W+', sentence)))
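To see why the empty-string filter is needed with re.split, here is a small sketch using the sentence from the question. The variable names are only illustrative.

```python
import re

sentence = "This is a string, with words!"

# Splitting on runs of non-word characters leaves an empty string
# wherever the text starts or ends with punctuation.
raw_parts = re.split(r"\W+", sentence)

# Keep only the non-empty pieces.
words = [part for part in raw_parts if part]

print(raw_parts)
print(words)
```

The trailing "!" produces one empty string at the end of raw_parts, which the list comprehension removes.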

User · Answer

To do this properly is quite complex. For your research, it is known as word tokenization. You should look at NLTK if you want to see what others have done, rather than starting from scratch:

    >>> import nltk
    >>> paragraph = u"Hi, this is my first sentence. And this is my second."
    >>> sentences = nltk.sent_tokenize(paragraph)
    >>> for sentence in sentences:
    ...     nltk.word_tokenize(sentence)
    [u'Hi', u',', u'this', u'is', u'my', u'first', u'sentence', u'.']
    [u'And', u'this', u'is', u'my', u'second', u'.']

User · Answer

This is from my attempt at a coding challenge that can't use regex:

    outputList = "".join((c if c.isalnum() or c == "'" else " ") for c in inputStr).split()

The role of the apostrophe seems interesting.

User · Answer

Well, you could use

    import re
    list = re.sub(r'[.!,;?]', ' ', string).split()

Note that list is the name of a builtin type and string is the name of a standard library module, so you probably don't want to use those as your variable names.

User · Answer

This way you eliminate every special character outside of the alphabet:

    def wordsToList(strn):
        L = strn.split()
        cleanL = []
        abc = 'abcdefghijklmnopqrstuvwxyz'
        ABC = abc.upper()
        letters = abc + ABC
        for e in L:
            word = ''
            for c in e:
                if c in letters:
                    word += c
            if word != '':
                cleanL.append(word)
        return cleanL

    s = 'She loves you, yea yea yea! '
    L = wordsToList(s)
    print(L)  # ['She', 'loves', 'you', 'yea', 'yea', 'yea']

I'm not sure if this is fast or optimal or even the right way to program.

User · Answer

Using string.punctuation for completeness:

    import re
    import string
    x = re.sub('[' + string.punctuation + ']', '', s).split()

This handles newlines as well.

User · Answer

A regular expression for words would give you the most control. You would want to carefully consider how to deal with words with dashes or apostrophes, like "I'm".
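As a rough sketch of that idea, the pattern below keeps apostrophes and hyphens only when they sit between word characters. The pattern and the names WORD_RE and find_words are assumptions for illustration, not a standard definition of a word.

```python
import re

# Hypothetical pattern: a run of word characters, optionally followed by
# groups of one apostrophe or hyphen plus more word characters.
WORD_RE = re.compile(r"\w+(?:['-]\w+)*")

def find_words(text):
    """Return every substring of text that matches WORD_RE."""
    return WORD_RE.findall(text)

print(find_words("I'm a custom-built sentence, with 'tricky' words!"))
```

Quotes around a word are dropped because an apostrophe is only kept when more word characters follow it. Whether that is the right rule depends on your text.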

User · Answer

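You can try and do this (in Python 3, maketrans is a static method of str; Python 2 had string.maketrans):

    # Map each punctuation character to a space, then split on whitespace.
    tryTrans = str.maketrans(",!", "  ")
    text = "This is a string, with words!"
    text = text.translate(tryTrans)
    listOfWords = text.split()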
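A variation on the translate approach, sketched for Python 3: the three-argument form of str.maketrans deletes characters instead of mapping them. The names DELETE_PUNCT and words_without_punctuation are made up for this example.

```python
import string

# Three-argument form: every character in the third string is deleted.
DELETE_PUNCT = str.maketrans("", "", string.punctuation)

def words_without_punctuation(text):
    """Delete all ASCII punctuation, then split on whitespace."""
    return text.translate(DELETE_PUNCT).split()

print(words_without_punctuation("This is a string, with words!"))
```

Because the punctuation is deleted rather than replaced with a space, "hello-world" comes out as the single word "helloworld". Map punctuation to spaces instead if you want it to act as a separator.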

User · Answer

Try this:

    import re

    mystr = 'This is a string, with words!'
    wordList = re.sub(r"[^\w]", " ", mystr).split()

How it works:

From the docs:

    re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn't found, string is returned unchanged. repl can be a string or a function.

So in our case:

The pattern is any non-alphanumeric character. [\w] means any alphanumeric character and, for ASCII text, is equal to the character set [a-zA-Z0-9_]: a to z, A to Z, 0 to 9 and underscore.

So we match any non-alphanumeric character and replace it with a space, and then we split() the result, which splits the string on whitespace and converts it to a list.

So 'hello-world' becomes 'hello world' with re.sub and then ['hello', 'world'] after split().

Let me know if any doubts come up.
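The two steps can be inspected separately. This is a small sketch with a made-up sample string that also shows the underscore being kept, since it counts as a word character.

```python
import re

sample = "hello-world, foo_bar!"

# Step 1: every character that is not a word character becomes a space.
spaced = re.sub(r"[^\w]", " ", sample)

# Step 2: split() with no argument splits on runs of whitespace, so the
# doubled and trailing spaces from step 1 do not produce empty strings.
words = spaced.split()

print(repr(spaced))
print(words)
```

If underscores should also act as separators, a character class such as [^a-zA-Z0-9] in place of [^\w] is one option for ASCII text.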
