Split Strings into words with multiple word boundary delimiters

Question

I think what I want to do is a fairly common task, but I've found no reference on the web. I have text with punctuation, and I want a list of the words.

    "Hey, you - what are you doing here!?"

should be

    ['hey', 'you', 'what', 'are', 'you', 'doing', 'here']

But Python's str.split() only works with one argument, so I have all words with the punctuation after I split with whitespace. Any ideas?

User · Answer

Use replace two times:

    a = '11223FROM33344INTO33222FROM3344'
    a.replace('FROM', ',,,').replace('INTO', ',,,').split(',,,')

results in:

    ['11223', '33344', '33222', '3344']

User · Answer

I like re, but here is my solution without it:

    from itertools import groupby

    sep = ' ,-!?'
    s = "Hey, you - what are you doing here!?"
    print([''.join(g) for k, g in groupby(s, sep.__contains__) if not k])

sep.__contains__ is the method used by the 'in' operator. Basically it is the same as lambda ch: ch in sep, but it is more convenient here.

groupby gets our string and function. It splits the string into groups using that function: whenever the value of the function changes, a new group is generated. So sep.__contains__ is exactly what we need.

groupby returns a sequence of pairs, where pair[0] is the result of our function and pair[1] is a group. Using 'if not k' we filter out the groups of separators (because the result of sep.__contains__ is True on separators).

Well, that's all: now we have a sequence of groups where each one is a word (a group is actually an iterable, so we use join to convert it to a string).

This solution is quite general, because it uses a function to separate the string (you can split by any condition you need). Also, it doesn't create intermediate strings or lists (you can remove join and the expression will become lazy, since each group is an iterator).
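To illustrate the "split by any condition" remark above, here is a minimal sketch that wraps the same groupby idea in a generator. The name split_by and the choice of "not alphanumeric" as the separator test are assumptions made for this example, not part of the answer.

```python
from itertools import groupby

def split_by(text, is_separator):
    """Yield maximal runs of characters for which is_separator(ch) is false."""
    for is_sep, group in groupby(text, is_separator):
        if not is_sep:
            yield "".join(group)

# Assumption for this example: every non-alphanumeric character separates words.
words = list(split_by("Hey, you - what are you doing here!?",
                      lambda ch: not ch.isalnum()))
print(words)  # ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```

Because split_by is a generator, words are produced one at a time; wrapping it in list() is only needed when the whole list is wanted at once.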

User · Answer

I like pprzemek's solution because it does not assume that the delimiters are single characters and it doesn't try to leverage a regex (which would not work well if the number of separators got to be crazy long).

Here's a more readable version of the above solution for clarity:

    def split_string_on_multiple_separators(input_string, separators):
        buffer = [input_string]
        for sep in separators:
            strings = buffer
            buffer = []  # reset the buffer
            for s in strings:
                buffer = buffer + s.split(sep)
        return buffer

User · Answer

    join = lambda x: sum(x, [])  # a.k.a. flatten1([[1], [2, 3], [4]]) -> [1, 2, 3, 4]
    # ...alternatively...
    join = lambda lists: [x for l in lists for x in l]

Then this becomes a three-liner:

    fragments = [text]
    for token in tokens:
        fragments = join(f.split(token) for f in fragments)

Explanation

This is what in Haskell is known as the List monad. The idea behind the monad is that once "in the monad" you "stay in the monad" until something takes you out. For example in Haskell, say you map the python range(n) -> [0, 1, ..., n-1] function over a List. If the result is a List, it will be appended to the List in-place, so you'd get something like map(range, [3, 4, 1]) -> [0, 1, 2, 0, 1, 2, 3, 0]. This is known as map-append (Haskell calls it concatMap). The idea here is that you've got this operation you're applying (splitting on a token), and whenever you do that, you join the result into the list.

You can abstract this into a function and have tokens=string.punctuation by default.

Advantages of this approach:

- This approach (unlike naive regex-based approaches) can work with arbitrary-length tokens (which regex can also do with more advanced syntax).
- You are not restricted to mere tokens; you could have arbitrary logic in place of each token, for example one of the "tokens" could be a function which splits according to how nested parentheses are.
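The abstraction suggested above might look like the following sketch. The function name split_on_tokens is hypothetical, and the default token set adds whitespace to string.punctuation (an assumption, so that the question's sentence splits into words); empty fragments are dropped at the end.

```python
import string

def split_on_tokens(text, tokens=string.punctuation + string.whitespace):
    """Split text on every token in turn, dropping empty fragments."""
    fragments = [text]
    for token in tokens:
        # Split each current fragment on this token and flatten the results.
        fragments = [piece for f in fragments for piece in f.split(token)]
    return [f for f in fragments if f]

print(split_on_tokens("Hey, you - what are you doing here!?"))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

# Tokens may be longer than one character when passed as a list.
print(split_on_tokens("11223FROM33344INTO33222FROM3344", tokens=["FROM", "INTO"]))
# ['11223', '33344', '33222', '3344']
```

Passing a string iterates over its characters, while passing a list allows multi-character tokens; an empty-string token would make str.split raise ValueError, so it is assumed not to occur.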

User · Answer

Try this:

    import re

    phrase = "Hey, you - what are you doing here!?"
    matches = re.findall(r'\w+', phrase)
    print(matches)

This will print:

    ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

User · Answer

Pro-Tip: Use string.translate for the fastest string operations Python has.

Some proof (a Python 2 session; in Python 3 the string.translate and string.maketrans functions no longer exist, and str.translate and str.maketrans take their place).

First, the slow way (sorry pprzemek):

    >>> import timeit
    >>> S = 'Hey, you - what are you doing here!?'
    >>> def my_split(s, seps):
    ...     res = [s]
    ...     for sep in seps:
    ...         s, res = res, []
    ...         for seq in s:
    ...             res += seq.split(sep)
    ...     return res
    ...
    >>> timeit.Timer('my_split(S, punctuation)', 'from __main__ import S,my_split; from string import punctuation').timeit()
    54.65477919578552

Next, we use re.findall() (as given by the suggested answer). MUCH faster:

    >>> timeit.Timer('findall(r"\w+", S)', 'from __main__ import S; from re import findall').timeit()
    4.194725036621094

Finally, we use translate:

    >>> from string import translate,maketrans,punctuation
    >>> T = maketrans(punctuation, ' '*len(punctuation))
    >>> timeit.Timer('translate(S, T).split()', 'from __main__ import S,T,translate').timeit()
    1.2835021018981934

Explanation:

string.translate is implemented in C and makes a single pass over the string, producing one result string instead of a chain of intermediate ones (Python strings are immutable, so a new string is still created). So it's about as fast as you can get for string substitution.

It's a bit awkward, though, as it needs a translation table in order to do this magic. You can make a translation table with the maketrans() convenience function. The objective here is to translate all unwanted characters to spaces: a one-for-one substitution through a table lookup, which is why it is fast.

Next, we use good old split(). split() by default will operate on all whitespace characters, grouping them together for the split. The result will be the list of words that you want. And in the timings above this approach is roughly 3x faster than re.findall() (1.28 versus 4.19).
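For readers on Python 3, the same translate-then-split idea can be sketched with str.maketrans and str.translate. The helper name split_words is an assumption for this example, and no timing claim is made for it.

```python
import string

# Map every punctuation character to a space (a one-for-one substitution).
TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def split_words(text):
    """Replace punctuation with spaces, then split on runs of whitespace."""
    return text.translate(TABLE).split()

print(split_words("Hey, you - what are you doing here!?"))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```

Note that apostrophes are punctuation too, so a contraction such as "don't" comes out as two words with this table.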

User · Answer

So many answers, yet I can't find any solution that does efficiently what the title of the question literally asks for (splitting on multiple possible separators; instead, many answers split on anything that is not a word, which is different). So here is an answer to the question in the title, that relies on Python's standard and efficient re module:

    >>> import re  # Will be splitting on: , <space> - ! ? :
    >>> list(filter(None, re.split(r"[, \-!?:]+", "Hey, you - what are you doing here!?")))
    ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

where:

- the [...] matches one of the separators listed inside,
- the \- in the regular expression is here to prevent the special interpretation of - as a character range indicator (as in A-Z),
- the + skips one or more delimiters (it could be omitted thanks to the filter(), but this would unnecessarily produce empty strings between matched single-character separators), and
- filter(None, ...) removes the empty strings possibly created by leading and trailing separators (since empty strings have a false boolean value); in Python 3 it returns an iterator, hence the list() call.

This re.split() precisely "splits with multiple separators", as asked for in the question title.

This solution is furthermore immune to the problems with non-ASCII characters in words found in some other solutions (see the first comment to ghostdog74's answer).

The re module is much more efficient (in speed and concision) than doing Python loops and tests "by hand".
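When the separators are arbitrary strings rather than single characters, the pattern can be built from a list instead of being written by hand. The following is a sketch under that assumption; the helper name split_on_any is hypothetical, and re.escape is used so that separators such as ? or - need no manual escaping.

```python
import re

def split_on_any(text, separators):
    """Split text on any of the given separator strings, dropping empty pieces."""
    if not separators:
        return [text] if text else []
    # Try longer separators first so that "--" is preferred over "-".
    ordered = sorted(separators, key=len, reverse=True)
    pattern = "(?:" + "|".join(re.escape(sep) for sep in ordered) + ")+"
    return [piece for piece in re.split(pattern, text) if piece]

print(split_on_any("Hey, you - what are you doing here!?", [",", " ", "-", "!", "?"]))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
print(split_on_any("11223FROM33344INTO33222FROM3344", ["FROM", "INTO"]))
# ['11223', '33344', '33222', '3344']
```

The trailing + on the group plays the same role as in the character-class version: adjacent separators are consumed together instead of producing empty strings between them.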

User · Answer

Another way to achieve this is to use the Natural Language Toolkit (nltk).

    import nltk

    data = "Hey, you - what are you doing here!?"
    word_tokens = nltk.tokenize.regexp_tokenize(data, r'\w+')
    print(word_tokens)

This prints:

    ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

The biggest drawback of this method is that you need to install the nltk package.

The benefit is that you can do a lot of fun stuff with the rest of the nltk package once you get your tokens.

User · Answer

Another quick way to do this without a regexp is to replace the characters first, as below:

    >>> 'a;bcd,ef g'.replace(';', ' ').replace(',', ' ').split()
    ['a', 'bcd', 'ef', 'g']

User · Answer

Instead of using the re module function re.split, you can achieve the same result using the Series.str.split method of pandas.

First, create a Series with the above string and then apply the method to the Series:

    import pandas as pd

    thestring = pd.Series("Hey, you - what are you doing here!?")
    thestring.str.split(pat=',|-')

The pat parameter takes the delimiters, and the method returns the split string as a list. Here the two delimiters are combined with | (the regex "or" operator). The output is as follows:

    0    [Hey,  you ,  what are you doing here!?]

User · Answer

Here's my take on it:

    def split_string(source, splitlist):
        splits = frozenset(splitlist)
        l = []
        s1 = ""
        for c in source:
            if c in splits:
                # A delimiter ends the current word, if there is one.
                if s1:
                    l.append(s1)
                    s1 = ""
            else:
                s1 = s1 + c
        if s1:
            l.append(s1)
        return l

    >>> out = split_string("First Name,Last Name,Street Address,City,State,Zip Code", ",")
    >>> print(out)
    ['First Name', 'Last Name', 'Street Address', 'City', 'State', 'Zip Code']

User · Answer

I recently needed to do this but wanted a function that somewhat matched the standard library str.split function. This function behaves the same as the standard library when called with 0 or 1 arguments.

    def split_many(string, *separators):
        if len(separators) == 0:
            return string.split()
        if len(separators) > 1:
            # Map every separator onto the first one, then split on that.
            table = {
                ord(separator): ord(separators[0])
                for separator in separators
            }
            string = string.translate(table)
        return string.split(separators[0])

NOTE: This function is only useful when your separators consist of a single character (as was my use case).

User · Answer

First of all, I don't think that your intention is to actually use punctuation as delimiters in the split functions. Your description suggests that you simply want to eliminate punctuation from the resultant strings.

I come across this pretty frequently, and my usual solution doesn't require re.

One-liner lambda function w/ list comprehension:

(requires import string):

    split_without_punc = lambda text: [word.strip(string.punctuation) for word in
        text.split() if word.strip(string.punctuation) != '']

    # Call function
    split_without_punc("Hey, you -- what are you doing?!")
    # returns ['Hey', 'you', 'what', 'are', 'you', 'doing']

Function (traditional)

As a traditional function, this is still only two lines with a list comprehension (in addition to import string):

    def split_without_punctuation2(text):
        # Split by whitespace
        words = text.split()
        # Strip punctuation from each word
        return [word.strip(string.punctuation) for word in words
                if word.strip(string.punctuation) != '']

    split_without_punctuation2("Hey, you -- what are you doing?!")
    # returns ['Hey', 'you', 'what', 'are', 'you', 'doing']

It will also naturally leave contractions and hyphenated words intact. You can always use text.replace("-", " ") to turn hyphens into spaces before the split.

General Function w/o Lambda or List Comprehension

For a more general solution (where you can specify the characters to eliminate), and without a list comprehension, you get:

    def split_without(text: str, ignore: str) -> list:
        # Split by whitespace
        split_string = text.split()
        # Strip any characters in the ignore string, and ignore empty strings
        words = []
        for word in split_string:
            word = word.strip(ignore)
            if word != '':
                words.append(word)
        return words

    # Situation-specific call to general function
    import string
    final_text = split_without("Hey, you - what are you doing?!", string.punctuation)
    # returns ['Hey', 'you', 'what', 'are', 'you', 'doing']

Of course, you can always generalize the lambda function to any specified string of characters as well.
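As a quick check of the claim that contractions and hyphenated words survive, here is a compact sketch of the same strip-based idea. The name split_and_strip and the sample sentence are assumptions made for this example.

```python
import string

def split_and_strip(text, ignore=string.punctuation):
    """Split on whitespace, then strip `ignore` characters from each word's ends."""
    stripped = (word.strip(ignore) for word in text.split())
    return [word for word in stripped if word]

print(split_and_strip("Hey, you -- don't touch the well-known switch!"))
# ['Hey', 'you', "don't", 'touch', 'the', 'well-known', 'switch']
```

Because str.strip only removes characters from the two ends of a word, the apostrophe in "don't" and the hyphen in "well-known" are left alone, while the standalone "--" strips down to an empty string and is dropped.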

User · Answer

I'm re-acquainting myself with Python and needed the same thing. The findall solution may be better, but I came up with this:

    tokens = [x.strip() for x in data.split(',')]

User · Answer

In Python 3, you can use the method from PY4E - Python for Everybody.

    We can solve both these problems by using the string methods lower, punctuation, and translate. The translate is the most subtle of the methods. Here is the documentation for translate:

    your_string.translate(your_string.maketrans(fromstr, tostr, deletestr))

    Replace the characters in fromstr with the character in the same position in tostr and delete all characters that are in deletestr. The fromstr and tostr can be empty strings and the deletestr parameter can be omitted.

You can see the "punctuation":

    In [10]: import string

    In [11]: string.punctuation
    Out[11]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

For your example:

    In [12]: your_str = "Hey, you - what are you doing here!?"

    In [13]: line = your_str.translate(your_str.maketrans('', '', string.punctuation))

    In [14]: line = line.lower()

    In [15]: words = line.split()

    In [16]: print(words)
    ['hey', 'you', 'what', 'are', 'you', 'doing', 'here']

For more information, you can refer to:

- PY4E - Python for Everybody
- str.translate
- str.maketrans
- Python String maketrans() Method

User · Answer

I had to come up with my own solution since everything I ve tested so far failed at some point    gt  gt  gt  import re  gt  gt  gt  def split words text           rgx   re compile r        lt     w     w-        lt  -         lt     w     w-                       return rgx findall text    It seems to be working fine  at least for the examples below    gt  gt  gt  split words  The hill-tops gleam in morning s spring      The    hill-tops    gleam    in    morning s    spring    gt  gt  gt  split words  I d say it s James   time       I d    say    it s    James     time    gt  gt  gt  split words  tic-tac-toe s tic-tac-toe ll tic-tac tic-tac we ll--if tic-tac     tic-tac-toe s    tic-tac-toe ll    tic-tac tic-tac    we ll    if    tic-tac    gt  gt  gt  split words  google com email google com split words     google    com    email    google    com    split words    gt  gt  gt  split words  Kurt Friedrich G  del    g  rd l   2  German    k   t  g   dl    listen       Kurt    Friedrich    G  del     g  rd l    2    German     k      t     g   dl    listen    gt  gt  gt  split words  April 28  1906     January 14  1978  was an Austro-Hungarian-born Austrian        April    28    1906    January    14    1978    was    an    Austro-Hungarian-born    Austrian

User · Answer

Here is the answer with some explanation.

    st = "Hey, you - what are you doing here!?"

    # replace all the non alpha-numeric characters with a space and then join
    new_string = ''.join([x.replace(x, ' ') if not x.isalnum() else x for x in st])
    # output of new_string
    'Hey  you   what are you doing here  '

    # str.split() will remove all the empty strings if a separator is not provided
    new_list = new_string.split()
    # output of new_list
    ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

    # we can join it to get a complete string without any non alpha-numeric character
    ' '.join(new_list)
    # output
    'Hey you what are you doing here'

or in one line, we can do it like this:

    (''.join([x.replace(x, ' ') if not x.isalnum() else x for x in st])).split()
    # output
    ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

User · Answer

I think the following is the best answer to suit your needs:

\W+ may be suitable for this case, but may not be suitable for other cases.

    list(filter(None, re.compile(r'[ ,\-!?]').split("Hey, you - what are you doing here!?")))

User · Answer

First of all, compile the pattern with re.compile() before performing any regex operation in a loop; it saves the repeated pattern lookup, although the gain is usually modest.

So for your problem, first compile the pattern and then perform the action on it.

    import re

    DATA = "Hey, you - what are you doing here!?"
    reg_tok = re.compile(r"[\w']+")
    print(reg_tok.findall(DATA))

User · Answer

I like the replace() way the best. The following procedure changes all separators defined in a string splitlist to the first separator in splitlist and then splits the text on that one separator. It also handles the case where splitlist is an empty string. It returns a list of words, with no empty strings in it.

    def split_string(text, splitlist):
        for sep in splitlist:
            text = text.replace(sep, splitlist[0])
        return list(filter(None, text.split(splitlist[0]))) if splitlist else [text]

User · Answer

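    def get_words(s):
        l = []
        w = ''
        for c in s.lower():
            if c in '-!?,. ':
                # A separator ends the current word, if there is one.
                if w != '':
                    l.append(w)
                w = ''
            else:
                w = w + c
        if w != '':
            l.append(w)
        return l

Here is the usage:

    >>> s = "Hey, you - what are you doing here!?"
    >>> print(get_words(s))
    ['hey', 'you', 'what', 'are', 'you', 'doing', 'here']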

User · Answer

I had a similar dilemma and didn't want to use the 're' module.

    def my_split(s, seps):
        res = [s]
        for sep in seps:
            s, res = res, []
            for seq in s:
                res += seq.split(sep)
        return res

    print(my_split('1111  2222 3333;4444,5555;6666', [' ', ';', ',']))
    ['1111', '', '2222', '3333', '4444', '5555', '6666']

User · Answer

re.split()

    re.split(pattern, string[, maxsplit=0])

    Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list. (Incompatibility note: in the original Python 1.5 release, maxsplit was ignored. This has been fixed in later releases.)

    >>> re.split(r'\W+', 'Words, words, words.')
    ['Words', 'words', 'words', '']
    >>> re.split(r'(\W+)', 'Words, words, words.')
    ['Words', ', ', 'words', ', ', 'words', '.', '']
    >>> re.split(r'\W+', 'Words, words, words.', 1)
    ['Words', 'words, words.']

User · Answer

First, I want to agree with others that the regex or str.translate(...) based solutions are the most performant. For my use case the performance of this function wasn't significant, so I wanted to add ideas that I considered with that criterion.

My main goal was to generalize ideas from some of the other answers into one solution that could work for strings containing more than just regex words (i.e., blacklisting the explicit subset of punctuation characters vs whitelisting word characters).

Note that, in any approach, one might also consider using string.punctuation in place of a manually defined list.

Option 1 - re.sub

I was surprised to see no answer so far uses re.sub(...). I find it a simple and natural approach to this problem.

    import re

    my_str = "Hey, you - what are you doing here!?"

    words = re.split(r'\s+', re.sub(r'[,\-!?]', ' ', my_str).strip())

In this solution, I nested the call to re.sub(...) inside re.split(...), but if performance is critical, compiling the regex outside could be beneficial. For my use case, the difference wasn't significant, so I prefer simplicity and readability.

Option 2 - str.replace

This is a few more lines, but it has the benefit of being expandable without having to check whether you need to escape a certain character in regex.

    my_str = "Hey, you - what are you doing here!?"

    replacements = (',', '-', '!', '?')
    for r in replacements:
        my_str = my_str.replace(r, ' ')

    words = my_str.split()

It would have been nice to be able to map the str.replace to the string instead, but I don't think it can be done with immutable strings, and while mapping against a list of characters would work, running every replacement against every character sounds excessive. (Edit: See next option for a functional example.)

Option 3 - functools.reduce

(In Python 2, reduce is available in the global namespace without importing it from functools.)

    import functools

    my_str = "Hey, you - what are you doing here!?"

    replacements = (',', '-', '!', '?')
    my_str = functools.reduce(lambda s, sep: s.replace(sep, ' '), replacements, my_str)
    words = my_str.split()

User · Answer

Another way, without regex:

    import string

    punc = string.punctuation
    thestring = "Hey, you - what are you doing here!?"
    s = list(thestring)
    ''.join([o for o in s if not o in punc]).split()

User · Answer

A case where regular expressions are justified:

    import re

    DATA = "Hey, you - what are you doing here!?"
    print(re.findall(r"[\w']+", DATA))
    # Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

User · Answer

Here is my go at a split with multiple delimiters:

    def msplit(str, delims):
        w = ''
        for z in str:
            if z not in delims:
                w += z
            else:
                if len(w) > 0:
                    yield w
                w = ''
        if len(w) > 0:
            yield w

User · Answer

If you want a reversible operation (preserve the delimiters), you can use this function:

    def tokenizeSentence_Reversible(sentence):
        setOfDelimiters = ['.', ' ', ',', '*', ';', '!']
        listOfTokens = [sentence]

        for delimiter in setOfDelimiters:
            newListOfTokens = []
            for ind, token in enumerate(listOfTokens):
                ll = [([delimiter, w] if ind > 0 else [w]) for ind, w in enumerate(token.split(delimiter))]
                listOfTokens = [item for sublist in ll for item in sublist]  # flattens.
                listOfTokens = filter(None, listOfTokens)  # Removes empty tokens: ''
                newListOfTokens.extend(listOfTokens)

            listOfTokens = newListOfTokens

        return listOfTokens
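A regex-based way to get the same reversible behaviour is to put the delimiters in a capturing group, since re.split then returns the matched delimiters as list items too. This is a sketch of that alternative, not the function above; the name tokenize_reversible and the default delimiter string are assumptions, and the delimiters are assumed to be single characters with at least one given.

```python
import re

def tokenize_reversible(sentence, delimiters=". ,*;!"):
    """Split on single-character delimiters but keep them as tokens."""
    # The capturing group makes re.split return the delimiters as well.
    pattern = "([" + re.escape(delimiters) + "])"
    return [tok for tok in re.split(pattern, sentence) if tok]

sentence = "Hey, you - what are you doing here!"
tokens = tokenize_reversible(sentence)
print(tokens[:4])                    # ['Hey', ',', ' ', 'you']
print("".join(tokens) == sentence)   # True
```

Only empty strings are filtered out, so joining the tokens always reproduces the input exactly.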

User · Answer

I had the same problem as @ooboo and found this topic. @ghostdog74 inspired me; maybe someone will find my solution useful:

    str1 = 'adj:sg:nom:m1.m2.m3:pos'
    splitat = ':.'
    ''.join([s if s not in splitat else ' ' for s in str1]).split()

If you don't want to split at spaces, put some other character in place of the space and split on that same character.

User · Answer

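Using maketrans and translate you can do it easily and neatly (body is the text to split; in Python 2, use string.maketrans after import string):

    specials = ',.!?:;"()<>[]#$=-/'
    trans = str.maketrans(specials, ' ' * len(specials))
    body = body.translate(trans)
    words = body.strip().split()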

User · Answer

Create a function that takes as input two strings (the source string to be split and the splitlist string of delimiters) and outputs a list of split words:

    def split_string(source, splitlist):
        output = []  # output list of cleaned words
        atsplit = True
        for char in source:
            if char in splitlist:
                atsplit = True
            else:
                if atsplit:
                    output.append(char)  # append new word after split
                    atsplit = False
                else:
                    output[-1] = output[-1] + char  # continue copying characters until next split
        return output
