Find the nth occurrence of substring in a string

Question

This seems like it should be pretty trivial, but I am new at Python and want to do it the most Pythonic way. I want to find the index corresponding to the n'th occurrence of a substring within a string. There's got to be something equivalent to what I WANT to do, which is mystring.find("substring", 2nd). How can you achieve this in Python?

User · Answer

Simplest way (this finds the second occurrence):

    text = "This is a test from a test ok"
    firstTest = text.find('test')
    print(text.find('test', firstTest + 1))

User · Answer

I'd probably do something like this, using the find function that takes an index parameter:

    def find_nth(s, x, n):
        i = -len(x)  # so that the first search starts at index 0
        for _ in range(n):
            i = s.find(x, i + len(x))
            if i == -1:
                break
        return i

    print(find_nth('bananabanana', 'an', 3))

It's not particularly Pythonic I guess, but it's simple. You could do it using recursion instead:

    def find_nth(s, x, n, i=0):
        i = s.find(x, i)
        if n == 1 or i == -1:
            return i
        else:
            return find_nth(s, x, n - 1, i + len(x))

    print(find_nth('bananabanana', 'an', 3))

It's a functional way to solve it, but I don't know if that makes it more Pythonic.
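The iterative idea can be exercised on a few edge cases. This is a minimal sketch, assuming a 1-based n and non-overlapping matches; it starts from -len(x) so that a match at index 0 is not skipped.

```python
def find_nth(s, x, n):
    """Return the index of the n-th (1-based) non-overlapping
    occurrence of x in s, or -1 if there are fewer than n."""
    i = -len(x)  # first search therefore starts at index 0
    for _ in range(n):
        i = s.find(x, i + len(x))
        if i == -1:
            break
    return i


print(find_nth("bananabanana", "an", 3))  # 7
print(find_nth("foofoo", "foo", 1))       # 0 (match at the very start)
print(find_nth("foofoo", "foo", 3))       # -1 (only two occurrences)
```

Returning -1 mirrors what str.find itself does when nothing is found.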

User · Answer

Here's a more Pythonic version of the straightforward iterative solution:

    def find_nth(haystack, needle, n):
        start = haystack.find(needle)
        while start >= 0 and n > 1:
            start = haystack.find(needle, start + len(needle))
            n -= 1
        return start

Example:

    >>> find_nth("foofoofoofoo", "foofoo", 2)
    6

If you want to find the nth overlapping occurrence of needle, you can increment by 1 instead of len(needle), like this:

    def find_nth_overlapping(haystack, needle, n):
        start = haystack.find(needle)
        while start >= 0 and n > 1:
            start = haystack.find(needle, start + 1)
            n -= 1
        return start

Example:

    >>> find_nth_overlapping("foofoofoofoo", "foofoo", 2)
    3

This is easier to read than Mark's version, and it doesn't require the extra memory of the splitting version or importing the regular expression module. It also adheres to a few of the rules in the Zen of Python, unlike the various re approaches:

    Simple is better than complex.
    Flat is better than nested.
    Readability counts.

User · Answer

This will give you an array of the starting indices for matches in yourstring (here pattern is whatever you are searching for):

    import re
    indices = [s.start() for s in re.finditer(pattern, yourstring)]

Then your nth entry would be:

    n = 2
    nth_entry = indices[n-1]

Of course you have to be careful with the index bounds. You can get the number of instances like this:

    num_instances = len(indices)

User · Answer

Here's another re + itertools version that should work when searching for either a str or a RegexpObject. I will freely admit that this is likely over-engineered, but for some reason it entertained me.

    import itertools
    import re

    def find_nth(haystack, needle, n = 1):
        """
        Find the starting index of the nth occurrence of ``needle`` in
        ``haystack``.

        If ``needle`` is a ``str``, this will perform an exact substring
        match; if it is a ``RegexpObject``, this will perform a regex
        search.

        If ``needle`` doesn't appear in ``haystack``, return ``-1``. If
        ``needle`` doesn't appear in ``haystack`` ``n`` times,
        return ``-1``.

        Arguments
        ---------
        * ``needle`` the substring (or a ``RegexpObject``) to find
        * ``haystack`` is a ``str``
        * an ``int`` indicating which occurrence to find; defaults to ``1``

        >>> find_nth("foo", "o", 1)
        1
        >>> find_nth("foo", "o", 2)
        2
        >>> find_nth("foo", "o", 3)
        -1
        >>> find_nth("foo", "b")
        -1
        >>> import re
        >>> either_o = re.compile("[oO]")
        >>> find_nth("foo", either_o, 1)
        1
        >>> find_nth("FOO", either_o, 1)
        1
        """
        if hasattr(needle, 'finditer'):
            matches = needle.finditer(haystack)
        else:
            matches = re.finditer(re.escape(needle), haystack)
        start_here = itertools.dropwhile(lambda x: x[0] < n, enumerate(matches, 1))
        try:
            return next(start_here)[1].start()
        except StopIteration:
            return -1

User · Answer

    >>> s = "abcdefabcdefababcdef"
    >>> j = 0
    >>> for n, i in enumerate(s):
    ...     if s[n:n+2] == "ab":
    ...         print(n, i)
    ...         j = j + 1
    ...         if j == 2: print("2nd occurrence at index position:", n)
    ...
    0 a
    6 a
    2nd occurrence at index position: 6
    12 a
    14 a

User · Answer

Def:

    def get_first_N_words(mytext, mylen = 3):
        mylist = list(mytext.split())
        if len(mylist) >= mylen:
            return ' '.join(mylist[:mylen])

To use:

    get_first_N_words('  One Two Three Four ', 3)

Output:

    'One Two Three'

User · Answer

Mark's iterative approach would be the usual way, I think.

Here's an alternative with string-splitting, which can often be useful for finding-related processes:

    def findnth(haystack, needle, n):
        parts = haystack.split(needle, n+1)
        if len(parts) <= n+1:
            return -1
        return len(haystack) - len(parts[-1]) - len(needle)

And here's a quick (and somewhat dirty, in that you have to choose some chaff that can't match the needle) one-liner:

    'foo bar bar bar'.replace('bar', 'XXX', 1).find('bar')
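To see why the split-based arithmetic works, it helps to print the intermediate parts. The sketch below repeats the same function; note that n is 0-based here, so n=1 asks for the second occurrence.

```python
def findnth(haystack, needle, n):
    # Split at the first n+1 needles; everything after the (n+1)-th
    # needle ends up in the last part.
    parts = haystack.split(needle, n + 1)
    if len(parts) <= n + 1:
        return -1  # fewer than n+1 occurrences
    # The wanted needle ends exactly where the last part begins.
    return len(haystack) - len(parts[-1]) - len(needle)


haystack = "foo bar bar bar"
print(haystack.split("bar", 2))     # ['foo ', ' ', ' bar']
print(findnth(haystack, "bar", 1))  # 15 - 4 - 3 = 8
print(findnth(haystack, "bar", 3))  # -1, there are only three "bar"s
```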

User · Answer

Building on modle13's answer, but without the re module dependency:

    def iter_find(haystack, needle):
        return [i for i in range(0, len(haystack)) if haystack[i:].startswith(needle)]

I kinda wish this was a builtin string method.

    >>> iter_find("http://stackoverflow.com/questions/1883980/", '/')
    [5, 6, 24, 34, 42]
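The same list of indices can also be produced lazily with str.find, which avoids slicing the haystack at every position. This is only a sketch; the name iter_find_lazy is made up for illustration, and overlapping matches are reported because the search resumes one character after each hit.

```python
from itertools import islice


def iter_find_lazy(haystack, needle):
    """Yield the start index of every occurrence of needle in haystack."""
    start = haystack.find(needle)
    while start != -1:
        yield start
        start = haystack.find(needle, start + 1)


url = "http://stackoverflow.com/questions/1883980/"
print(list(iter_find_lazy(url, "/")))  # [5, 6, 24, 34, 42]

# Third occurrence (1-based) without building the full list;
# the default -1 is returned when there are fewer than three.
print(next(islice(iter_find_lazy(url, "/"), 2, None), -1))  # 24
```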

User · Answer

Here is another approach using re.finditer. The difference is that this only looks into the haystack as far as necessary. Note that enumerate counts from 0 here, so n is zero-based.

    from re import finditer
    from itertools import dropwhile
    needle = 'an'
    haystack = 'bananabanana'
    n = 2
    next(dropwhile(lambda x: x[0] < n, enumerate(finditer(needle, haystack))))[1].start()
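The same lazy idea can be sketched with itertools.islice, which skips the first n-1 matches directly. The function name find_nth_re and the 1-based convention are assumptions made for this example; re.escape is used so that the needle is treated as a literal string rather than a pattern.

```python
import re
from itertools import islice


def find_nth_re(haystack, needle, n):
    """Index of the n-th (1-based) non-overlapping match, or -1."""
    matches = re.finditer(re.escape(needle), haystack)
    # Consume only as many matches as needed; None if we run out.
    match = next(islice(matches, n - 1, None), None)
    return match.start() if match else -1


print(find_nth_re("bananabanana", "an", 3))  # 7
print(find_nth_re("a.b.c", ".", 2))          # 3, the dot is literal
print(find_nth_re("a.b.c", ".", 3))          # -1
```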

User · Answer

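    # return -1 if the nth substr (0-indexed) does not exist, else return its index
    def find_nth(s, substr, n):
        i = -1  # so that the first search starts at index 0
        while n >= 0:
            n -= 1
            i = s.find(substr, i + 1)
            if i == -1:
                break  # stop instead of wrapping around to the start
        return i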

User · Answer

For the special case where you search for the n'th occurrence of a character (i.e. a substring of length 1), the following function works by building a list of all positions of occurrences of the given character:

    def find_char_nth(string, char, n):
        """Find the n'th occurrence of a character within a string."""
        return [i for i, c in enumerate(string) if c == char][n-1]

If there are fewer than n occurrences of the given character, it will give IndexError: list index out of range.

This is derived from @Zv_oDD's answer and simplified for the case of a single character.

User · Answer

How about (with sep standing for the path separator):

    c = os.getcwd().split(sep)
    print(sep.join(c[0:-2]))

User · Answer

I'm offering some benchmarking results comparing the most prominent approaches presented so far, namely @bobince's findnth() (based on str.split()) vs. @tgamblin's or @Mark Byers' find_nth() (based on str.find()). I will also compare with a C extension (_find_nth.so) to see how fast we can go. Here is find_nth.py:

    def findnth(haystack, needle, n):
        parts = haystack.split(needle, n+1)
        if len(parts) <= n+1:
            return -1
        return len(haystack) - len(parts[-1]) - len(needle)

    def find_nth(s, x, n=0, overlap=False):
        l = 1 if overlap else len(x)
        i = -l
        for c in xrange(n + 1):
            i = s.find(x, i + l)
            if i < 0:
                break
        return i

Of course, performance matters most if the string is large, so suppose we want to find the 1000001st newline ('\n') in a 1.3 GB file called 'bigfile'. To save memory, we would like to work on an mmap.mmap object representation of the file:

    In [1]: import _find_nth, find_nth, mmap

    In [2]: f = open('bigfile', 'r')

    In [3]: mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

There is already the first problem with findnth(), since mmap.mmap objects don't support split(). So we actually have to copy the whole file into memory:

    In [4]: %time s = mm[:]
    CPU times: user 813 ms, sys: 3.25 s, total: 4.06 s
    Wall time: 17.7 s

Ouch! Fortunately s still fits in the 4 GB of memory of my Macbook Air, so let's benchmark findnth():

    In [5]: %timeit find_nth.findnth(s, '\n', 1000000)
    1 loops, best of 3: 29.9 s per loop

Clearly a terrible performance. Let's see how the approach based on str.find() does:

    In [6]: %timeit find_nth.find_nth(s, '\n', 1000000)
    1 loops, best of 3: 774 ms per loop

Much better! Clearly, findnth()'s problem is that it is forced to copy the string during split(), which is already the second time we copied the 1.3 GB of data around after s = mm[:]. Here comes in the second advantage of find_nth(): we can use it on mm directly, such that zero copies of the file are required:

    In [7]: %timeit find_nth.find_nth(mm, '\n', 1000000)
    1 loops, best of 3: 1.21 s per loop

There appears to be a small performance penalty operating on mm vs. s, but this illustrates that find_nth() can get us an answer in 1.2 s compared to findnth()'s total of 47 s.

I found no cases where the str.find() based approach was significantly worse than the str.split() based approach, so at this point, I would argue that @tgamblin's or @Mark Byers' answer should be accepted instead of @bobince's.

In my testing, the version of find_nth() above was the fastest pure Python solution I could come up with (very similar to @Mark Byers' version). Let's see how much better we can do with a C extension module. Here is _find_nthmodule.c:

    #include <Python.h>
    #include <string.h>

    off_t _find_nth(const char *buf, size_t l, char c, int n) {
        off_t i;
        for (i = 0; i < l; ++i) {
            if (buf[i] == c && n-- == 0) {
                return i;
            }
        }
        return -1;
    }

    off_t _find_nth2(const char *buf, size_t l, char c, int n) {
        const char *b = buf - 1;
        do {
            /* search only the bytes that remain in the buffer */
            b = memchr(b + 1, c, l - (size_t)(b + 1 - buf));
            if (!b) return -1;
        } while (n--);
        return b - buf;
    }

    /* mmap_object is private in mmapmodule.c - replicate beginning here */
    typedef struct {
        PyObject_HEAD
        char *data;
        size_t size;
    } mmap_object;

    typedef struct {
        const char *s;
        size_t l;
        char c;
        int n;
    } params;

    int parse_args(PyObject *args, params *P) {
        PyObject *obj;
        const char *x;

        if (!PyArg_ParseTuple(args, "Osi", &obj, &x, &P->n)) {
            return 1;
        }
        PyTypeObject *type = Py_TYPE(obj);

        if (type == &PyString_Type) {
            P->s = PyString_AS_STRING(obj);
            P->l = PyString_GET_SIZE(obj);
        } else if (!strcmp(type->tp_name, "mmap.mmap")) {
            mmap_object *m_obj = (mmap_object*) obj;
            P->s = m_obj->data;
            P->l = m_obj->size;
        } else {
            PyErr_SetString(PyExc_TypeError, "Cannot obtain char * from argument 0");
            return 1;
        }
        P->c = x[0];
        return 0;
    }

    static PyObject* py_find_nth(PyObject *self, PyObject *args) {
        params P;
        if (!parse_args(args, &P)) {
            return Py_BuildValue("i", _find_nth(P.s, P.l, P.c, P.n));
        } else {
            return NULL;
        }
    }

    static PyObject* py_find_nth2(PyObject *self, PyObject *args) {
        params P;
        if (!parse_args(args, &P)) {
            return Py_BuildValue("i", _find_nth2(P.s, P.l, P.c, P.n));
        } else {
            return NULL;
        }
    }

    static PyMethodDef methods[] = {
        {"find_nth", py_find_nth, METH_VARARGS, ""},
        {"find_nth2", py_find_nth2, METH_VARARGS, ""},
        {0}
    };

    PyMODINIT_FUNC init_find_nth(void) {
        Py_InitModule("_find_nth", methods);
    }

Here is the setup.py file:

    from distutils.core import setup, Extension
    module = Extension('_find_nth', sources=['_find_nthmodule.c'])
    setup(ext_modules=[module])

Install as usual with python setup.py install. The C code plays at an advantage here since it is limited to finding single characters, but let's see how fast this is:

    In [8]: %timeit _find_nth.find_nth(mm, '\n', 1000000)
    1 loops, best of 3: 218 ms per loop

    In [9]: %timeit _find_nth.find_nth(s, '\n', 1000000)
    1 loops, best of 3: 216 ms per loop

    In [10]: %timeit _find_nth.find_nth2(mm, '\n', 1000000)
    1 loops, best of 3: 307 ms per loop

    In [11]: %timeit _find_nth.find_nth2(s, '\n', 1000000)
    1 loops, best of 3: 304 ms per loop

Clearly quite a bit faster still. Interestingly, there is no difference on the C level between the in-memory and mmapped cases. It is also interesting to see that _find_nth2(), which is based on string.h's memchr() library function, loses out against the straightforward implementation in _find_nth(): the additional "optimizations" in memchr() are apparently backfiring.

In conclusion, the implementation in findnth() (based on str.split()) is really a bad idea, since (a) it performs terribly for larger strings due to the required copying, and (b) it doesn't work on mmap.mmap objects at all. The implementation in find_nth() (based on str.find()) should be preferred in all circumstances (and therefore be the accepted answer to this question).

There is still quite a bit of room for improvement, since the C extension ran almost a factor of 4 faster than the pure Python code, indicating that there might be a case for a dedicated Python library function.
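For readers who want to repeat the comparison on a smaller scale, here is a rough sketch using the standard timeit module on a synthetic in-memory string. The function names findnth_split and find_nth_find, the string size and the repeat count are all choices made for this example; the timings will differ from the figures above and depend on the machine.

```python
import timeit


def findnth_split(haystack, needle, n):
    """0-based n, based on str.split (copies the tail of the string)."""
    parts = haystack.split(needle, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(haystack) - len(parts[-1]) - len(needle)


def find_nth_find(s, x, n=0):
    """0-based n, based on repeated str.find (no copying)."""
    i = -len(x)
    for _ in range(n + 1):
        i = s.find(x, i + len(x))
        if i < 0:
            break
    return i


haystack = "line\n" * 200_000  # about 1 MB of synthetic data
n = 100_000

# Both approaches must agree before timing them.
assert findnth_split(haystack, "\n", n) == find_nth_find(haystack, "\n", n)

for fn in (findnth_split, find_nth_find):
    seconds = timeit.timeit(lambda: fn(haystack, "\n", n), number=5)
    print(f"{fn.__name__}: {seconds:.4f} s for 5 runs")
```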

User · Answer

Understanding that regex is not always the best solution, I'd probably use one here:

    >>> import re
    >>> s = "ababdfegtduab"
    >>> [m.start() for m in re.finditer(r"ab", s)]
    [0, 2, 11]
    >>> [m.start() for m in re.finditer(r"ab", s)][2]  # index 2 is third occurrence
    11

User · Answer

Here is my solution for finding the nth occurrence of b in string a:

    from functools import reduce

    def findNth(a, b, n):
        return reduce(lambda x, y: -1 if y > x + 1 else a.find(b, x + 1), range(n), -1)

It is pure Python and iterative. For 0 or an n that is too large, it returns -1. It is a one-liner and can be used directly. Here is an example:

    >>> reduce(lambda x, y: -1 if y > x + 1 else 'bibarbobaobaotang'.find('b', x + 1), range(4), -1)
    7

User · Answer

Solution without using loops and recursion.

Use the required pattern in the compile method and enter the desired occurrence in the variable n; the last statement will print the starting index of the nth occurrence of the pattern in the given string. Here the result of finditer, i.e. an iterator, is converted to a list and the nth index is accessed directly (the index is zero-based, so n = 2 selects the third match).

    import re
    n = 2
    sampleString = "this is history"
    pattern = re.compile("is")
    matches = pattern.finditer(sampleString)
    print(list(matches)[n].span()[0])

User · Answer

Providing another "tricky" solution, which uses split and join. In your example, we can use:

    len("substring".join([s for s in ori.split("substring")[:2]]))

User · Answer

The replace one-liner is great, but it only works because XXX and bar have the same length. A good and general def would be:

    def findN(s, sub, N, replaceString="XXX"):
        return s.replace(sub, replaceString, N-1).find(sub) - (len(replaceString) - len(sub)) * (N-1)
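A worked example makes the offset correction easier to follow. The sketch below uses a hypothetical name find_n and adds one guard that is not in the formula above: when the substring is not found, -1 is returned as is rather than being shifted by the offset. It still assumes the chaff cannot create or contain the substring.

```python
def find_n(s, sub, n, chaff="XXX"):
    """Index of the n-th (1-based) occurrence of sub in s, or -1."""
    # Replace the first n-1 occurrences so that find() lands on the n-th.
    pos = s.replace(sub, chaff, n - 1).find(sub)
    if pos == -1:
        return -1
    # Each replacement shifted later text by len(chaff) - len(sub).
    return pos - (len(chaff) - len(sub)) * (n - 1)


s = "foo bar bar bar"
print(s.replace("bar", "XXXXX", 1))   # foo XXXXX bar bar
print(find_n(s, "bar", 2, "XXXXX"))   # 10 - (5 - 3) * 1 = 8
print(find_n(s, "bar", 4, "XXXXX"))   # -1
```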

User · Answer

This is the answer you really want:

    def Find(String, ToFind, Occurence=1):
        index = 0
        count = 0
        while index <= len(String):
            # compare the slice starting at index with the search string
            if String[index:index + len(ToFind)] == ToFind:
                count += 1
            if count == Occurence:
                return index
            index += 1
        return False

User · Answer

This will find the second occurrence of substring in string:

    def find_2nd(string, substring):
        return string.find(substring, string.find(substring) + 1)

Edit: I haven't thought much about the performance, but a quick recursion can help with finding the nth occurrence:

    def find_nth(string, substring, n):
        if n == 1:
            return string.find(substring)
        previous = find_nth(string, substring, n - 1)
        if previous == -1:
            return -1  # fewer than n occurrences; do not restart from index 0
        return string.find(substring, previous + 1)

User · Answer

Avoid a failure or incorrect output when the input value for the occurrence is higher than the actual count of occurrences. For example, if you check for the 3rd occurrence of 'o' in the string 'overflow' (it has only 2 occurrences), the code below will print a message indicating that the occurrence value has been exceeded:

    Input Occurrence entered has exceeded the actual count of Occurrence

    def check_nth_occurrence(string, substr, n):
        ## Count the Occurrence of a substr
        cnt = 0
        for i in string:
            if i == substr:
                cnt = cnt + 1
            else:
                pass

        ## Check if the Occurrence input has exceeded the actual count of Occurrence
        if n > cnt:
            print('Input Occurrence entered has exceeded the actual count of Occurrence')
            return

        ## Get the Index value for first Occurrence of the substr
        index = string.find(substr)

        ## Get the Index value for nth Occurrence of Index
        while index >= 0 and n > 1:
            index = string.find(substr, index + 1)
            n -= 1
        return index
