Select by partial string from a pandas DataFrame

Question

I have a DataFrame with 4 columns, of which 2 contain string values. I was wondering if there was a way to select rows based on a partial string match against a particular column.

In other words, a function or lambda function that would do something like

    re.search(pattern, cell_in_question)

returning a boolean. I am familiar with the syntax of df[df['A'] == "hello world"] but can't seem to find a way to do the same with a partial string match, say 'hello'.

Would someone be able to point me in the right direction?

User · Answer

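Here's what I ended up doing for partial string matches. If anyone has a more efficient way of doing this, please let me know.

    import re
    from pandas import DataFrame, concat

    def stringSearchColumn_DataFrame(df, colName, regex):
        newdf = DataFrame()
        # Series.items() replaces iteritems(), which was removed in pandas 2.0
        for idx, record in df[colName].items():
            if re.search(regex, record):
                # prepend every row whose value equals the matching record
                newdf = concat([df[df[colName] == record], newdf], ignore_index=True)
        return newdf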
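For what it's worth, the loop above can probably be replaced by a single boolean mask. This is only a minimal sketch: the helper name `select_by_regex` and the sample data are made up for illustration, and unlike the loop it keeps the original row order and does not duplicate rows when a value occurs more than once.

```python
import pandas as pd


def select_by_regex(df, col_name, regex):
    """Return the rows of df whose col_name value matches regex anywhere."""
    # str.contains searches anywhere in the string (like re.search);
    # na=False counts missing values as non-matches.
    mask = df[col_name].str.contains(regex, regex=True, na=False)
    return df[mask].reset_index(drop=True)


# Hypothetical sample data
df = pd.DataFrame(
    {"A": ["hello world", "goodbye", "say hello", None], "B": [1, 2, 3, 4]}
)
result = select_by_regex(df, "A", r"hello")
print(result)
```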

User · Answer

I tried the proposed solution above:

    df[df["A"].str.contains("Hello|Britain")]

and got an error:

    ValueError: cannot mask with array containing NA / NaN values

You can transform NA values into False, like this:

    df[df["A"].str.contains("Hello|Britain", na=False)]
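To see what na=False changes, here is a small sketch with made-up data (the column contents are an assumption, not the original poster's frame). Missing entries simply count as "no match", so the mask is plain booleans and can be used for indexing.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in column "A"
df = pd.DataFrame({"A": ["Hello there", np.nan, "Great Britain", "France"]})

# na=False fills the result for missing values with False
mask = df["A"].str.contains("Hello|Britain", na=False)
matches = df[mask]

print(mask.tolist())
print(matches)
```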

User · Answer

A more generalised example, if looking for parts of a word OR specific words in a string:

    df = pd.DataFrame([('cat andhat', 1000.0), ('hat', 2000000.0), ('the small dog', 1000.0), ('fog', 330000.0), ('pet', 330000.0)], columns=['col1', 'col2'])

Specific parts of sentence or word:

    searchfor = '.*cat.*hat.*|.*the.*dog.*'

Create a column showing the affected rows (you can always filter out as necessary):

    df["TrueFalse"] = df['col1'].str.contains(searchfor, regex=True)

        col1             col2           TrueFalse
    0   cat andhat       1000.0         True
    1   hat              2000000.0      False
    2   the small dog    1000.0         True
    3   fog              330000.0       False
    4   pet              330000.0       False

User · Answer

There are answers before this which accomplish the requested feature; anyway, I would like to show the most general way:

    df.filter(regex=".*STRING_YOU_LOOK_FOR.*")

This way lets you get the column you are looking for, however its name is written. Note that filter matches against the labels (column names by default), not the cell values.

(Obviously, you have to write the proper regex for each case.)
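A quick sketch of what this does, with made-up column names. As far as I know, filter applies the pattern with re.search, so the surrounding ".*" is optional and an anchor such as "^" can be used to restrict the match to the start of the label.

```python
import pandas as pd

# Hypothetical frame; only the column labels matter here
df = pd.DataFrame({"hello_a": [1, 2], "b_hello": [3, 4], "other": [5, 6]})

# Keep columns whose label contains "hello" anywhere
any_hello = df.filter(regex="hello")

# Keep columns whose label starts with "hello"
starts_hello = df.filter(regex="^hello")

print(any_hello.columns.tolist())
print(starts_hello.columns.tolist())
```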

User · Answer

Should you need to do a case-insensitive search for a string in a pandas DataFrame column:

    df[df['A'].str.contains("hello", case=False)]
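A small sketch with invented data. case=False and flags=re.IGNORECASE should give the same mask for a plain pattern like this one; na=False is added here only because the sample contains a missing value.

```python
import re

import pandas as pd

# Hypothetical data
df = pd.DataFrame({"A": ["Hello world", "HELLO", "goodbye", None]})

# Two ways to ignore case
by_case = df["A"].str.contains("hello", case=False, na=False)
by_flag = df["A"].str.contains("hello", flags=re.IGNORECASE, na=False)

result = df[by_case]
print(result)
```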

User · Answer

How do I select by partial string from a pandas DataFrame?

This post is meant for readers who want to:

- search for a substring in a string column (the simplest case)
- search for multiple substrings (similar to isin)
- match a whole word from text (e.g., "blue" should match "the sky is blue" but not "bluejay")
- match multiple whole words
- understand the reason behind "ValueError: cannot index with vector containing NA / NaN values"

...and would like to know more about which methods should be preferred over others.

(P.S.: I've seen a lot of questions on similar topics, so I thought it would be good to leave this here.)

Friendly disclaimer: this post is long.

Basic Substring Search

    # setup
    df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})
    df1

          col
    0     foo
    1  foobar
    2     bar
    3     baz

str.contains can be used to perform either substring searches or regex-based searches. The search defaults to regex-based unless you explicitly disable it.

Here is an example of a regex-based search:

    # find rows in `df1` which contain "foo" followed by something
    df1[df1['col'].str.contains(r'foo(?!$)')]

          col
    1  foobar

Sometimes regex search is not required, so specify regex=False to disable it.

    # select all rows containing "foo"
    df1[df1['col'].str.contains('foo', regex=False)]
    # same as df1[df1['col'].str.contains('foo')] but faster.

          col
    0     foo
    1  foobar

Performance-wise, regex search is slower than substring search:

    df2 = pd.concat([df1] * 1000, ignore_index=True)

    %timeit df2[df2['col'].str.contains('foo')]
    %timeit df2[df2['col'].str.contains('foo', regex=False)]

    6.31 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    2.8 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Avoid regex-based search if you don't need it.

Addressing ValueErrors

Sometimes, performing a substring search and filtering on the result will raise

    ValueError: cannot index with vector containing NA / NaN values

This is usually because of mixed data or NaNs in your object column:

    s = pd.Series(['foo', 'foobar', np.nan, 'bar', 'baz', 123])
    s.str.contains('foo|bar')

    0     True
    1     True
    2      NaN
    3     True
    4    False
    5      NaN
    dtype: object

    s[s.str.contains('foo|bar')]
    # ---------------------------------------------------------------------------
    # ValueError                                Traceback (most recent call last)

String methods cannot be applied to anything that is not a string, so the result for those elements is NaN (naturally). In this case, specify na=False to treat non-string data as non-matches:

    s.str.contains('foo|bar', na=False)

    0     True
    1     True
    2    False
    3     True
    4    False
    5    False
    dtype: bool

How do I apply this to multiple columns at once?

The answer is in the question. Use DataFrame.apply (here df is a DataFrame with string columns A and B):

    # `axis=1` passes each row to the lambda; the check itself is element-wise,
    # so the result has one boolean per cell.
    df.apply(lambda row: row.str.contains('foo|bar', na=False), axis=1)

           A      B
    0   True   True
    1   True  False
    2  False   True
    3   True  False
    4  False  False
    5  False  False

All of the solutions below can be "applied" to multiple columns using apply in the same way, which is OK in my book, as long as you don't have too many columns.

If you have a DataFrame with mixed columns and want to select only the object/string columns, take a look at select_dtypes.

Multiple Substring Search

This is most easily achieved through a regex search using the regex OR pipe.

    # Slightly modified example.
    df4 = pd.DataFrame({'col': ['foo abc', 'foobar xyz', 'bar32', 'baz 45']})
    df4

              col
    0     foo abc
    1  foobar xyz
    2       bar32
    3      baz 45

    df4[df4['col'].str.contains(r'foo|baz')]

              col
    0     foo abc
    1  foobar xyz
    3      baz 45

You can also create a list of terms, then join them:

    terms = ['foo', 'baz']
    df4[df4['col'].str.contains('|'.join(terms))]

              col
    0     foo abc
    1  foobar xyz
    3      baz 45

Sometimes, it is wise to escape your terms in case they have characters that can be interpreted as regex metacharacters. If your terms contain any of the following characters:

    . ^ $ * + ? { } [ ] \ | ( )

then you'll need to use re.escape to escape them:

    import re
    df4[df4['col'].str.contains('|'.join(map(re.escape, terms)))]

              col
    0     foo abc
    1  foobar xyz
    3      baz 45

re.escape has the effect of escaping the special characters so they're treated literally.

    re.escape(r'.foo^')
    # '\\.foo\\^'

Matching Entire Word(s)

By default, the substring search looks for the specified substring/pattern regardless of whether it is a full word or not. To match only full words, we need regular expressions; in particular, the pattern must specify word boundaries (\b).

For example,

    df3 = pd.DataFrame({'col': ['the sky is blue', 'bluejay by the window']})
    df3

                         col
    0        the sky is blue
    1  bluejay by the window

Now consider

    df3[df3['col'].str.contains('blue')]

                         col
    0        the sky is blue
    1  bluejay by the window

versus

    df3[df3['col'].str.contains(r'\bblue\b')]

                   col
    0  the sky is blue

Multiple Whole Word Search

Similar to the above, except we add a word boundary (\b) to the joined pattern.

    p = r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
    df4[df4['col'].str.contains(p)]

           col
    0  foo abc
    3   baz 45

where p looks like this:

    p
    # '\\b(?:foo|baz)\\b'

A Great Alternative: Use List Comprehensions!

Because you can! And you should! They are usually a little bit faster than string methods, because string methods are hard to vectorise and usually have loopy implementations.

Instead of

    df1[df1['col'].str.contains('foo', regex=False)]

use the in operator inside a list comprehension:

    df1[['foo' in x for x in df1['col']]]

          col
    0     foo
    1  foobar

Instead of

    regex_pattern = r'foo(?!$)'
    df1[df1['col'].str.contains(regex_pattern)]

use re.compile (to cache your regex) plus Pattern.search inside a list comprehension:

    p = re.compile(regex_pattern, flags=re.IGNORECASE)
    df1[[bool(p.search(x)) for x in df1['col']]]

          col
    1  foobar

If "col" has NaNs, then instead of

    df1[df1['col'].str.contains(regex_pattern, na=False)]

use

    def try_search(p, x):
        try:
            return bool(p.search(x))
        except TypeError:
            return False

    p = re.compile(regex_pattern)
    df1[[try_search(p, x) for x in df1['col']]]

          col
    1  foobar

More Options for Partial String Matching: np.char.find, np.vectorize, DataFrame.query

In addition to str.contains and list comprehensions, you can also use the following alternatives.

np.char.find
Supports substring searches (read: no regex) only.

    df4[np.char.find(df4['col'].values.astype(str), 'foo') > -1]

              col
    0     foo abc
    1  foobar xyz

np.vectorize
This is a wrapper around a loop, but with less overhead than most pandas str methods.

    f = np.vectorize(lambda haystack, needle: needle in haystack)
    f(df1['col'], 'foo')
    # array([ True,  True, False, False])

    df1[f(df1['col'], 'foo')]

          col
    0     foo
    1  foobar

Regex solutions are possible too:

    regex_pattern = r'foo(?!$)'
    p = re.compile(regex_pattern)
    f = np.vectorize(lambda x: pd.notna(x) and bool(p.search(x)))
    df1[f(df1['col'])]

          col
    1  foobar

DataFrame.query
Supports string methods through the python engine. This offers no visible performance benefits, but is nonetheless useful to know if you need to dynamically generate your queries.

    df1.query('col.str.contains("foo")', engine='python')

          col
    0     foo
    1  foobar

More information on the query and eval family of methods can be found at "Dynamic Expression Evaluation in pandas using pd.eval()".

Recommended Usage Precedence

1. (First) str.contains, for its simplicity and ease of handling NaNs and mixed data
2. List comprehensions, for their performance (especially if your data is purely strings)
3. np.vectorize
4. (Last) df.query
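The escaping and word-boundary steps from this answer can be wrapped into one small helper. This is just a sketch: the function name `whole_word_mask` is made up, and it assumes the terms are plain words rather than regex fragments.

```python
import re

import pandas as pd


def whole_word_mask(series, terms):
    """Boolean mask: True where any of `terms` occurs as a whole word."""
    # Escape each term so metacharacters are matched literally, join with
    # the regex OR pipe, and wrap in word boundaries. The group is
    # non-capturing, so pandas has no capture groups to complain about.
    pattern = r"\b(?:{})\b".format("|".join(map(re.escape, terms)))
    # na=False treats NaN / non-string entries as non-matches
    return series.str.contains(pattern, na=False)


df4 = pd.DataFrame({"col": ["foo abc", "foobar xyz", "bar32", "baz 45"]})
mask = whole_word_mask(df4["col"], ["foo", "baz"])
result = df4[mask]
print(result)
```

"foobar xyz" is excluded because "foo" is not followed by a word boundary there.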

User · Answer

Based on GitHub issue #620, it looks like you'll soon be able to do the following:

    df[df['A'].str.contains("hello")]

Update: vectorized string methods (i.e., Series.str) are available in pandas 0.8.1 and up.

User · Answer

Maybe you want to search for some text in all columns of the pandas DataFrame, and not just in a subset of them. In this case, the following code will help:

    df[df.apply(lambda row: row.astype(str).str.contains('String To Find').any(), axis=1)]

Warning: this method is relatively slow, albeit convenient.
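Here is a sketch of the same idea on made-up data, together with a column-wise variant that builds one mask per column and then combines them. I would expect the column-wise version to be quicker on tall frames because it makes one str.contains call per column instead of one per row, but I have not benchmarked it. regex=False is used because the needle is a literal string.

```python
import pandas as pd

# Hypothetical data with mixed column types
df = pd.DataFrame(
    {
        "name": ["alice", "bob", "carol"],
        "city": ["Paris", "Bobtown", "Rome"],
        "age": [30, 41, 25],
    }
)
needle = "ob"

# Row-wise: cast each row to strings and check whether any cell matches
mask = df.apply(
    lambda row: row.astype(str).str.contains(needle, regex=False).any(), axis=1
)

# Column-wise variant: one mask per column, then "any" across columns
fast_mask = (
    df.astype(str)
    .apply(lambda col: col.str.contains(needle, regex=False))
    .any(axis=1)
)

result = df[mask]
print(result)
```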

User · Answer

Quick note: if you want to do selection based on a partial string contained in the index, try the following:

    df['stridx'] = df.index
    df[df['stridx'].str.contains("Hello|Britain")]
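If you would rather not add a helper column, a string index exposes the same .str accessor, so the mask can be built from the index directly. A minimal sketch with invented index labels:

```python
import pandas as pd

# Hypothetical frame with a string index
df = pd.DataFrame(
    {"value": [1, 2, 3]}, index=["Hello world", "Great Britain", "France"]
)

# Build the mask from the index labels themselves
mask = df.index.str.contains("Hello|Britain")
result = df[mask]
print(result)
```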

User · Answer

Using contains didn't work well for my string with special characters. find worked, though:

    df[df['A'].str.find("hello") != -1]
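The likely reason is that contains treats the pattern as a regex by default, so characters like "(" or "+" get special meaning. A sketch with made-up data showing three ways to search for the literal text: regex=False, re.escape, and find.

```python
import re

import pandas as pd

# Hypothetical data; "(USD)" contains regex metacharacters
df = pd.DataFrame({"A": ["price (USD)", "price USD", "a+b", "ab"]})
needle = "(USD)"

# 1. Turn regex interpretation off
literal = df["A"].str.contains(needle, regex=False)

# 2. Keep regex on, but escape the metacharacters first
escaped = df["A"].str.contains(re.escape(needle))

# 3. str.find returns -1 when the substring is absent
found = df["A"].str.find(needle) != -1

print(df[literal])
```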

User · Answer

If anyone wonders how to solve a related problem, "Select column by partial string", use:

    df.filter(like='hello')  # select columns which contain the word hello

And to select rows by partial string matching, pass axis=0 to filter:

    # selects rows which contain the word hello in their index label
    df.filter(like='hello', axis=0)
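A tiny sketch of both calls on an invented frame. Note that filter looks only at labels (column names or index labels), never at cell values.

```python
import pandas as pd

# Hypothetical frame: one column label and one index label contain "hello"
df = pd.DataFrame(
    {"hello_count": [1, 2], "other": [3, 4]}, index=["say hello", "goodbye"]
)

cols = df.filter(like="hello")          # matches column labels
rows = df.filter(like="hello", axis=0)  # matches index labels

print(cols)
print(rows)
```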

User · Answer

Say you have the following DataFrame:

    >>> df = pd.DataFrame([["hello", "hello world"], ["abcd", "defg"]], columns=["a", "b"])
    >>> df
           a            b
    0  hello  hello world
    1   abcd         defg

You can always use the in operator in a lambda expression to create your filter:

    >>> df.apply(lambda x: x["a"] in x["b"], axis=1)
    0     True
    1    False
    dtype: bool

The trick here is to use the axis=1 option in apply to pass elements to the lambda function row by row, as opposed to column by column.
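The same row-by-row check can also be written as a list comprehension over the two columns, which avoids building a Series for every row. A minimal sketch on the frame from this answer, assuming both columns hold only strings:

```python
import pandas as pd

df = pd.DataFrame(
    [["hello", "hello world"], ["abcd", "defg"]], columns=["a", "b"]
)

# Pair up the values of "a" and "b" and test substring membership per row
mask = [a in b for a, b in zip(df["a"], df["b"])]
result = df[mask]
print(result)
```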
