How to filter Pandas dataframe using in and not in like in SQL

Question

How can I achieve the equivalents of SQL s IN and NOT IN  I have a list with the required values  Here s the scenario  df   pd DataFrame   country     US    UK    Germany    China     countries to keep     UK    China      pseudo-code  df df  country   not in countries to keep   My current way of doing this is as follows  df   pd DataFrame   country     US    UK    Germany    China     df2   pd DataFrame   country     UK    China     matched   True      IN df merge df2  how  inner   on  country      NOT IN not in   df merge df2  how  left   on  country   not in   not in pd isnull not in  matched      But this seems like a horrible kludge  Can anyone improve on it

User · Answer

Alternative solution that uses  query   method   In  5   df query  countries in  countries   Out 5     countries 1        UK 3     China  In  6   df query  countries not in  countries   Out 6     countries 0        US 2   Germany

User · Answer

How to implement  in  and  not in  for a pandas DataFrame    Pandas offers two methods  Series isin and DataFrame isin for Series and DataFrames  respectively     Filter DataFrame Based on ONE Column  also applies to Series   The most common scenario is applying an isin condition on a specific column to filter rows in a DataFrame   df   pd DataFrame   countries     US    UK    Germany   np nan   China     df   countries 0        US 1        UK 2   Germany 3     China  c1     UK    China                 list c2     Germany                     set c3   pd Series   China    US       Series c4   np array   US    UK           array     Series isin accepts various types as inputs  The following are all valid ways of getting what you want   df  countries   isin c1   0    False 1     True 2    False 3    False 4     True Name  countries  dtype  bool     in  operation df df  countries   isin c1      countries 1        UK 4     China     not in  operation df  df  countries   isin c1      countries 0        US 2   Germany 3       NaN       Filter with  set   tuples work too  df df  countries   isin c2      countries 2   Germany       Filter with another Series df df  countries   isin c3      countries 0        US 4     China       Filter with array df df  countries   isin c4      countries 0        US 1        UK     Filter on MANY Columns  Sometimes  you will want to apply an  in  membership check with some search terms over multiple columns   df2   pd DataFrame        A     x    y    z    q     B     w    a   np nan   x     C   np arange 4    df2     A    B  C 0  x    w  0 1  y    a  1 2  z  NaN  2 3  q    x  3  c1     x    w    p     To apply the isin condition to both columns  A  and  B   use DataFrame isin   df2   A    B    isin c1         A      B 0   True   True 1  False  False 2  False  False 3  False   True   From this  to retain rows where at least one column is True  we can use any along the first axis   df2   A    B    isin c1  any axis 1   0     True 1    False 2    False 3     True dtype  bool  df2 df2   A    B    isin c1  any axis 1       A  B  C 0  x  w  0 3  q  x  3   Note that if you want to search every column  you d just omit the column selection step and do   df2 isin c1  any axis 1    Similarly  to retain rows where ALL columns are True  use all in the same manner as before   df2 df2   A    B    isin c1  all axis 1       A  B  C 0  x  w  0     Notable Mentions  numpy isin  query  list comprehensions  string data   In addition to the methods described above  you can also use the numpy equivalent  numpy isin      in  operation df np isin df  countries    c1      countries 1        UK 4     China     not in  operation df np isin df  countries    c1  invert True      countries 0        US 2   Germany 3       NaN   Why is it worth considering  NumPy functions are usually a bit faster than their pandas equivalents because of lower overhead  Since this is an elementwise operation that does not depend on index alignment  there are very few situations where this method is not an appropriate replacement for pandas  isin   Pandas routines are usually iterative when working with strings  because string operations are hard to vectorise  There is a lot of evidence to suggest that list comprehensions will be faster here   We resort to an in check now    c1 set   set c1    Using  in  with  sets  is a constant time operation                        This doesn t matter for pandas because the implementation differs     in  operation df  x in c1 set for x in df  countries        countries 1        UK 4     China     not in  operation df  x not in c1 set for x in df  countries        countries 0        US 2   Germany 3       NaN   It is a lot more unwieldy to specify  however  so don t use it unless you know what you re doing   Lastly  there s also DataFrame query which has been covered in this answer  numexpr FTW

User · Answer

I ve been usually doing generic filtering over rows like this   criterion   lambda row  row  countries   not in countries not in   df df apply criterion  axis 1

User · Answer

df   pd DataFrame   countries    US   UK   Germany   China     countries     UK   China     implement in       df df countries isin countries     implement not in as in of rest countries   df df countries isin  x for x in np unique df countries  if x not in countries

User · Answer

A trick if you want to keep the order of the list  df   pd DataFrame   country     US    UK    Germany    China     countries to keep     Germany    US     ind  df index df  country    i  tolist   for i in countries to keep  flat ind  item for sublist in ind for item in sublist   df reindex flat ind      country 2  Germany 0       US

User · Answer

Collating possible solutions from the answers   For IN  df df  A   isin  3  6     For NOT IN    df -df  A   isin  3  6    df  df  A   isin  3  6    df df  A   isin  3  6      False  df np logical not df  A   isin  3  6

User · Answer

I wanted to filter out dfbc rows that had a BUSINESS ID that was also in the BUSINESS ID of dfProfilesBusIds  dfbc   dfbc  dfbc  BUSINESS ID   isin dfProfilesBusIds  BUSINESS ID

User · Answer

You can use pd Series isin  For  quot IN quot  use  something isin somewhere  Or for  quot NOT IN quot    something isin somewhere  As a worked example  import pandas as pd   gt  gt  gt  df   country 0        US 1        UK 2   Germany 3     China  gt  gt  gt  countries to keep   UK    China    gt  gt  gt  df country isin countries to keep  0    False 1     True 2    False 3     True Name  country  dtype  bool  gt  gt  gt  df df country isin countries to keep     country 1        UK 3     China  gt  gt  gt  df  df country isin countries to keep     country 0        US 2   Germany

[python] How to filter Pandas dataframe using 'in' and 'not in' like in SQL

Examples related to python

Examples related to pandas

Examples related to dataframe

Examples related to sql-function