pandas get rows which are NOT in other dataframe

Question

I have two pandas DataFrames that have some rows in common.

Suppose dataframe2 is a subset of dataframe1. How can I get the rows of dataframe1 which are not in dataframe2?

    df1 = pandas.DataFrame(data={'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
    df2 = pandas.DataFrame(data={'col1': [1, 2, 3], 'col2': [10, 11, 12]})

df1

       col1  col2
    0     1    10
    1     2    11
    2     3    12
    3     4    13
    4     5    14

df2

       col1  col2
    0     1    10
    1     2    11
    2     3    12

Expected result:

       col1  col2
    3     4    13
    4     5    14

User · Answer

Extract the dissimilar rows using the merge function:

    df = df.merge(same.drop_duplicates(), on=['col1', 'col2'],
                  how='left', indicator=True)

Save the dissimilar rows to CSV:

    df[df['_merge'] == 'left_only'].to_csv('output.csv')
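For reference, here is a minimal runnable sketch of the same idea on the data from the question. The names `merged` and `only_in_df1` are illustrative and not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3, 4, 5], "col2": [10, 11, 12, 13, 14]})
df2 = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 11, 12]})

# Left join on both columns; indicator=True adds a "_merge" column
# whose value is "both" or "left_only" for each row of df1.
merged = df1.merge(df2.drop_duplicates(), on=["col1", "col2"],
                   how="left", indicator=True)

# Keep the rows found only in df1, then drop the helper column.
only_in_df1 = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(only_in_df1)
```

From here, `only_in_df1.to_csv(...)` would write the result out as in the answer above.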

User · Answer

My way of doing this involves adding a new column that is unique to one dataframe and using this to choose whether to keep an entry:

    df2['Empt'] = 1
    df1 = pd.merge(df1, df2, on=['field_x', 'field_y'], how='outer')
    df1['Empt'] = df1['Empt'].fillna(0)

This makes it so every entry in df1 has a code: 0 if it is unique to df1, 1 if it is in both DataFrames. You then use this to restrict to what you want:

    answer = df1[df1['Empt'] == 0]

User · Answer

The currently selected solution produces incorrect results. To correctly solve this problem, we can perform a left-join from df1 to df2, making sure to first get just the unique rows for df2.

First, we need to modify the original DataFrame to add the row with data [3, 10].

    df1 = pd.DataFrame(data={'col1': [1, 2, 3, 4, 5, 3],
                             'col2': [10, 11, 12, 13, 14, 10]})
    df2 = pd.DataFrame(data={'col1': [1, 2, 3],
                             'col2': [10, 11, 12]})

    df1

       col1  col2
    0     1    10
    1     2    11
    2     3    12
    3     4    13
    4     5    14
    5     3    10

    df2

       col1  col2
    0     1    10
    1     2    11
    2     3    12

Perform a left-join, eliminating duplicates in df2 so that each row of df1 joins with exactly 1 row of df2. Use the parameter indicator to return an extra column indicating which table the row was from.

    df_all = df1.merge(df2.drop_duplicates(), on=['col1', 'col2'],
                       how='left', indicator=True)
    df_all

       col1  col2     _merge
    0     1    10       both
    1     2    11       both
    2     3    12       both
    3     4    13  left_only
    4     5    14  left_only
    5     3    10  left_only

Create a boolean condition:

    df_all['_merge'] == 'left_only'

    0    False
    1    False
    2    False
    3     True
    4     True
    5     True
    Name: _merge, dtype: bool

Why other solutions are wrong

A few solutions make the same mistake: they only check that each value is independently in each column, not together in the same row. Adding the last row, which is unique but has values that each appear in the corresponding column of df2, exposes the mistake:

    common = df1.merge(df2, on=['col1', 'col2'])
    (~df1.col1.isin(common.col1)) & (~df1.col2.isin(common.col2))

    0    False
    1    False
    2    False
    3     True
    4     True
    5    False
    dtype: bool

This solution gets the same wrong result:

    ~df1.isin(df2.to_dict('list')).all(axis=1)
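If you want the result to keep df1's original index labels, the left-join can be wrapped in a small helper. This is only a sketch: `rows_not_in` is a made-up name, and it relies on a left merge against de-duplicated keys returning exactly one output row per row of the left frame, in the same order.

```python
import pandas as pd


def rows_not_in(left, right, on=None):
    """Hypothetical helper: rows of `left` whose `on` values do not appear in `right`."""
    on = list(left.columns) if on is None else list(on)
    # One output row per row of `left`, in the same order, because the
    # right-hand keys are de-duplicated before the left join.
    flags = left[on].merge(right[on].drop_duplicates(), on=on,
                           how="left", indicator=True)["_merge"]
    # Apply the mask positionally so the original index of `left` survives.
    return left[(flags == "left_only").to_numpy()]


df1 = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 3],
                    "col2": [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 11, 12]})

result = rows_not_in(df1, df2)
print(result)
```

Because the mask is positional, this should also work when df1 has a non-default index.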

User · Answer

Assuming that the indexes are consistent in the dataframes (not taking into account the actual column values):

    df1[~df1.index.isin(df2.index)]

User · Answer

Here is another way of solving this:

    df1[~df1.index.isin(df1.merge(df2, how='inner', on=['col1', 'col2']).index)]

Or:

    df1.loc[df1.index.difference(df1.merge(df2, how='inner', on=['col1', 'col2']).index)]
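One caveat worth checking: `merge` returns a fresh 0..n-1 index, so the index of the merged frame only lines up with df1's labels when the common rows happen to sit at the top of df1. The sketch below uses a df2 chosen for illustration (not from the question) to show the difference, and carries the original labels through the merge with `reset_index()`.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3, 4, 5], "col2": [10, 11, 12, 13, 14]})
# Illustrative df2: the common rows are the LAST three rows of df1.
df2 = pd.DataFrame({"col1": [3, 4, 5], "col2": [12, 13, 14]})

# Index-based filter from the answer: the merged frame is indexed 0, 1, 2,
# so the first three rows of df1 are removed instead of the last three.
naive = df1[~df1.index.isin(
    df1.merge(df2, how="inner", on=["col1", "col2"]).index)]

# reset_index() stores df1's labels in an "index" column that survives the merge.
common_labels = df1.reset_index().merge(
    df2.drop_duplicates(), how="inner", on=["col1", "col2"])["index"]
result = df1[~df1.index.isin(common_labels)]

print(naive)
print(result)
```

This assumes df1's index is unnamed (so `reset_index()` produces a column called `index`) and that its labels are unique.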

User · Answer

As already hinted at, isin requires columns and indices to be the same for a match. If the match should only be on row contents, one way to get the mask for filtering the rows present is to convert the rows to a (Multi)Index:

    In [77]: df1 = pandas.DataFrame(data={'col1': [1, 2, 3, 4, 5, 3], 'col2': [10, 11, 12, 13, 14, 10]})
    In [78]: df2 = pandas.DataFrame(data={'col1': [1, 3, 4], 'col2': [10, 12, 13]})
    In [79]: df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)]
    Out[79]:
       col1  col2
    1     2    11
    4     5    14
    5     3    10

If the index should be taken into account, set_index has the keyword argument append to append columns to the existing index. If the columns do not line up, list(df.columns) can be replaced with column specifications to align the data.

    pandas.MultiIndex.from_tuples(df<N>.to_records(index=False).tolist())

could alternatively be used to create the indices, though I doubt this is more efficient.
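A further way to build the same row-content index is `pandas.MultiIndex.from_frame`, which exists in pandas 0.24 and later. The sketch below is an alternative spelling of the mask above, not the original author's code, and it assumes both frames have the same columns in the same order.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 3],
                    "col2": [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame({"col1": [1, 3, 4], "col2": [10, 12, 13]})

# Each row becomes one tuple-like entry of a MultiIndex, so membership
# is tested on whole rows rather than column by column.
rows1 = pd.MultiIndex.from_frame(df1)
rows2 = pd.MultiIndex.from_frame(df2)

mask = rows1.isin(rows2)   # boolean array, one entry per row of df1
result = df1[~mask]
print(result)
```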

User · Answer

Suppose you have two dataframes, df_1 and df_2, having multiple fields (column names), and you want to find only those entries in df_1 that are not in df_2 on the basis of some fields (e.g. field_x, field_y). Follow these steps:

Step 1. Add a column key1 to df_1 and a column key2 to df_2.

Step 2. Merge the dataframes as shown below. field_x and field_y are our desired columns.

Step 3. Select only those rows from df_1 where key1 is not equal to key2.

Step 4. Drop key1 and key2.

This method will solve your problem and works fast even with big data sets. I have tried it for dataframes with more than 1,000,000 rows.

    df_1['key1'] = 1
    df_2['key2'] = 1
    df_1 = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how='left')
    df_1 = df_1[~(df_1.key2 == df_1.key1)]
    df_1 = df_1.drop(['key1', 'key2'], axis=1)
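The steps above can be sketched on concrete data as follows. This is a variation rather than the exact code of the answer: the sample frames and the `other` column are made up, only `key2` is added (via `assign`, so the inputs are not modified), and the test `key2` is NaN stands in for "key1 is not equal to key2", since unmatched rows get NaN in `key2` after the left merge.

```python
import pandas as pd

# Illustrative data; "other" is an extra column that is not used for matching.
df_1 = pd.DataFrame({"field_x": [1, 2, 3, 4, 5],
                     "field_y": [10, 11, 12, 13, 14],
                     "other": list("abcde")})
df_2 = pd.DataFrame({"field_x": [1, 2, 3], "field_y": [10, 11, 12]})

# Mark every distinct key combination of df_2 with key2 = 1.
marked = df_2[["field_x", "field_y"]].drop_duplicates().assign(key2=1)

# After the left merge, rows of df_1 without a match have NaN in key2.
merged = df_1.merge(marked, on=["field_x", "field_y"], how="left")
result = merged[merged["key2"].isna()].drop(columns="key2")
print(result)
```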

User · Answer

You can also concat df1 and df2:

    x = pd.concat([df1, df2])

and then remove all duplicates:

    y = x.drop_duplicates(keep=False, inplace=False)
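Note that `keep=False` drops every row that occurs more than once, including rows that are repeated inside df1 itself, and it would keep rows that exist only in df2. The sketch below uses made-up data to show the first effect, together with one possible workaround: de-duplicate df1 and concatenate df2 twice, so that every df2 row is guaranteed to count as a duplicate.

```python
import pandas as pd

# Illustrative data: the row (4, 13) appears twice in df1 and never in df2.
df1 = pd.DataFrame({"col1": [1, 2, 4, 4], "col2": [10, 11, 13, 13]})
df2 = pd.DataFrame({"col1": [1, 2], "col2": [10, 11]})

# Plain concat + drop_duplicates(keep=False): (4, 13) is lost as well,
# because it is duplicated within df1.
y = pd.concat([df1, df2]).drop_duplicates(keep=False)

# Workaround sketch: de-duplicate df1 first and add df2 twice.
y_fixed = pd.concat([df1.drop_duplicates(), df2, df2]).drop_duplicates(keep=False)

print(y)
print(y_fixed)
```

The workaround returns each surviving row of df1 once, so it is only suitable when repeated rows in df1 do not need to be preserved.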

User · Answer

One method would be to store the result of an inner merge from both dfs, then we can simply select the rows when one column's values are not in this common:

    In [119]:
    common = df1.merge(df2, on=['col1', 'col2'])
    print(common)
    df1[(~df1.col1.isin(common.col1)) & (~df1.col2.isin(common.col2))]

       col1  col2
    0     1    10
    1     2    11
    2     3    12
    Out[119]:
       col1  col2
    3     4    13
    4     5    14

EDIT

Another method, as you've found, is to use isin, which will produce NaN rows which you can drop:

    In [138]:
    df1[~df1.isin(df2)].dropna()

    Out[138]:
       col1  col2
    3     4    13
    4     5    14

However, if df2 does not start rows in the same manner then this won't work:

    df2 = pd.DataFrame(data={'col1': [2, 3, 4], 'col2': [11, 12, 13]})

will produce the entire df:

    In [140]:
    df1[~df1.isin(df2)].dropna()

    Out[140]:
       col1  col2
    0     1    10
    1     2    11
    2     3    12
    3     4    13
    4     5    14

User · Answer

A bit late, but it might be worth checking the "indicator" parameter of pd.merge.

See this other question for an example: Compare pandas DataFrames and return rows that are missing from the first one

User · Answer

You can do it using the isin(dict) method:

    In [74]: df1[~df1.isin(df2.to_dict('list')).all(axis=1)]
    Out[74]:
       col1  col2
    3     4    13
    4     5    14

Explanation:

    In [75]: df2.to_dict('list')
    Out[75]: {'col1': [1, 2, 3], 'col2': [10, 11, 12]}

    In [76]: df1.isin(df2.to_dict('list'))
    Out[76]:
        col1   col2
    0   True   True
    1   True   True
    2   True   True
    3  False  False
    4  False  False

    In [77]: df1.isin(df2.to_dict('list')).all(axis=1)
    Out[77]:
    0     True
    1     True
    2     True
    3    False
    4    False
    dtype: bool

User · Answer

This is the best way to do it:

    df = df1.drop_duplicates().merge(df2.drop_duplicates(), on=df2.columns.to_list(),
                                     how='left', indicator=True)
    df.loc[df._merge == 'left_only', df.columns != '_merge']

Note that drop_duplicates is used to minimize the comparisons. It would work without it as well. The best way is to compare the row contents themselves, not the index or one or two columns. The same code can be used for other filters such as 'both' and 'right_only' to achieve similar results (the latter needs how='outer' or how='right', since a left join never produces 'right_only' rows). With this syntax the dataframes can have any number of columns and even different indices. Only the columns should occur in both dataframes.

Why is this the best way?

1. index.difference only works for unique index-based comparisons.
2. pandas.concat() coupled with drop_duplicates() is not ideal because it will also get rid of rows which may be only in the dataframe you want to keep and are duplicated for valid reasons.
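As a sketch of the other filters mentioned above: an outer join is what makes all three `_merge` values ('left_only', 'both', 'right_only') possible. The df2 used here is made up for illustration and contains one row that is not in df1.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3, 4, 5], "col2": [10, 11, 12, 13, 14]})
# Illustrative df2: (1, 10) is shared, (6, 15) exists only in df2.
df2 = pd.DataFrame({"col1": [1, 6], "col2": [10, 15]})

cols = ["col1", "col2"]
# how="outer" keeps unmatched rows from both sides.
merged = df1.drop_duplicates().merge(df2.drop_duplicates(), on=cols,
                                     how="outer", indicator=True)

only_df1 = merged.loc[merged["_merge"] == "left_only", cols]
in_both = merged.loc[merged["_merge"] == "both", cols]
only_df2 = merged.loc[merged["_merge"] == "right_only", cols]

print(only_df1, in_both, only_df2, sep="\n\n")
```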

User · Answer

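How about this:

    import numpy as np
    import pandas

    df1 = pandas.DataFrame(data={'col1': [1, 2, 3, 4, 5],
                                 'col2': [10, 11, 12, 13, 14]})
    df2 = pandas.DataFrame(data={'col1': [1, 2, 3],
                                 'col2': [10, 11, 12]})

    # Set of df2's rows as tuples, for fast membership tests
    records_df2 = set([tuple(row) for row in df2.values])
    # True for each row of df1 that also occurs in df2
    in_df2_mask = np.array([tuple(row) in records_df2 for row in df1.values])
    result = df1[~in_df2_mask]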
