Method 1 of the accepted answer will not work for data frames that contain NaNs, because pd.np.nan != pd.np.nan. I am not sure this is the best way, but the problem can be avoided with:
df1[~df1.astype(str).apply(tuple, 1).isin(df2.astype(str).apply(tuple, 1))]
It's slower, because it has to cast the data to strings, but thanks to this cast every NaN becomes the string 'nan', and 'nan' == 'nan'.
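A minimal sketch of why the cast helps, using only NumPy and Python's built-in str; the variable names are made up for illustration. It assumes that astype(str) renders NaN the same way str does, which is what the answer relies on.

```python
import numpy as np

# NaN is never equal to itself, so equality-based matching cannot pair up NaN cells.
nan_equals_itself = np.nan == np.nan

# The string form of NaN is an ordinary string, which compares equal as usual.
nan_text = str(np.nan)
text_equals_itself = nan_text == str(np.nan)

print(nan_equals_itself, nan_text, text_equals_itself)
```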
Let's go through the code. First we cast the values to strings and apply the tuple function to each row.
df1.astype(str).apply(tuple, 1)
df2.astype(str).apply(tuple, 1)
This gives us a pd.Series object whose elements are tuples. Each tuple contains a whole row from df1/df2.
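To see what this intermediate result looks like, here is a small sketch on a hypothetical two-row frame (the name example and its columns are not from the question):

```python
import pandas as pd

# Hypothetical frame, used only to show the shape of the intermediate result.
example = pd.DataFrame({"id": [1, 2], "label": ["x", "y"]})

# Cast every cell to a string, then collapse each row into one tuple.
row_tuples = example.astype(str).apply(tuple, axis=1)

print(type(row_tuples).__name__)
print(row_tuples.tolist())
```

Each element of row_tuples is one row of the frame, with every value already converted to a string.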
Then we call the isin method on the tuples from df1 to check whether each tuple "is in" the tuples from df2.
The result is a pd.Series of bool values: True if the tuple from df1 is in df2. In the end, we negate the result with the ~ operator and use it as a filter on df1. Long story short, we get only those rows from df1 that are not in df2.
To make it more readable, we may write it as:
df1_str_tuples = df1.astype(str).apply(tuple, 1)
df2_str_tuples = df2.astype(str).apply(tuple, 1)
df1_values_in_df2_filter = df1_str_tuples.isin(df2_str_tuples)
df1_values_not_in_df2 = df1[~df1_values_in_df2_filter]
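For completeness, a self-contained run of the same steps on made-up data that includes a NaN row. The contents of df1 and df2 here are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up sample data: the middle row of df1 contains a NaN and also appears in df2.
df1 = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", "z"]})
df2 = pd.DataFrame({"a": [np.nan], "b": ["y"]})

# Turn every row into a tuple of strings so that rows can be compared as whole values.
df1_str_tuples = df1.astype(str).apply(tuple, axis=1)
df2_str_tuples = df2.astype(str).apply(tuple, axis=1)

# Keep only the rows of df1 whose tuple does not occur in df2.
df1_values_not_in_df2 = df1[~df1_str_tuples.isin(df2_str_tuples)]
print(df1_values_not_in_df2)
```

Because the comparison is done on strings, both frames should have the same column order and the same dtypes: an integer 1 becomes '1' while a float 1.0 becomes '1.0', and those two strings do not match.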