Finding common rows (intersection) in two Pandas dataframes

Question

Assume I have two dataframes of this format (call them df1 and df2):

    +------------------------+------------------------+--------+
    |        user_id         |      business_id       | rating |
    +------------------------+------------------------+--------+
    | rLtl8ZkDX5vH5nAx9C3q5Q | eIxSLxzIlfExI6vgAbn2JA |      4 |
    | C6IOtaaYdLIT5fWd7ZYIuA | eIxSLxzIlfExI6vgAbn2JA |      5 |
    | mlBC3pN9GXlUUfQi1qBBZA | KoIRdcIfh3XWxiCeV1BDmA |      3 |
    +------------------------+------------------------+--------+

I'm looking to get a dataframe of all the rows that have a common user_id in df1 and df2 (i.e., if a user_id is in both df1 and df2, include the two rows in the output dataframe).

I can think of many ways to approach this, but they all strike me as clunky. For example, we could find all the unique user_ids in each dataframe, create a set of each, find their intersection, filter the two dataframes with the resulting set, and concatenate the two filtered dataframes.

Maybe that's the best approach, but I know Pandas is clever. Is there a simpler way to do this? I've looked at merge, but I don't think that's what I need.

User · Accepted Answer

My understanding is that this question is better answered over in this post.

But briefly, the answer to the OP with this method is simply:

s1 = pd.merge(df1, df2, how='inner', on=['user_id'])

This gives s1 with 5 columns: user_id plus the other two columns from each of df1 and df2.
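A minimal sketch of what this merge returns, using made-up sample data (the ids and ratings below are assumptions for illustration, not the asker's data). With pandas' default suffixes, the overlapping columns come back as business_id_x, rating_x, business_id_y and rating_y. The result pairs matching rows side by side rather than stacking them, and a user with several rows in both frames yields every pairing.

```python
import pandas as pd

# Hypothetical sample data in the question's three-column layout.
df1 = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "business_id": ["b1", "b1", "b2"],
    "rating": [4, 5, 3],
})
df2 = pd.DataFrame({
    "user_id": ["u2", "u3", "u4"],
    "business_id": ["b3", "b4", "b4"],
    "rating": [2, 1, 5],
})

# Inner join: keep only user_ids present in both frames.
s1 = pd.merge(df1, df2, how="inner", on=["user_id"])

print(s1.columns.tolist())
print(s1)
```

If the goal is one output row per original row, as the question asks, the merged frame still has to be reshaped afterwards.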

User · Answer

If I understand you correctly, you can use a combination of Series.isin() and DataFrame.append():

    In [80]: df1
    Out[80]:
       rating  user_id
    0       2  0x21abL
    1       1  0x21abL
    2       1   0xdafL
    3       0  0x21abL
    4       4  0x1d14L
    5       2  0x21abL
    6       1  0x21abL
    7       0   0xdafL
    8       4  0x1d14L
    9       1  0x21abL

    In [81]: df2
    Out[81]:
       rating      user_id
    0       2      0x1d14L
    1       1    0xdbdcad7
    2       1      0x21abL
    3       3      0x21abL
    4       3      0x21abL
    5       1  0x5734a81e2
    6       2      0x1d14L
    7       0       0xdafL
    8       0      0x1d14L
    9       4  0x5734a81e2

    In [82]: ind = df2.user_id.isin(df1.user_id) & df1.user_id.isin(df2.user_id)

    In [83]: ind
    Out[83]:
    0     True
    1    False
    2     True
    3     True
    4     True
    5    False
    6     True
    7     True
    8     True
    9    False
    Name: user_id, dtype: bool

    In [84]: df1[ind].append(df2[ind])
    Out[84]:
       rating  user_id
    0       2  0x21abL
    2       1   0xdafL
    3       0  0x21abL
    4       4  0x1d14L
    6       1  0x21abL
    7       0   0xdafL
    8       4  0x1d14L
    0       2  0x1d14L
    2       1  0x21abL
    3       3  0x21abL
    4       3  0x21abL
    6       2  0x1d14L
    7       0   0xdafL
    8       0  0x1d14L

This is essentially the algorithm you described as "clunky", using idiomatic pandas methods. Note the duplicate row indices. Also, note that this won't give you the expected output if df1 and df2 have no overlapping row indices, i.e., if

    In [93]: df1.index & df2.index
    Out[93]: Int64Index([], dtype='int64')

In fact, it won't give the expected output if their row indices are not equal.
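A sketch of the same idea for current pandas, under two assumptions: the column is named user_id, and the sample frames below are made up for illustration. DataFrame.append was removed in pandas 2.0, so pd.concat does the stacking here. Each frame also gets its own boolean mask, which avoids the dependence on matching row indices described above.

```python
import pandas as pd

# Hypothetical frames with deliberately different row indices and lengths.
df1 = pd.DataFrame({"user_id": ["a", "a", "b", "c"], "rating": [2, 1, 1, 4]})
df2 = pd.DataFrame(
    {"user_id": ["c", "d", "a"], "rating": [2, 1, 3]},
    index=[10, 11, 12],
)

# One mask per frame: is this row's user_id present in the other frame?
mask1 = df1["user_id"].isin(df2["user_id"])
mask2 = df2["user_id"].isin(df1["user_id"])

# Stack the filtered rows; ignore_index=True avoids duplicate row labels.
common = pd.concat([df1[mask1], df2[mask2]], ignore_index=True)
print(common)
```

Because each mask is built from and applied to the same frame, the two frames can differ in length and index without changing the result.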

User · Answer

In SQL, this problem could be solved by several methods:

    select * from df1 where exists (select * from df2 where df2.user_id = df1.user_id)
    union all
    select * from df2 where exists (select * from df1 where df1.user_id = df2.user_id)

or join and then unpivot (possible in SQL Server):

    select
        df1.user_id,
        c.rating
    from df1
        inner join df2 on df2.user_id = df1.user_id
        outer apply (
            select df1.rating union all
            select df2.rating
        ) as c

The second one could be written in pandas with something like:

    >>> import pandas as pd
    >>> df1 = pd.DataFrame({"user_id": [1, 2, 3], "rating": [10, 15, 20]})
    >>> df2 = pd.DataFrame({"user_id": [3, 4, 5], "rating": [30, 35, 40]})
    >>>
    >>> # join on user_id, keeping each frame's rating under its own suffix
    >>> df = pd.merge(df1, df2, on='user_id', suffixes=('_1', '_2'))
    >>> # unpivot: one frame per rating column, renamed back to 'rating'
    >>> df3 = df[['user_id', 'rating_1']].rename(columns={'rating_1': 'rating'})
    >>> df4 = df[['user_id', 'rating_2']].rename(columns={'rating_2': 'rating'})
    >>> pd.concat([df3, df4], axis=0)
       user_id  rating
    0        3      20
    0        3      30
