Drop all duplicate rows across multiple columns in Python Pandas

Question

The pandas drop duplicates function is great for  uniquifying  a dataframe  However  one of the keyword arguments to pass is take last True or take last False  while I would like to drop all rows which are duplicates across a subset of columns  Is this possible       A   B   C 0   foo 0   A 1   foo 1   A 2   foo 1   B 3   bar 1   A   As an example  I would like to drop rows which match on columns A and C so this should drop rows 0 and 1

User · Answer

If you want result to be stored in another dataset:

df.drop_duplicates(keep=False)

or

df.drop_duplicates(keep=False, inplace=False)

If same dataset needs to be updated:

df.drop_duplicates(keep=False, inplace=True)

Above examples will remove all duplicates and keep one, similar to DISTINCT * in SQL

User · Answer

This is much easier in pandas now with drop duplicates and the keep parameter   import pandas as pd df   pd DataFrame   A    foo    foo    foo    bar     B   0 1 1 1    C    A   A   B   A     df drop duplicates subset   A    C    keep False

User · Answer

Try these various things  df   pd DataFrame   A    foo    foo    foo    bar   foo     B   0 1 1 1 1    C    A   A   B   A   A       gt  gt  gt df drop duplicates   A    keep  first     or    gt  gt  gt df drop duplicates  keep  first     or   gt  gt  gt df drop duplicates  keep  last

User · Answer

use groupby and filter  import pandas as pd df   pd DataFrame   A    foo    foo    foo    bar     B   0 1 1 1    C    A   A   B   A     df groupby   A    C    filter lambda df df shape 0     1

User · Answer

Just want to add to Ben s answer on drop duplicates   keep       first        last     False   default    first      first   Drop duplicates except for the first occurrence  last   Drop duplicates except for the last occurrence  False   Drop all duplicates    So setting keep to False will give you desired answer      DataFrame drop duplicates  args    kwargs  Return DataFrame with   duplicate rows removed  optionally only considering certain columns      Parameters     subset   column label or sequence of labels  optional   Only consider certain columns for identifying duplicates  by default   use all of the columns keep       first        last     False   default      first    first   Drop duplicates except for the first occurrence  last     Drop duplicates except for the last occurrence  False   Drop all   duplicates  take last   deprecated inplace   boolean  default False   Whether to drop duplicates in place or to return a copy cols   kwargs   only argument of subset  deprecated  Returns   deduplicated     DataFrame

User · Answer

Actually  drop rows 0 and 1 only requires  any observations containing matched A and C is kept     In  335    df  AC   df A df C In  336    print df drop duplicates  C   take last True   this dataset is a special case  in general  one may need to first drop duplicates by  c  and then by  a        A  B  C    AC 2  foo  1  B  fooB 3  bar  1  A  barA   2 rows x 4 columns    But I suspect what you really want is this  one observation containing matched A and C is kept     In  337    print df drop duplicates  AC        A  B  C    AC 0  foo  0  A  fooA 2  foo  1  B  fooB 3  bar  1  A  barA   3 rows x 4 columns    Edit   Now it is much clearer  therefore   In  352   DG df groupby   A    C       print pd concat  DG get group item  for item  value in DG groups items   if len value   1        A  B  C 2  foo  1  B 3  bar  1  A   2 rows x 3 columns

[python] Drop all duplicate rows across multiple columns in Python Pandas

Examples related to python

Examples related to pandas

Examples related to duplicates

Examples related to drop-duplicates