How do I get a list of all the duplicate items using pandas in Python?

Question

I have a list of items that likely has some export issues. I would like to get a list of the duplicate items so I can manually compare them. When I try to use the pandas duplicated method, it only returns the first duplicate. Is there a way to get all of the duplicates and not just the first one?

A small subsection of my dataset looks like this:

    ID     ENROLLMENT DATE  TRAINER MANAGING        TRAINER OPERATOR        FIRST VISIT DATE
    1536D  12-Feb-12        06DA1B3-Lebanon NH                              15-Feb-12
    F15D   18-May-12        06405B2-Lebanon NH                              25-Jul-12
    8096   8-Aug-12         0643D38-Hanover NH      0643D38-Hanover NH      25-Jun-12
    A036   1-Apr-12         06CB8CF-Hanover NH      06CB8CF-Hanover NH      9-Aug-12
    8944   19-Feb-12        06D26AD-Hanover NH                              4-Feb-12
    1004E  8-Jun-12         06388B2-Lebanon NH                              24-Dec-11
    11795  3-Jul-12         0649597-White River VT  0649597-White River VT  30-Mar-12
    30D7   11-Nov-12        06D95A3-Hanover NH      06D95A3-Hanover NH      30-Nov-11
    3AE2   21-Feb-12        06405B2-Lebanon NH                              26-Oct-12
    B0FE   17-Feb-12        06D1B9D-Hartland VT                             16-Feb-12
    127A1  11-Dec-11        064456E-Hanover NH      064456E-Hanover NH      11-Nov-12
    161FF  20-Feb-12        0643D38-Hanover NH      0643D38-Hanover NH      3-Jul-12
    A036   30-Nov-11        063B208-Randolph VT     063B208-Randolph VT
    475B   25-Sep-12        06D26AD-Hanover NH                              5-Nov-12
    151A3  7-Mar-12         06388B2-Lebanon NH                              16-Nov-12
    CA62   3-Jan-12
    D31B   18-Dec-11        06405B2-Lebanon NH                              9-Jan-12
    20F5   8-Jul-12         0669C50-Randolph VT                             3-Feb-12
    8096   19-Dec-11        0649597-White River VT  0649597-White River VT  9-Apr-12
    14E48  1-Aug-12         06D3206-Hanover NH
    177F8  20-Aug-12        063B208-Randolph VT     063B208-Randolph VT     5-May-12
    553E   11-Oct-12        06D95A3-Hanover NH      06D95A3-Hanover NH      8-Mar-12
    12D5F  18-Jul-12        0649597-White River VT  0649597-White River VT  2-Nov-12
    C6DC   13-Apr-12        06388B2-Lebanon NH
    11795  27-Feb-12        0643D38-Hanover NH      0643D38-Hanover NH      19-Jun-12
    17B43  11-Aug-12                                                        22-Oct-12
    A036   11-Aug-12        06D3206-Hanover NH                              19-Jun-12

My code looks like this currently:

    df_bigdata_duplicates = df_bigdata[df_bigdata.duplicated(cols='ID')]

There are a couple of duplicate items. But when I use the above code, I only get the first item. In the API reference, I see how I can get the last item, but I would like to have all of them so I can visually inspect them to see why I am getting the discrepancy. So, in this example I would like to get all three A036 entries and both 11795 entries and any other duplicated entries, instead of just the first one. Any help is most appreciated.

User · Answer

    df[df.duplicated(['ID']) == True].sort_values('ID')

User · Answer

Using an element-wise logical or and setting the take_last argument of the pandas duplicated method to both True and False, you can obtain a set from your dataframe that includes all of the duplicates.

    df_bigdata_duplicates = df_bigdata[
        df_bigdata.duplicated(cols='ID', take_last=False)
        | df_bigdata.duplicated(cols='ID', take_last=True)
    ]
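Newer pandas releases use `subset` in place of `cols` and `keep` in place of `take_last`. A minimal sketch of the same OR-of-two-masks idea with the newer argument names follows; the frame and its `VALUE` column are made up for illustration and are not the question's data.

```python
import pandas as pd

# Illustrative data only; the IDs echo the question but the frame is made up.
df_bigdata = pd.DataFrame({
    "ID": ["A036", "8096", "A036", "1536D", "8096", "A036"],
    "VALUE": [1, 2, 3, 4, 5, 6],
})

# keep="first" flags every occurrence except the first;
# keep="last" flags every occurrence except the last.
not_first = df_bigdata.duplicated(subset="ID", keep="first")
not_last = df_bigdata.duplicated(subset="ID", keep="last")

# The element-wise OR therefore flags every row whose ID appears more than once.
df_bigdata_duplicates = df_bigdata[not_first | not_last]
print(df_bigdata_duplicates)
```

The combined mask should match what `duplicated(subset="ID", keep=False)` returns in a single call.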

User · Answer

    df[df['ID'].duplicated() == True]

This worked for me.

User · Answer

For my database, duplicated(keep=False) did not work until the column was sorted.

    data.sort_values(by=['Order ID'], inplace=True)
    df = data[data['Order ID'].duplicated(keep=False)]
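As far as I know, `duplicated(keep=False)` does not depend on row order, so sorting should only change how the result is displayed. A small check under that assumption, using a made-up frame whose `Order ID` column mirrors the name in the answer:

```python
import pandas as pd

# Made-up data for illustration; "Order ID" mirrors the column name in the answer.
data = pd.DataFrame({
    "Order ID": [30, 10, 20, 10, 30, 40],
    "Item": ["a", "b", "c", "d", "e", "f"],
})

# Mask computed on the unsorted frame.
dups_unsorted = data[data["Order ID"].duplicated(keep=False)]

# Mask computed after sorting (returns a new frame rather than sorting in place).
data_sorted = data.sort_values(by=["Order ID"])
dups_sorted = data_sorted[data_sorted["Order ID"].duplicated(keep=False)]

# Both selections contain the same rows; only their order differs.
print(sorted(dups_unsorted.index) == sorted(dups_sorted.index))
```

Sorting is still useful here because it places rows that share an `Order ID` next to each other for visual comparison.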

User · Answer

Method #1: print all rows where the ID is one of the IDs in duplicated:

    >>> import pandas as pd
    >>> df = pd.read_csv("dup.csv")
    >>> ids = df["ID"]
    >>> df[ids.isin(ids[ids.duplicated()])].sort("ID")
           ID ENROLLMENT DATE        TRAINER MANAGING        TRAINER OPERATOR FIRST VISIT DATE
    24  11795       27-Feb-12      0643D38-Hanover NH      0643D38-Hanover NH        19-Jun-12
    6   11795        3-Jul-12  0649597-White River VT  0649597-White River VT        30-Mar-12
    18   8096       19-Dec-11  0649597-White River VT  0649597-White River VT         9-Apr-12
    2    8096        8-Aug-12      0643D38-Hanover NH      0643D38-Hanover NH        25-Jun-12
    12   A036       30-Nov-11     063B208-Randolph VT     063B208-Randolph VT              NaN
    3    A036        1-Apr-12      06CB8CF-Hanover NH      06CB8CF-Hanover NH         9-Aug-12
    26   A036       11-Aug-12      06D3206-Hanover NH                     NaN        19-Jun-12

but I couldn't think of a nice way to prevent repeating ids so many times. I prefer method #2: groupby on the ID.

    >>> pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
           ID ENROLLMENT DATE        TRAINER MANAGING        TRAINER OPERATOR FIRST VISIT DATE
    6   11795        3-Jul-12  0649597-White River VT  0649597-White River VT        30-Mar-12
    24  11795       27-Feb-12      0643D38-Hanover NH      0643D38-Hanover NH        19-Jun-12
    2    8096        8-Aug-12      0643D38-Hanover NH      0643D38-Hanover NH        25-Jun-12
    18   8096       19-Dec-11  0649597-White River VT  0649597-White River VT         9-Apr-12
    3    A036        1-Apr-12      06CB8CF-Hanover NH      06CB8CF-Hanover NH         9-Aug-12
    12   A036       30-Nov-11     063B208-Randolph VT     063B208-Randolph VT              NaN
    26   A036       11-Aug-12      06D3206-Hanover NH                     NaN        19-Jun-12
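The groupby idea can also be written with `groupby(...).filter`, followed by `sort_values` since `DataFrame.sort` is no longer available in current pandas. This is only a sketch: the frame below is a made-up stand-in for `dup.csv`, with just two of its columns.

```python
import pandas as pd

# Stand-in for dup.csv; values are illustrative.
df = pd.DataFrame({
    "ID": ["11795", "A036", "8096", "A036", "8096", "1536D", "A036"],
    "ENROLLMENT DATE": ["3-Jul-12", "1-Apr-12", "8-Aug-12", "30-Nov-11",
                        "19-Dec-11", "12-Feb-12", "11-Aug-12"],
})

# Keep only the groups that contain more than one row, then sort so that
# rows sharing an ID sit next to each other.
dups = df.groupby("ID").filter(lambda g: len(g) > 1).sort_values("ID")
print(dups)
```

Compared with the `pd.concat` generator, `filter` avoids building the intermediate list of groups by hand, though both approaches select the same rows.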

User · Answer

As I am unable to comment, I am posting this as a separate answer. To find duplicates on the basis of more than one column, mention every column name as below, and it will return the duplicated rows (every occurrence after the first):

    df[df[['product_uid', 'product_title', 'user']].duplicated() == True]
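Comparing with `== True` keeps only the later occurrences, because `duplicated` defaults to `keep='first'`. If every row of each duplicated group is wanted, a sketch using `subset` and `keep=False` might look like this; the data and the `rating` column are made up, and the key column names follow the answer above.

```python
import pandas as pd

# Made-up data; only the three key column names come from the answer.
df = pd.DataFrame({
    "product_uid": [1, 1, 2, 3, 3],
    "product_title": ["x", "x", "y", "z", "z"],
    "user": ["ann", "ann", "bob", "cat", "dan"],
    "rating": [5, 4, 3, 2, 1],
})

key_cols = ["product_uid", "product_title", "user"]

# A row is flagged only when all three key columns match another row.
all_dups = df[df.duplicated(subset=key_cols, keep=False)]
print(all_dups)
```

Rows 3 and 4 share a product but differ in `user`, so they are not treated as duplicates of each other.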

User · Answer

This may not be a solution to the question, but to illustrate examples:

    import pandas as pd

    df = pd.DataFrame({
        'A': [1, 1, 3, 4],
        'B': [2, 2, 5, 6],
        'C': [3, 4, 7, 6],
    })

    print(df)
    df.duplicated(keep=False)
    df.duplicated(['A', 'B'], keep=False)

The outputs:

       A  B  C
    0  1  2  3
    1  1  2  4
    2  3  5  7
    3  4  6  6

    0    False
    1    False
    2    False
    3    False
    dtype: bool

    0     True
    1     True
    2    False
    3    False
    dtype: bool

User · Answer

    df[df.duplicated(['ID'], keep=False)]

It'll return all duplicated rows back to you.

According to the documentation:

    keep : {'first', 'last', False}, default 'first'

    first : Mark duplicates as True except for the first occurrence.
    last  : Mark duplicates as True except for the last occurrence.
    False : Mark all duplicates as True.
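To make the three `keep` settings concrete, here is a small sketch on a made-up `ID` column showing which rows each setting flags:

```python
import pandas as pd

# Made-up IDs: "A036" appears at positions 0, 2 and 4.
df = pd.DataFrame({"ID": ["A036", "8096", "A036", "1536D", "A036"]})

first = df.duplicated(["ID"], keep="first")   # flags rows 2 and 4
last = df.duplicated(["ID"], keep="last")     # flags rows 0 and 2
every = df.duplicated(["ID"], keep=False)     # flags rows 0, 2 and 4

# Side-by-side view of the three masks.
print(pd.DataFrame({"first": first, "last": last, "False": every}))
```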

User · Answer

sort("ID") does not seem to be working now; it appears to be deprecated as per the sort doc, so use sort_values("ID") instead to sort after the duplicate filter, as follows:

    df[df.ID.duplicated(keep=False)].sort_values("ID")

User · Answer

With Pandas version 0.17, you can set 'keep = False' in the duplicated function to get all the duplicate items.

    In [1]: import pandas as pd

    In [2]: df = pd.DataFrame(['a', 'b', 'c', 'd', 'a', 'b'])

    In [3]: df
    Out[3]:
       0
    0  a
    1  b
    2  c
    3  d
    4  a
    5  b

    In [4]: df[df.duplicated(keep=False)]
    Out[4]:
       0
    0  a
    1  b
    4  a
    5  b
