How to count duplicate rows in a pandas dataframe?

I am trying to count the duplicates of each type of row in my dataframe. For example, say that I have a dataframe in pandas as follows:

df = pd.DataFrame({'one': pd.Series([1., 1, 1]),
                   'two': pd.Series([1., 2., 1])})

I get a df that looks like this:

    one two
0   1   1
1   1   2
2   1   1

I imagine the first step is to find all the unique rows, which I do with:

df.drop_duplicates()

This gives me the following df:

    one two
0   1   1
1   1   2

Now I want to take each row from the above df ([1 1] and [1 2]) and get a count of how many times each occurs in the initial df. My result would look something like this:

Row     Count
[1 1]     2
[1 2]     1

How should I go about doing this last step?

Edit:

Here's a larger example to make it more clear:

df = pd.DataFrame({'one': pd.Series([True, True, True, False]),
                   'two': pd.Series([True, False, False, True]),
                   'three': pd.Series([True, False, False, False])})

gives me:

    one three   two
0   True    True    True
1   True    False   False
2   True    False   False
3   False   False   True

I want a result that tells me:

       Row           Count
[True True True]       1
[True False False]     2
[False False True]     1

Answers:


import pandas as pd

df = pd.DataFrame({'one': pd.Series([1., 1, 1, 3]),
                   'two': pd.Series([1., 2., 1, 3]),
                   'three': pd.Series([1., 2., 1, 2])})

# build a space-separated string key per row, then count the distinct keys
df['str_list'] = df.apply(lambda row: ' '.join([str(int(val)) for val in row]), axis=1)
df1 = df['str_list'].value_counts().to_frame('Count')

Produces:

>>> df1
       Count
1 1 1      2
3 2 3      1
1 2 2      1

If the index values must be a list, you could take the above code a step further with:

df1.index = df1.index.str.split()

Produces:

           Count
[1, 1, 1]      2
[3, 2, 3]      1
[1, 2, 2]      1
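One caveat on the string key: it assumes every value survives the str(int(val)) round trip (floats are truncated, and a NaN raises a ValueError). On pandas 1.1 or newer, the same per-row counts can be read off without building a key at all; a minimal sketch:

import pandas as pd

df = pd.DataFrame({'one': [1., 1, 1, 3],
                   'two': [1., 2., 1, 3],
                   'three': [1., 2., 1, 2]})

# DataFrame.value_counts (pandas >= 1.1) counts each distinct row;
# the result is a Series indexed by the row values
print(df.value_counts())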

I ran into this problem today and wanted to include NaNs, so I replace them temporarily with "" (empty string). Please comment if there is something you do not understand :). This solution assumes that "" is not a relevant value for you. It should also work with numerical data (I have tested it successfully, though not extensively), since pandas will infer the data type again after replacing "" with np.nan.

import numpy as np
import pandas as pd

# create test data
df = pd.DataFrame({'test':  ['foo', 'bar', None, None, 'foo'],
                   'test2': ['bar', None, None, None, 'bar'],
                   'test3': [None, 'foo', 'bar', None, None]})

# fill null values with '' so they are not lost during groupby
# group by all columns and calculate the length of the resulting groups
# rename the series obtained with groupby to "group_count"
# reset the index to get a DataFrame
# replace '' with np.nan (this reverts our first operation)
# sort the DataFrame by "group_count", descending
df = (df.fillna('')
      .groupby(df.columns.tolist()).apply(len)
      .rename('group_count')
      .reset_index()
      .replace('', np.nan)
      .sort_values(by=['group_count'], ascending=False))
df
df
  test test2 test3  group_count
3  foo   bar   NaN            2
0  NaN   NaN   NaN            1
1  NaN   NaN   bar            1
2  bar   NaN   foo            1
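If you are on pandas 1.1 or newer, the empty-string round trip can be skipped entirely: groupby accepts dropna=False, which keeps groups whose keys contain NaN. A sketch of that variant, assuming the same test data:

import pandas as pd

df = pd.DataFrame({'test':  ['foo', 'bar', None, None, 'foo'],
                   'test2': ['bar', None, None, None, 'bar'],
                   'test3': [None, 'foo', 'bar', None, None]})

# dropna=False (pandas >= 1.1) keeps NaN-containing keys during groupby
counts = (df.groupby(df.columns.tolist(), dropna=False)
            .size()
            .rename('group_count')
            .reset_index()
            .sort_values(by='group_count', ascending=False))
print(counts)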

I use:

used_features = [
    "one",
    "two",
    "three"
]

df['is_duplicated'] = df.duplicated(used_features)
df['is_duplicated'].sum()

which gives the count of duplicated rows, and the new column then lets you analyse them. I didn't see this solution suggested here.
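The flag column makes that follow-up analysis a one-liner, since it doubles as a boolean mask; a quick example on the question's first dataframe:

import pandas as pd

df = pd.DataFrame({'one': [1., 1, 1],
                   'two': [1., 2., 1]})
used_features = ['one', 'two']

df['is_duplicated'] = df.duplicated(used_features)
print(df['is_duplicated'].sum())  # 1 duplicated row
print(df[df['is_duplicated']])    # inspect the flagged rows themselves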


If you want to count duplicates in particular column(s):

len(df['one']) - len(df['one'].drop_duplicates())

If you want to count duplicates in the entire dataframe:

len(df) - len(df.drop_duplicates())
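Applied to the question's first example, both counts come out as expected; a quick check:

import pandas as pd

df = pd.DataFrame({'one': [1., 1, 1],
                   'two': [1., 2., 1]})

# 'one' holds three identical values, so two of them are duplicates
print(len(df['one']) - len(df['one'].drop_duplicates()))  # 2

# across the whole frame only [1, 1] repeats, once
print(len(df) - len(df.drop_duplicates()))                # 1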

Or you can simply use DataFrame.duplicated(subset=None, keep='first'):

df.duplicated(subset='one', keep='first').sum()

where

subset : column label or sequence of labels (by default use all of the columns)

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Mark duplicates as True except for the first occurrence.
  • last : Mark duplicates as True except for the last occurrence.
  • False : Mark all duplicates as True.
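The keep value matters for the count you get back; a small sketch on the question's first dataframe:

import pandas as pd

df = pd.DataFrame({'one': [1., 1, 1],
                   'two': [1., 2., 1]})

# keep='first': only the second [1, 1] row is marked -> 1
print(df.duplicated(keep='first').sum())

# keep=False: both [1, 1] rows are marked -> 2
print(df.duplicated(keep=False).sum())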

None of the existing answers quite offers a simple solution that returns "the number of rows that are just duplicates and should be cut out". This is a one-size-fits-all solution that does:

# generate a table of those culprit rows which are duplicated:
dups = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0:'count'})

# sum the final col of that table, and subtract the number of culprits:
dups['count'].sum() - dups.shape[0]
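On the question's first example this comes out to 1, the single extra [1, 1] row; a quick check:

import pandas as pd

df = pd.DataFrame({'one': [1., 1, 1],
                   'two': [1., 2., 1]})

dups = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0: 'count'})
print(dups['count'].sum() - dups.shape[0])  # 1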

The same groupby over all columns gives the per-row counts the question asks for directly:

df.groupby(df.columns.tolist()).size().reset_index().\
    rename(columns={0: 'records'})

   one  two  records
0    1    1        2
1    1    2        1
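On recent pandas (1.1+), the reset_index/rename step can be folded away, since as_index=False makes groupby(...).size() return a DataFrame with a size column directly; a small variant:

import pandas as pd

df = pd.DataFrame({'one': [1., 1, 1],
                   'two': [1., 2., 1]})

# as_index=False + size() -> DataFrame with a 'size' column (pandas >= 1.1)
print(df.groupby(df.columns.tolist(), as_index=False).size())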