Remove pandas rows with duplicate indices

Question

How can I remove rows with duplicate index values?

In the weather DataFrame below, sometimes a scientist goes back and corrects observations, not by editing the erroneous rows, but by appending a duplicate row to the end of a file.

I'm reading some automated weather data from the web (observations occur every 5 minutes and are compiled into monthly files for each weather station). After parsing a file, the DataFrame looks like:

                          Sta  Precip1hr  Precip5min  Temp  DewPnt  WindSpd  WindDir  AtmPress
    Date
    2001-01-01 00:00:00  KPDX          0           0     4       3        0        0     30.31
    2001-01-01 00:05:00  KPDX          0           0     4       3        0        0     30.30
    2001-01-01 00:10:00  KPDX          0           0     4       3        4       80     30.30
    2001-01-01 00:15:00  KPDX          0           0     3       2        5       90     30.30
    2001-01-01 00:20:00  KPDX          0           0     3       2       10      110     30.28

Example of a duplicate case:

    import datetime
    import pandas

    startdate = datetime.datetime(2001, 1, 1, 0, 0)
    enddate = datetime.datetime(2001, 1, 1, 5, 0)
    # date_range replaces the DatetimeIndex(start=, end=, freq=) constructor removed in pandas 1.0
    index = pandas.date_range(start=startdate, end=enddate, freq='h')
    data1 = {'A': range(6), 'B': range(6)}
    data2 = {'A': [20, -30, 40], 'B': [-50, 60, -70]}
    df1 = pandas.DataFrame(data=data1, index=index)
    df2 = pandas.DataFrame(data=data2, index=index[:3])
    # pandas.concat replaces DataFrame.append, removed in pandas 2.0
    df3 = pandas.concat([df2, df1])

    df3
                           A   B
    2001-01-01 00:00:00   20 -50
    2001-01-01 01:00:00  -30  60
    2001-01-01 02:00:00   40 -70
    2001-01-01 03:00:00    3   3
    2001-01-01 04:00:00    4   4
    2001-01-01 05:00:00    5   5
    2001-01-01 00:00:00    0   0
    2001-01-01 01:00:00    1   1
    2001-01-01 02:00:00    2   2

And so I need df3 to eventually become:

                           A   B
    2001-01-01 00:00:00    0   0
    2001-01-01 01:00:00    1   1
    2001-01-01 02:00:00    2   2
    2001-01-01 03:00:00    3   3
    2001-01-01 04:00:00    4   4
    2001-01-01 05:00:00    5   5

I thought that adding a column of row numbers (df3['rownum'] = range(df3.shape[0])) would help me select the bottom-most row for any value of the DatetimeIndex, but I am stuck on figuring out the group_by or pivot (or ???) statements to make that work.

User · Answer

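You can drop index duplicates with drop_duplicates:

    df.loc[df.index.drop_duplicates(keep='first')]

Be aware that .loc returns every row carrying each requested label, so when the index itself contains repeated labels this selection still includes the repeated rows. A boolean mask built from df.index.duplicated does not have that problem.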
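As a quick check of the expression above, the sketch below compares label-based selection with a boolean mask. The three-row frame and its labels are made up for illustration, and a reasonably recent pandas is assumed.

```python
import pandas as pd

# Hypothetical frame: the label 'a' appears twice.
df = pd.DataFrame({'A': [20, 0, 3]}, index=['a', 'a', 'b'])

# Index.drop_duplicates returns only the unique labels.
unique_labels = df.index.drop_duplicates(keep='first')

# .loc looks up every row that carries each label, so both 'a' rows come back.
by_label = df.loc[unique_labels]

# A boolean mask selects by position, so one row per label survives.
by_mask = df[~df.index.duplicated(keep='first')]

print(len(by_label), len(by_mask))
```

The two results differ in length, which suggests the mask form is the safer choice whenever the goal is one row per index label.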

User · Answer

I would suggest using the duplicated method on the pandas Index itself:

    df3 = df3[~df3.index.duplicated(keep='first')]

While all the other methods work, the currently accepted answer is by far the least performant for the provided example. Furthermore, while the groupby method is only slightly less performant, I find the duplicated method to be more readable.

Using the sample data provided:

    >>> %timeit df3.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')
    1000 loops, best of 3: 1.54 ms per loop

    >>> %timeit df3.groupby(df3.index).first()
    1000 loops, best of 3: 580 µs per loop

    >>> %timeit df3[~df3.index.duplicated(keep='first')]
    1000 loops, best of 3: 307 µs per loop

Note that you can keep the last element by changing the keep argument to 'last'.

It should also be noted that this method works with MultiIndex as well (using df1 as specified in Paul's example):

    >>> %timeit df1.groupby(level=df1.index.names).last()
    1000 loops, best of 3: 771 µs per loop

    >>> %timeit df1[~df1.index.duplicated(keep='last')]
    1000 loops, best of 3: 365 µs per loop
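To make the effect of the keep argument concrete, here is a minimal sketch on a cut-down version of the question's data. The four-row frame is an assumption for illustration: three original readings followed by one appended correction for 00:00.

```python
import pandas as pd

# Three original readings, then a corrected reading for 00:00 appended at the end.
index = pd.to_datetime([
    '2001-01-01 00:00', '2001-01-01 01:00', '2001-01-01 02:00',
    '2001-01-01 00:00',
])
df3 = pd.DataFrame({'A': [20, -30, 40, 0], 'B': [-50, 60, -70, 0]}, index=index)

# duplicated() flags every repeat except the occurrence named by keep.
mask_last = df3.index.duplicated(keep='last')

# Keep the appended correction, then restore chronological order.
corrected = df3[~mask_last].sort_index()

# keep=False flags every occurrence of a repeated label, which is handy
# for auditing which timestamps were corrected at all.
touched = df3[df3.index.duplicated(keep=False)]

print(corrected)
```

Because the mask only filters rows, the surviving rows stay in file order; the sort_index() call is what puts the corrected 00:00 reading back at the top.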

User · Answer

Unfortunately, I don't think pandas allows one to drop dups off the indices. I would suggest the following:

    df3 = df3.reset_index()  # makes date column part of your data
    df3.columns = ['timestamp', 'A', 'B', 'rownum']  # set names
    df3 = df3.drop_duplicates('timestamp', keep='last').set_index('timestamp')  # done!

User · Answer

If anyone like me likes chainable data manipulation using the pandas dot notation (like piping), then the following may be useful:

    df3 = df3.query('~index.duplicated()')

This enables chaining statements like this:

    df3.assign(C=2).query('~index.duplicated()').mean()

User · Answer

Remove duplicates (keeping first):

    idx = np.unique(df.index.values, return_index=True)[1]
    df = df.iloc[idx]

Remove duplicates (keeping last):

    df = df[::-1]
    df = df.iloc[np.unique(df.index.values, return_index=True)[1]]

Tests: 10k loops using OP's data

    numpy method - 3.03 seconds
    df.loc[~df.index.duplicated(keep='first')] - 4.43 seconds
    df.groupby(df.index).first() - 21 seconds
    reset_index() method - 29 seconds
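The sketch below walks through the NumPy approach on a tiny frame with integer labels (made up for illustration). It relies on np.unique returning the sorted unique values together with the position of each value's first occurrence, so the result comes back ordered by label rather than in the original row order.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: labels 2 and 1 each appear twice.
df = pd.DataFrame({'A': [10, 20, 30, 40]}, index=[2, 1, 2, 1])

# Positions of the first occurrence of each unique label, in sorted label order.
_, first_pos = np.unique(df.index.values, return_index=True)
keep_first = df.iloc[first_pos]

# Reverse the rows so that "first occurrence" in the reversed frame
# corresponds to "last occurrence" in the original.
rev = df[::-1]
_, last_pos = np.unique(rev.index.values, return_index=True)
keep_last = rev.iloc[last_pos]

print(keep_first)
print(keep_last)
```

If the original row order matters, the boolean-mask form preserves it, whereas this approach always sorts by index label as a side effect.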

User · Answer

This adds the index as a DataFrame column, drops duplicates on that, then removes the new column:

    df = (df.reset_index()
            .drop_duplicates(subset='index', keep='last')
            .set_index('index').sort_index())

Note that the use of .sort_index() at the end is optional; use it as needed.
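One detail worth spelling out: reset_index() only calls the new column 'index' when the index has no name. The weather frame in the question has an index named Date, so the column would be 'Date' instead. The sketch below wraps the same round trip in a helper; the helper name is hypothetical, and it assumes a single-level index and no existing column that clashes with the index name.

```python
import pandas as pd


def drop_index_duplicates_via_column(df, keep='last'):
    """Hypothetical helper: dedupe on the index by round-tripping it through a column."""
    # reset_index() names the new column after the index, or 'index' if it is unnamed.
    col = df.index.name if df.index.name is not None else 'index'
    out = df.reset_index().drop_duplicates(subset=col, keep=keep).set_index(col)
    return out.sort_index()


# Two readings plus an appended correction for 00:00, with a named index.
index = pd.to_datetime(['2001-01-01 00:00', '2001-01-01 00:05', '2001-01-01 00:00'])
weather = pd.DataFrame({'Temp': [4, 4, 5]}, index=index)
weather.index.name = 'Date'

clean = drop_index_duplicates_via_column(weather)
print(clean)
```

For an unnamed index, the round trip leaves the result with an index named 'index', which may or may not be wanted.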

User · Answer

Oh my. This is actually so simple!

    grouped = df3.groupby(level=0)
    df4 = grouped.last()
    df4
                          A   B  rownum

    2001-01-01 00:00:00   0   0       6
    2001-01-01 01:00:00   1   1       7
    2001-01-01 02:00:00   2   2       8
    2001-01-01 03:00:00   3   3       3
    2001-01-01 04:00:00   4   4       4
    2001-01-01 05:00:00   5   5       5

Follow up edit 2013-10-29: In the case where I have a fairly complex MultiIndex, I think I prefer the groupby approach. Here's a simple example for posterity:

    import numpy as np
    import pandas

    # fake index
    idx = pandas.MultiIndex.from_tuples([('a', letter) for letter in list('abcde')])

    # random data + naming the index levels
    df1 = pandas.DataFrame(np.random.normal(size=(5, 2)), index=idx, columns=['colA', 'colB'])
    df1.index.names = ['iA', 'iB']

    # artificially append some duplicate data
    # (a boolean mask and pandas.concat stand in for the removed select/append methods)
    dups = df1[df1.index.get_level_values('iB').isin(['c', 'e'])]
    df1 = pandas.concat([df1, dups])
    df1
               colA      colB
    iA iB
    a  a  -1.297535  0.691787
       b  -1.688411  0.404430
       c   0.275806 -0.078871
       d  -0.509815 -0.220326
       e  -0.066680  0.607233
       c   0.275806 -0.078871  <--- dup 1
       e  -0.066680  0.607233  <--- dup 2

and here's the important part:

    # group the data; using df1.index.names tells pandas to look at the entire index
    groups = df1.groupby(level=df1.index.names)
    groups.last()  # or .first()
               colA      colB
    iA iB
    a  a  -1.297535  0.691787
       b  -1.688411  0.404430
       c   0.275806 -0.078871
       d  -0.509815 -0.220326
       e  -0.066680  0.607233
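One behavior of the groupby approach is easy to miss: last() (and first()) by default take the last non-null value in each column separately, so a correction row that leaves a field blank gets merged with the older row rather than replacing it. The sketch below illustrates this on a made-up frame and contrasts it with the boolean-mask form, which keeps whole rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the appended correction for label 0 leaves B blank.
df = pd.DataFrame(
    {'A': [1.0, 2.0, 9.0], 'B': [5.0, 6.0, np.nan]},
    index=[0, 1, 0],
)

# Per-column last non-null value: A comes from the correction, B from the old row.
by_group = df.groupby(level=0).last()

# Whole-row selection: the correction row is kept as is, NaN included.
by_mask = df[~df.index.duplicated(keep='last')].sort_index()

print(by_group)
print(by_mask)
```

Which result is preferable depends on whether a blank in a correction row means "unchanged" or "missing"; with fully populated data, as in the question, both approaches give the same rows.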
