[mysql] How to delete duplicates on a MySQL table?

Deleting duplicate rows in MySQL in-place, (Assuming you have a timestamp col to sort by) walkthrough:

Create the table and insert some rows:

create table penguins(foo int, bar varchar(15), baz datetime);
insert into penguins values(1, 'skipper', now());
insert into penguins values(1, 'skipper', now());
insert into penguins values(3, 'kowalski', now());
insert into penguins values(3, 'kowalski', now());
insert into penguins values(3, 'kowalski', now());
insert into penguins values(4, 'rico', now());
select * from penguins;
    +------+----------+---------------------+
    | foo  | bar      | baz                 |
    +------+----------+---------------------+
    |    1 | skipper  | 2014-08-25 14:21:54 |
    |    1 | skipper  | 2014-08-25 14:21:59 |
    |    3 | kowalski | 2014-08-25 14:22:09 |
    |    3 | kowalski | 2014-08-25 14:22:13 |
    |    3 | kowalski | 2014-08-25 14:22:15 |
    |    4 | rico     | 2014-08-25 14:22:22 |
    +------+----------+---------------------+
6 rows in set (0.00 sec)

Remove the duplicates in place:

delete a
    from penguins a
    left join(
    select max(baz) maxtimestamp, foo, bar
    from penguins
    group by foo, bar) b
    on a.baz = maxtimestamp and
    a.foo = b.foo and
    a.bar = b.bar
    where b.maxtimestamp IS NULL;
Query OK, 3 rows affected (0.01 sec)
select * from penguins;
+------+----------+---------------------+
| foo  | bar      | baz                 |
+------+----------+---------------------+
|    1 | skipper  | 2014-08-25 14:21:59 |
|    3 | kowalski | 2014-08-25 14:22:15 |
|    4 | rico     | 2014-08-25 14:22:22 |
+------+----------+---------------------+
3 rows in set (0.00 sec)

You're done, duplicate rows are removed, last one by timestamp is kept.

For those of you without a timestamp or unique column.

You don't have a timestamp or a unique index column to sort by? You're living in a state of degeneracy. You'll have to do additional steps to delete duplicate rows.

create the penguins table and add some rows

create table penguins(foo int, bar varchar(15)); 
insert into penguins values(1, 'skipper'); 
insert into penguins values(1, 'skipper'); 
insert into penguins values(3, 'kowalski'); 
insert into penguins values(3, 'kowalski'); 
insert into penguins values(3, 'kowalski'); 
insert into penguins values(4, 'rico'); 
select * from penguins; 
    # +------+----------+ 
    # | foo  | bar      | 
    # +------+----------+ 
    # |    1 | skipper  | 
    # |    1 | skipper  | 
    # |    3 | kowalski | 
    # |    3 | kowalski | 
    # |    3 | kowalski | 
    # |    4 | rico     | 
    # +------+----------+ 

make a clone of the first table and copy into it.

drop table if exists penguins_copy; 
create table penguins_copy as ( SELECT foo, bar FROM penguins );  

#add an autoincrementing primary key: 
ALTER TABLE penguins_copy ADD moo int AUTO_INCREMENT PRIMARY KEY first; 

select * from penguins_copy; 
    # +-----+------+----------+ 
    # | moo | foo  | bar      | 
    # +-----+------+----------+ 
    # |   1 |    1 | skipper  | 
    # |   2 |    1 | skipper  | 
    # |   3 |    3 | kowalski | 
    # |   4 |    3 | kowalski | 
    # |   5 |    3 | kowalski | 
    # |   6 |    4 | rico     | 
    # +-----+------+----------+ 

The max aggregate operates upon the new moo index:

delete a from penguins_copy a left join( 
    select max(moo) myindex, foo, bar 
    from penguins_copy 
    group by foo, bar) b 
    on a.moo = b.myindex and 
    a.foo = b.foo and 
    a.bar = b.bar 
    where b.myindex IS NULL; 

#drop the extra column on the copied table 
alter table penguins_copy drop moo; 
select * from penguins_copy; 

#drop the first table and put the copy table back: 
drop table penguins; 
create table penguins select * from penguins_copy; 

observe and cleanup

drop table penguins_copy; 
select * from penguins;
+------+----------+ 
| foo  | bar      | 
+------+----------+ 
|    1 | skipper  | 
|    3 | kowalski | 
|    4 | rico     | 
+------+----------+ 
    Elapsed: 1458.359 milliseconds 

What's that big SQL delete statement doing?

Table penguins with alias 'a' is left joined on a subset of table penguins called alias 'b'. The right hand table 'b' which is a subset finds the max timestamp [ or max moo ] grouped by columns foo and bar. This is matched to left hand table 'a'. (foo,bar,baz) on left has every row in the table. The right hand subset 'b' has a (maxtimestamp,foo,bar) which is matched to left only on the one that IS the max.

Every row that is not that max has value maxtimestamp of NULL. Filter down on those NULL rows and you have a set of all rows grouped by foo and bar that isn't the latest timestamp baz. Delete those ones.

Make a backup of the table before you run this.

Prevent this problem from ever happening again on this table:

If you got this to work, and it put out your "duplicate row" fire. Great. Now define a new composite unique key on your table (on those two columns) to prevent more duplicates from being added in the first place.

Like a good immune system, the bad rows shouldn't even be allowed in to the table at the time of insert. Later on all those programs adding duplicates will broadcast their protest, and when you fix them, this issue never comes up again.