[sql] What's the best way to dedupe a table?

I've seen a couple of solutions for this, but I'm wondering what the best and most efficient way is to de-dupe a table. You can use code (SQL, etc.) to illustrate your point, but I'm just looking for basic algorithms. I assumed there would already be a question about this on SO, but I wasn't able to find one, so if it already exists just give me a heads up.

(Just to clarify - I'm referring to getting rid of duplicates in a table that has an incremental automatic PK and has some rows that are duplicates in everything but the PK field.)

This question is related to sql algorithm performance duplicates

The answer is


I am taking the one from DShook and providing a dedupe example where you would keep only the record with the highest date.

In this example say I have 3 records all with the same app_id, and I only want to keep the one with the highest date:

DELETE t
FROM @USER_OUTBOX_APPS t
INNER JOIN  
(
    SELECT 
         app_id
        ,max(processed_date) as max_processed_date
    FROM @USER_OUTBOX_APPS
    GROUP BY app_id
    HAVING count(*) > 1
) t2 on 
    t.app_id = t2.app_id
WHERE 
    t.processed_date < t2.max_processed_date

These methods will work, but without an explicit id as a PK then determining which rows to delete could be a problem. The bounce out into a temp table delete from original and re-insert without the dupes seems to be the simplest.


You could generate a hash for each row (excluding the PK), store it in a new column (or if you can't add new columns, can you move the table to a temp staging area?), and then look for all other rows with the same hash. Of course, you would have to be able to ensure that your hash function doesn't produce the same code for different rows.

If two rows are duplicate, does it matter which you get rid of? Is it possible that other data are dependent on both of the duplicates? If so, you will have to go through a few steps:

  • Find the dupes
  • Choose one of them as dupeA to eliminate
  • Find all data dependent on dupeA
  • Alter that data to refer to dupeB
  • delete dupeA.

This could be easy or complicated, depending on your existing data model.

This whole scenario sounds like a maintenance and redesign project. If so, best of luck!!


Deduping is rarely simple. That's because the records to be dedupped often have slightly different values is some of the fields. Therefore choose which record to keep can be problematic. Further, dups are often people records and it is hard to identify if the two John Smith's are two people or one person who is duplicated. So spend a lot (50% or more of the whole project) of your time defining what constitutes a dup and how to handle the differences and child records.

How do you know which is the correct value? Further dedupping requires that you handle all child records not orphaning any. What happens when you find that by changing the id on the child record you are suddenly violating one of the unique indexes or constraints - this will happen eventually and your process needs to handle it. If you have chosen foolishly to apply all your constraints only thorough the application, you may not even know the constraints are violated. When you have 10,000 records to dedup, you aren't going to go through the application to dedup one at a time. If the constraint isn't in the database, lots of luck in maintaining data integrity when you dedup.

A further complication is that dups don't always match exactly on the name or address. For instance a salesrep named Joan Martin may be a dup of a sales rep names Joan Martin-Jones especially if they have the same address and email. OR you could have John or Johnny in the name. Or the same street address except one record abbreveiated ST. and one spelled out Street. In SQL server you can use SSIS and fuzzy grouping to also identify near matches. These are often the most common dups as the fact that weren't exact matches is why they got put in as dups in the first place.

For some types of dedupping, you may need a user interface, so that the person doing the dedupping can choose which of two values to use for a particular field. This is especially true if the person who is being dedupped is in two or more roles. It could be that the data for a particular role is usually better than the data for another role. Or it could be that only the users will know for sure which is the correct value or they may need to contact people to find out if they are genuinely dups or simply two people with the same name.


Ran into the problem today, none of the existing answers helped me. Assume you want to deduplicate your table named your_table.

Step 1: Create a new table with deduped values

If borrowed this code from somewhere else on StackOverflow but can't seem to find it again. It works fine against PostgreSQL. It creates a table your_table_deduped where (col1, col2) are unique.

CREATE TABLE your_table_deduped AS
SELECT * FROM your_table WHERE ctid NOT IN
(SELECT ctid FROM
  (SELECT ctid, ROW_NUMBER() OVER
    (PARTITION BY col1, col2 ORDER BY ctid) AS rnum
  FROM your_table) t
WHERE t.rnum > 1);

Step 2: Replace your first table with the deduped copy

We only delete the values in this step, because it allows us to keep the indexes, constraints, etc. in your table.

DELETE FROM your_table;
INSERT INTO your_table
SELECT * FROM your_table_deduped;

Step 3: Delete the deduped copy

DROP TABLE site_daily_kpis_dedup;

And voila, you have deduplicated your table!


Here's the method I use if you can get your dupe criteria into a group by statement and your table has an id identity column for uniqueness:

delete t
from tablename t
inner join  
(
    select date_time, min(id) as min_id
    from tablename
    group by date_time
    having count(*) > 1
) t2 on t.date_time = t2.date_time
where t.id > t2.min_id

In this example the date_time is the grouping criteria, if you have more than one column make sure to join on all of them.


For deduplicate / dedupe / remove duplication / remove repeated rows / ??? ?? / ??? ?? ????, there are multiple ways.

  1. If duplicated rows are exact the same, use group by

    create table TABLE_NAME_DEDUP
    as select column1, column2, ... (all column names) from TABLE_NAME group by column1, column2, -- all column names

Then TABLE_NAME_DEDUP is the deduplicated table.

For example,

create table test (t1 varchar(5), t2 varchar(5));
insert into test  values ('12345', 'ssdlh');
insert into test  values ('12345', 'ssdlh');
create table test_dedup as
select * from test 
group by t1, t2;
-----optional
--remove original table and rename dedup table to previous table
--this is not recommend in dev or qa. DROP table test; Alter table test_dedup rename to test;
  1. You have a rowid, the rowid has duplication but other columns are different Records partial same, this may happened in a transactional system while update a row, and the rows failed to update will have nulls. You want to remove the duplication

    create table test_dedup as select column1, column2, ... (all column names) from ( select * , row_number() over (partition by rowid order by column1, column2, ... (all column names except rowid) ) as cn from test ) where cn =1

This is using the feature that when you use order by, the null value will be ordered behind the non-null value.

create table test (rowid_ varchar(5), t1 varchar(5), t2 varchar(5));
insert into test  values ('12345', 'ssdlh', null);
insert into test  values ('12345', 'ssdlh', 'lhbzj');
create table test_dedup as
select rowid_, t1, t2 from
(select *
  , row_number() over (partition by rowid_ order by t1, t2) as cn
  from  test)
 where cn =1
 ;

-----optional
--remove original table and rename dedup table to previous table
--this is not recommend in dev or qa. DROP table test; Alter table test_dedup rename to test;

delete from yourTable 
where Id not in (
    select min(id) 
    from yourTable
    group by <Unique Columns>
)

where id is whatever is your unique id in the table. (Could be customerNumber or whatever)

If you don't have a Unique Id, you can add one (every SQL table should already have Id as first column, but

ALTER TABLE yourTable
ADD Id int identity(1,1)

Do your delete (above) and then drop the column.

Better than creating a whole new table, or any of the other cryptic stuff I've seen. Note, pretty much the same as in a comment here, but this is what I've done for years.


For those of you who prefer a quick and dirty approach, just list all the columns that together define a unique record and create a unique index with those columns, like so:

ALTER IGNORE TABLE TABLE_NAME ADD UNIQUE (column1,column2,column3)

You can drop the unique index afterwords.


This can dedupe the duplicated values in c1:

select * from foo
minus
select f1.* from foo f1, foo f2
where f1.c1 = f2.c1 and f1.c2 > f2.c2

Here's one I've run into, in real life.

Assume you have a table of external/3rd party logins for users, and you're going to merge two users and want to dedupe on the provider/provider key values.

    ;WITH Logins AS
    (
        SELECT [LoginId],[UserId],[Provider],[ProviderKey]
        FROM [dbo].[UserLogin] 
        WHERE [UserId]=@FromUserID -- is the user we're deleting
              OR [UserId]=@ToUserID -- is the user we're moving data to
    ), Ranked AS 
    (
        SELECT Logins.*
            , [Picker]=ROW_NUMBER() OVER (
                       PARTITION BY [Provider],[ProviderKey]
                       ORDER BY CASE WHEN [UserId]=@FromUserID THEN 1 ELSE 0 END)
        FROM Logins
    )
    MERGE Logins AS T
    USING Ranked AS S
    ON S.[LoginId]=T.[LoginID]
    WHEN MATCHED AND S.[Picker]>1 -- duplicate Provider/ProviderKey
                 AND T.[UserID]=@FromUserID -- safety check 
    THEN DELETE
    WHEN MATCHED AND S.[Picker]=1 -- the only or best one
                 AND T.[UserID]=@FromUserID
    THEN UPDATE SET T.[UserID]=@ToUserID
    OUTPUT $action, DELETED.*, INSERTED.*;

For SQL, you may use the INSERT IGNORE INTO table SELECT xy FROM unkeyed_table;

For an algorithm, if you can assume that to-be-primary keys may be repeated, but a to-be-primary-key uniquely identifies the content of the row, than hash only the to-be-primary key and check for repetition.


SELECT DISTINCT <insert all columns but the PK here> FROM foo. Create a temp table using that query (the syntax varies by RDBMS but there's typically a SELECT … INTO or CREATE TABLE AS pattern available), then blow away the old table and pump the data from the temp table back into it.


I think this should require nothing more then just grouping by all columns except the id and choosing one row from every group - for simplicity just the first row, but this does not actually matter besides you have additional constraints on the id.

Or the other way around to get rid of the rows ... just delete all rows accept a single one from all groups.


Using analytic function row_number:

WITH CTE (col1, col2, dupcnt)
AS
(
SELECT col1, col2,
ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY col1) AS dupcnt
FROM Youtable
)
DELETE
FROM CTE
WHERE dupcnt > 1
GO                                                                 

Adding the actual code here for future reference

So, there are 3 steps, and therefore 3 SQL statements:

Step 1: Move the non duplicates (unique tuples) into a temporary table

CREATE TABLE new_table as
SELECT * FROM old_table WHERE 1 GROUP BY [column to remove duplicates by];

Step 2: delete the old table (or rename it) We no longer need the table with all the duplicate entries, so drop it!

DROP TABLE old_table;

Step 3: rename the new_table to the name of the old_table

RENAME TABLE new_table TO old_table;

And of course, don't forget to fix your buggy code to stop inserting duplicates!


Examples related to sql

Passing multiple values for same variable in stored procedure SQL permissions for roles Generic XSLT Search and Replace template Access And/Or exclusions Pyspark: Filter dataframe based on multiple conditions Subtracting 1 day from a timestamp date PYODBC--Data source name not found and no default driver specified select rows in sql with latest date for each ID repeated multiple times ALTER TABLE DROP COLUMN failed because one or more objects access this column Create Local SQL Server database

Examples related to algorithm

How can I tell if an algorithm is efficient? Find the smallest positive integer that does not occur in a given sequence Efficiently getting all divisors of a given number Peak signal detection in realtime timeseries data What is the optimal algorithm for the game 2048? How can I sort a std::map first by value, then by key? Finding square root without using sqrt function? Fastest way to flatten / un-flatten nested JSON objects Mergesort with Python Find common substring between two strings

Examples related to performance

Why is 2 * (i * i) faster than 2 * i * i in Java? What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism? How to check if a key exists in Json Object and get its value Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly? Most efficient way to map function over numpy array The most efficient way to remove first N elements in a list? Fastest way to get the first n elements of a List into an Array Why is "1000000000000000 in range(1000000000000001)" so fast in Python 3? pandas loc vs. iloc vs. at vs. iat? Android Recyclerview vs ListView with Viewholder

Examples related to duplicates

Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C Remove duplicates from a dataframe in PySpark How to "select distinct" across multiple data frame columns in pandas? How to find duplicate records in PostgreSQL Drop all duplicate rows across multiple columns in Python Pandas Left Join without duplicate rows from left table Finding duplicate integers in an array and display how many times they occurred How do I use SELECT GROUP BY in DataTable.Select(Expression)? How to delete duplicate rows in SQL Server? Python copy files to a new directory and rename if file name already exists