What's the best way to dedupe a table?

Question

I've seen a couple of solutions for this, but I'm wondering what the best and most efficient way is to de-dupe a table. You can use code (SQL, etc.) to illustrate your point, but I'm just looking for basic algorithms. I assumed there would already be a question about this on SO, but I wasn't able to find one, so if it already exists just give me a heads up.

(Just to clarify - I'm referring to getting rid of duplicates in a table that has an incremental automatic PK and has some rows that are duplicates in everything but the PK field.)

User · Answer

I think this should require nothing more than grouping by all columns except the id and choosing one row from every group. For simplicity just take the first row; which row you pick does not actually matter unless you have additional constraints on the id.

Or the other way around, to get rid of the rows: just delete all rows except a single one from each group.
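As a rough illustration of that grouping idea outside of any database, here is a minimal Python sketch. The function name `dedupe_keep_first`, the sample rows, and the assumption that the PK sits in column 0 are all made up for the example.

```python
def dedupe_keep_first(rows, pk_index=0):
    """Return rows with duplicates removed, ignoring the PK column.

    Rows that match on every non-PK column form one group; only the
    row with the smallest PK in each group survives.
    """
    survivors = {}  # maps the non-PK columns to the surviving row
    for row in rows:
        key = row[:pk_index] + row[pk_index + 1:]
        kept = survivors.get(key)
        if kept is None or row[pk_index] < kept[pk_index]:
            survivors[key] = row
    return sorted(survivors.values(), key=lambda r: r[pk_index])


# Hypothetical sample data: (id, name, city)
rows = [
    (1, "alice", "NY"),
    (2, "bob", "LA"),
    (3, "alice", "NY"),  # duplicate of id 1
    (4, "bob", "LA"),    # duplicate of id 2
    (5, "carol", "SF"),
]
unique_rows = dedupe_keep_first(rows)
```

The dictionary makes this a single pass over the data; a database typically does the equivalent work with a hash or sort behind GROUP BY.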

User · Answer

Ran into the problem today; none of the existing answers helped me. Assume you want to deduplicate your table named your_table.

Step 1: Create a new table with deduped values. I borrowed this code from somewhere else on StackOverflow but can't seem to find it again. It works fine against PostgreSQL. It creates a table your_table_deduped where (col1, col2) are unique.

    CREATE TABLE your_table_deduped AS
    SELECT * FROM your_table WHERE ctid NOT IN
    (SELECT ctid FROM
      (SELECT ctid, ROW_NUMBER() OVER
        (PARTITION BY col1, col2 ORDER BY ctid) AS rnum
      FROM your_table) t
    WHERE t.rnum > 1);

Step 2: Replace your first table with the deduped copy. We only delete the values in this step, because it allows us to keep the indexes, constraints, etc. in your_table.

    DELETE FROM your_table;
    INSERT INTO your_table
    SELECT * FROM your_table_deduped;

Step 3: Delete the deduped copy.

    DROP TABLE your_table_deduped;

And voila, you have deduplicated your table!
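The three steps above can be tried end to end without a PostgreSQL server. The sketch below is an adaptation, not the answer's exact code: it uses Python's sqlite3 module, substitutes an ordinary `id` primary key for PostgreSQL's `ctid`, and assumes an SQLite build with window functions (3.25 or later). The sample rows are invented.

```python
import sqlite3


def dedupe_via_copy(conn):
    """Build a deduped copy, refill the original table, drop the copy."""
    conn.executescript("""
        -- Step 1: copy every row that is not a 2nd, 3rd, ... occurrence.
        CREATE TABLE your_table_deduped AS
        SELECT * FROM your_table WHERE id NOT IN
          (SELECT id FROM
            (SELECT id, ROW_NUMBER() OVER
               (PARTITION BY col1, col2 ORDER BY id) AS rnum
             FROM your_table) AS t
           WHERE t.rnum > 1);
        -- Step 2: empty the original (keeping its definition) and refill it.
        DELETE FROM your_table;
        INSERT INTO your_table SELECT * FROM your_table_deduped;
        -- Step 3: remove the temporary copy.
        DROP TABLE your_table_deduped;
    """)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table (id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT)"
)
conn.executemany(
    "INSERT INTO your_table (col1, col2) VALUES (?, ?)",
    [("a", "x"), ("a", "x"), ("b", "y"), ("a", "x"), ("b", "z")],
)
conn.commit()

dedupe_via_copy(conn)
remaining = conn.execute(
    "SELECT id, col1, col2 FROM your_table ORDER BY id"
).fetchall()
```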

User · Answer

Here's the method I use if you can get your dupe criteria into a group by statement and your table has an id identity column for uniqueness:

    delete t
    from tablename t
    inner join
    (
        select date_time, min(id) as min_id
        from tablename
        group by date_time
        having count(*) > 1
    ) t2 on t.date_time = t2.date_time
    where t.id > t2.min_id

In this example the date_time is the grouping criteria; if you have more than one column, make sure to join on all of them.

User · Answer

    delete from yourTable
    where Id not in (
        select min(id)
        from yourTable
        group by <Unique Columns>
    )

where id is whatever is your unique id in the table. (Could be customerNumber or whatever.)

If you don't have a Unique Id, you can add one (every SQL table should already have Id as first column, but):

    ALTER TABLE yourTable ADD Id int identity(1,1)

Do your delete (above) and then drop the column.

Better than creating a whole new table, or any of the other cryptic stuff I've seen.

Note: pretty much the same as in a comment here, but this is what I've done for years.
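For anyone who wants to see the delete run, here is a small sketch using Python's sqlite3 module. The helper name `delete_all_but_min_id` and the `customers` table are hypothetical, and the helper builds SQL from identifier strings, so it should only ever be given trusted table and column names.

```python
import sqlite3


def delete_all_but_min_id(conn, table, id_col, unique_cols):
    """Keep the lowest id in each group of unique_cols, delete the rest.

    Identifiers are interpolated into the SQL text, so pass trusted
    names only. Returns the number of rows deleted.
    """
    cols = ", ".join(unique_cols)
    cur = conn.execute(
        f"DELETE FROM {table} WHERE {id_col} NOT IN "
        f"(SELECT MIN({id_col}) FROM {table} GROUP BY {cols})"
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO customers (name, email) VALUES (?, ?)",
    [("ann", "a@x"), ("bob", "b@x"), ("ann", "a@x"), ("ann", "a@x")],
)
conn.commit()

deleted = delete_all_but_min_id(conn, "customers", "id", ["name", "email"])
remaining = conn.execute("SELECT * FROM customers ORDER BY id").fetchall()
```

One thing to watch with GROUP BY here: rows whose grouped columns are NULL are grouped together, so they are treated as duplicates of each other.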

User · Answer

You could generate a hash for each row (excluding the PK), store it in a new column (or if you can't add new columns, can you move the table to a temp staging area?), and then look for all other rows with the same hash. Of course, you would have to be able to ensure that your hash function doesn't produce the same code for different rows.

If two rows are duplicate, does it matter which you get rid of? Is it possible that other data are dependent on both of the duplicates? If so, you will have to go through a few steps:

- Find the dupes
- Choose one of them as dupeA to eliminate
- Find all data dependent on dupeA
- Alter that data to refer to dupeB
- Delete dupeA

This could be easy or complicated, depending on your existing data model.

This whole scenario sounds like a maintenance and redesign project. If so, best of luck!
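The dupeA/dupeB steps might look roughly like this with one parent and one child table. This is only a sketch using Python's sqlite3 module; the `person` and `orders` tables, their columns, and the rule "keep the lowest id" are assumptions made for the example.

```python
import sqlite3


def merge_duplicate_parents(conn):
    """Re-point child rows at the kept parent, then delete the duplicate."""
    # Find the dupes: pair every duplicate (dupeA) with the row to keep (dupeB).
    pairs = conn.execute("""
        SELECT p.id AS dupe_id, k.keep_id
        FROM person AS p
        JOIN (SELECT name, email, MIN(id) AS keep_id
              FROM person GROUP BY name, email) AS k
          ON p.name = k.name AND p.email = k.email
        WHERE p.id <> k.keep_id
    """).fetchall()
    for dupe_id, keep_id in pairs:
        # Alter dependent data to refer to dupeB ...
        conn.execute(
            "UPDATE orders SET person_id = ? WHERE person_id = ?",
            (keep_id, dupe_id),
        )
        # ... and only then delete dupeA.
        conn.execute("DELETE FROM person WHERE id = ?", (dupe_id,))
    conn.commit()
    return pairs


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        person_id INTEGER REFERENCES person(id),
        item TEXT
    );
    INSERT INTO person (name, email) VALUES
        ('ann', 'a@x'), ('bob', 'b@x'), ('ann', 'a@x');
    INSERT INTO orders (person_id, item) VALUES
        (1, 'book'), (3, 'pen'), (2, 'cup');
""")

merged = merge_duplicate_parents(conn)
people = conn.execute("SELECT id FROM person ORDER BY id").fetchall()
order_owners = conn.execute(
    "SELECT item, person_id FROM orders ORDER BY id"
).fetchall()
```

Because the foreign key is enforced, deleting the duplicate before re-pointing its orders would fail, which is the order of operations the list above implies.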

User · Answer

Adding the actual code here for future reference.

So, there are 3 steps, and therefore 3 SQL statements:

Step 1: Move the non-duplicates (unique tuples) into a temporary table.

    CREATE TABLE new_table AS
    SELECT * FROM old_table WHERE 1 GROUP BY [column to remove duplicates by];

Step 2: Delete the old table (or rename it). We no longer need the table with all the duplicate entries, so drop it!

    DROP TABLE old_table;

Step 3: Rename the new table to the name of the old table.

    RENAME TABLE new_table TO old_table;

And of course, don't forget to fix your buggy code to stop inserting duplicates!

User · Answer

Here's one I've run into, in real life.

Assume you have a table of external/3rd party logins for users, and you're going to merge two users and want to dedupe on the provider/provider key values.

    ;WITH Logins AS
    (
        SELECT [LoginId], [UserId], [Provider], [ProviderKey]
        FROM [dbo].[UserLogin]
        WHERE [UserId] = @FromUserID -- is the user we're deleting
           OR [UserId] = @ToUserID   -- is the user we're moving data to
    ), Ranked AS
    (
        SELECT Logins.*
            , [Picker] = ROW_NUMBER() OVER (
                   PARTITION BY [Provider], [ProviderKey]
                   ORDER BY CASE WHEN [UserId] = @FromUserID THEN 1 ELSE 0 END)
        FROM Logins
    )
    MERGE Logins AS T
    USING Ranked AS S
    ON S.[LoginId] = T.[LoginID]
    WHEN MATCHED AND S.[Picker] > 1 -- duplicate Provider/ProviderKey
                 AND T.[UserID] = @FromUserID -- safety check
    THEN DELETE
    WHEN MATCHED AND S.[Picker] = 1 -- the only or best one
                 AND T.[UserID] = @FromUserID
    THEN UPDATE SET T.[UserID] = @ToUserID
    OUTPUT $action, DELETED.*, INSERTED.*;

User · Answer

For SQL, you may use INSERT IGNORE INTO table SELECT xy FROM unkeyed_table;

For an algorithm, if you can assume that to-be-primary keys may be repeated, but a to-be-primary key uniquely identifies the content of the row, then hash only the to-be-primary key and check for repetition.
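Both ideas can be sketched in a few lines of Python. INSERT IGNORE is MySQL syntax; the example below swaps in SQLite's equivalent spelling, INSERT OR IGNORE, so that it runs with the standard sqlite3 module. The table names, sample rows, and the helper `first_row_per_key` are assumptions for illustration.

```python
import sqlite3

# SQL route: copy into a table whose primary key rejects repeats.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE unkeyed_table (code TEXT, label TEXT);
    INSERT INTO unkeyed_table VALUES
        ('A1', 'first'), ('B2', 'second'), ('A1', 'first'),
        ('B2', 'second'), ('C3', 'third');
    CREATE TABLE keyed_table (code TEXT PRIMARY KEY, label TEXT);
    -- Rows whose key already exists are silently skipped.
    INSERT OR IGNORE INTO keyed_table
        SELECT code, label FROM unkeyed_table;
""")
keyed_rows = conn.execute(
    "SELECT code, label FROM keyed_table ORDER BY code"
).fetchall()


# Algorithm route: hash only the to-be-primary key and check for repeats.
def first_row_per_key(rows, key_index=0):
    seen = set()  # a Python set is a hash table over the keys
    unique = []
    for row in rows:
        if row[key_index] not in seen:
            seen.add(row[key_index])
            unique.append(row)
    return unique


source_rows = conn.execute("SELECT code, label FROM unkeyed_table").fetchall()
hashed_rows = sorted(first_row_per_key(source_rows))
```

Both routes rely on the stated assumption that the key alone determines the rest of the row; if it does not, they silently drop rows that differ in other columns.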

User · Answer

These methods will work, but without an explicit id as a PK, determining which rows to delete could be a problem. The approach of bouncing out into a temp table, deleting from the original, and re-inserting without the dupes seems to be the simplest.

User · Answer

To deduplicate (dedupe, remove duplication, remove repeated rows), there are multiple ways.

1. If duplicated rows are exactly the same, use group by:

    create table TABLE_NAME_DEDUP
    as select column1, column2, ... (all column names)
    from TABLE_NAME group by column1, column2 -- all column names

Then TABLE_NAME_DEDUP is the deduplicated table.

For example:

    create table test (t1 varchar(5), t2 varchar(5));
    insert into test values ('12345', 'ssdlh');
    insert into test values ('12345', 'ssdlh');
    create table test_dedup as
    select * from test
    group by t1, t2;
    -----optional
    --remove original table and rename dedup table to previous table
    --this is not recommended in dev or qa
    DROP table test;
    Alter table test_dedup rename to test;

2. You have a rowid; the rowid is duplicated but other columns are different. The records are only partially the same. This may happen in a transactional system while updating a row, where the rows that failed to update will have nulls. You want to remove the duplication:

    create table test_dedup as
    select column1, column2, ... (all column names)
    from (
        select *
            , row_number() over (partition by rowid order by column1, column2, ... (all column names except rowid)) as cn
        from test
    )
    where cn = 1

This relies on the null value being ordered behind the non-null value when you use order by. That holds for ascending order in Oracle and PostgreSQL, for example; SQL Server, MySQL, and SQLite sort nulls first by default, so check your database.

    create table test (rowid_ varchar(5), t1 varchar(5), t2 varchar(5));
    insert into test values ('12345', 'ssdlh', null);
    insert into test values ('12345', 'ssdlh', 'lhbzj');
    create table test_dedup as
    select rowid_, t1, t2 from
    (
        select *
            , row_number() over (partition by rowid_ order by t1, t2) as cn
        from test
    )
    where cn = 1;
    -----optional
    --remove original table and rename dedup table to previous table
    --this is not recommended in dev or qa
    DROP table test;
    Alter table test_dedup rename to test;

User · Answer

    SELECT DISTINCT <insert all columns but the PK here> FROM foo

Create a temp table using that query (the syntax varies by RDBMS, but there's typically a SELECT ... INTO or CREATE TABLE AS pattern available), then blow away the old table and pump the data from the temp table back into it.
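A minimal sketch of that temp-table round trip, using Python's sqlite3 module and SQLite's CREATE TEMP TABLE ... AS form. The `foo` columns and sample rows are invented, and the sketch assumes it is acceptable for the surviving rows to receive fresh PK values.

```python
import sqlite3


def dedupe_with_distinct(conn):
    """Round-trip the distinct non-PK columns through a temp table."""
    conn.executescript("""
        CREATE TEMP TABLE foo_distinct AS
            SELECT DISTINCT name, city FROM foo;
        DELETE FROM foo;
        -- The PK column is left out, so new ids are generated.
        INSERT INTO foo (name, city)
            SELECT name, city FROM foo_distinct;
        DROP TABLE foo_distinct;
    """)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO foo (name, city) VALUES (?, ?)",
    [("ann", "NY"), ("bob", "LA"), ("ann", "NY"), ("bob", "LA"), ("cy", "SF")],
)
conn.commit()

dedupe_with_distinct(conn)
rows_after = sorted(conn.execute("SELECT name, city FROM foo").fetchall())
```

Since the ids change, this variant only fits tables that nothing else references by PK.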

User · Answer

Using the analytic function row_number:

    WITH CTE (col1, col2, dupcnt)
    AS
    (
        SELECT col1, col2,
            ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY col1) AS dupcnt
        FROM Youtable
    )
    DELETE
    FROM CTE
    WHERE dupcnt > 1
    GO

User · Answer

For those of you who prefer a quick and dirty approach, just list all the columns that together define a unique record and create a unique index with those columns, like so:

    ALTER IGNORE TABLE `TABLE_NAME` ADD UNIQUE (column1, column2, column3)

You can drop the unique index afterwards. (This is MySQL syntax, and the IGNORE clause of ALTER TABLE was removed in MySQL 5.7.4.)

User · Answer

I am taking the one from DShook and providing a dedupe example where you would keep only the record with the highest date.

In this example say I have 3 records all with the same app_id, and I only want to keep the one with the highest date:

    DELETE t
    FROM USER_OUTBOX_APPS t
    INNER JOIN
    (
        SELECT
             app_id
            ,max(processed_date) as max_processed_date
        FROM USER_OUTBOX_APPS
        GROUP BY app_id
        HAVING count(*) > 1
    ) t2 on
        t.app_id = t2.app_id
    WHERE
        t.processed_date < t2.max_processed_date
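SQLite has no DELETE ... JOIN, so the runnable sketch below expresses the same "keep the newest row per app_id" rule with a correlated subquery instead. It uses Python's sqlite3 module; the lowercase table name, the ISO-formatted text dates, and the sample rows are assumptions.

```python
import sqlite3


def keep_latest_per_app(conn):
    """Delete every row that is older than the newest row for its app_id."""
    cur = conn.execute("""
        DELETE FROM user_outbox_apps
        WHERE processed_date < (
            SELECT MAX(o.processed_date)
            FROM user_outbox_apps AS o
            WHERE o.app_id = user_outbox_apps.app_id
        )
    """)
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_outbox_apps (
        id INTEGER PRIMARY KEY,
        app_id INTEGER,
        processed_date TEXT  -- ISO dates compare correctly as text
    )
""")
conn.executemany(
    "INSERT INTO user_outbox_apps (app_id, processed_date) VALUES (?, ?)",
    [(1, "2020-01-01"), (1, "2020-03-01"), (1, "2020-02-01"), (2, "2020-05-05")],
)
conn.commit()

deleted = keep_latest_per_app(conn)
remaining = conn.execute(
    "SELECT app_id, processed_date FROM user_outbox_apps ORDER BY app_id"
).fetchall()
```

As with the join version, two rows that share the same highest date both survive, so a tie-breaker on the id would still be needed in that case.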

User · Answer

This can dedupe the duplicated values in c1:

    select *
    from foo
    minus
    select f1.*
    from foo f1, foo f2
    where f1.c1 = f2.c1 and f1.c2 > f2.c2

User · Answer

Deduping is rarely simple. That's because the records to be deduped often have slightly different values in some of the fields, so choosing which record to keep can be problematic. Further, dups are often people records, and it is hard to tell whether the two John Smiths are two people or one person who is duplicated. So spend a lot of your time (50% or more of the whole project) defining what constitutes a dup and how to handle the differences and child records.

How do you know which is the correct value? Further, deduping requires that you handle all child records, not orphaning any. What happens when you find that by changing the id on the child record you are suddenly violating one of the unique indexes or constraints? This will happen eventually and your process needs to handle it. If you have chosen foolishly to apply all your constraints only through the application, you may not even know the constraints are violated. When you have 10,000 records to dedup, you aren't going to go through the application to dedup one at a time. If the constraint isn't in the database, lots of luck in maintaining data integrity when you dedup.

A further complication is that dups don't always match exactly on the name or address. For instance, a sales rep named Joan Martin may be a dup of a sales rep named Joan Martin-Jones, especially if they have the same address and email. Or you could have John or Johnny in the name. Or the same street address, except one record abbreviated it as ST. and one spelled out Street. In SQL Server you can use SSIS and fuzzy grouping to also identify near matches. These are often the most common dups, as the fact that they weren't exact matches is why they got put in as dups in the first place.

For some types of deduping, you may need a user interface, so that the person doing the deduping can choose which of two values to use for a particular field. This is especially true if the person who is being deduped is in two or more roles. It could be that the data for a particular role is usually better than the data for another role. Or it could be that only the users will know for sure which is the correct value, or they may need to contact people to find out if they are genuinely dups or simply two people with the same name.
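To make the ST. versus Street case concrete, here is a small Python sketch that builds a normalized match key and groups records sharing it. It is a simple stand-in, not SSIS fuzzy grouping; the abbreviation table, function names, and sample addresses are all invented, and the output is only a list of candidates for a person to review.

```python
import re
from collections import defaultdict

# Illustrative only; a real list would be much longer and locale-specific.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}


def match_key(text):
    """Lowercase, drop punctuation, and expand known abbreviations."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def candidate_groups(records):
    """Return lists of record ids whose addresses share a match key."""
    groups = defaultdict(list)
    for rec_id, address in records:
        groups[match_key(address)].append(rec_id)
    return [ids for ids in groups.values() if len(ids) > 1]


records = [
    (1, "12 Main ST."),
    (2, "12 main Street"),
    (3, "9 Oak Ave"),
    (4, "40 Elm Rd"),
]
candidates = candidate_groups(records)
```

A rule like this will also produce false matches (an "St" that means Saint, for instance), which is one more reason to route the candidates to a human rather than delete automatically.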
