How can I remove duplicate rows

Question

What is the best way to remove duplicate rows from a fairly large SQL Server table  i e  300 000  rows    The rows  of course  will not be perfect duplicates because of the existence of the RowID identity field   MyTable  RowID int not null identity 1 1  primary key  Col1 varchar 20  not null  Col2 varchar 2048  not null  Col3 tinyint not null

User · Answer

Oh sure  Use a temp table  If you want a single  not-very-performant statement that  works  you can go with   DELETE FROM MyTable WHERE NOT RowID IN      SELECT           SELECT TOP 1 RowID FROM MyTable mt2          WHERE mt2 Col1   mt Col1          AND mt2 Col2   mt Col2          AND mt2 Col3   mt Col3       FROM MyTable mt    Basically  for each row in the table  the sub-select finds the top RowID of all rows that are exactly like the row under consideration  So you end up with a list of RowIDs that represent the  original  non-duplicated rows

User · Answer

I had a table where I needed to preserve non-duplicate rows  I m not sure on the speed or efficiency   DELETE FROM myTable WHERE RowID IN     SELECT MIN RowID  AS IDNo FROM myTable   GROUP BY Col1  Col2  Col3   HAVING COUNT      2

User · Answer

To Fetch Duplicate Rows   SELECT name  email  COUNT    FROM  users GROUP BY name  email HAVING COUNT     gt  1   To Delete the Duplicate Rows   DELETE users  WHERE rowid NOT IN   SELECT MIN rowid  FROM users GROUP BY name  email

User · Answer

The other way is Create a new table with same fields and with Unique Index  Then move all data from old table to new table  Automatically SQL SERVER ignore  there is also an option about what to do if there will be a duplicate value  ignore  interrupt or sth  duplicate values  So we have the same table without duplicate rows  If you don t want Unique Index  after the transfer data you can drop it   Especially for larger tables you may use DTS  SSIS package to import export data  in order to transfer all data rapidly to your new uniquely indexed table  For 7 million row it takes just a few minute

User · Answer

Here is another good article on removing duplicates   It discusses why its hard   SQL is based on relational algebra  and duplicates cannot occur in relational algebra  because duplicates are not allowed in a set    The temp table solution  and two mysql examples   In the future are you going to prevent it at a database level  or from an application perspective   I would suggest the database level because your database should be responsible for maintaining referential integrity  developers just will cause problems

User · Answer

This is the easiest way to delete duplicate record  DELETE FROM tblemp WHERE id IN       SELECT MIN id  FROM tblemp    GROUP BY  title HAVING COUNT id  gt 1

User · Answer

Another way of doing this  --  DELETE A FROM   TABLE A         TABLE B WHERE  A COL1   B COL1        AND A COL2   B COL2        AND A UNIQUEFIELD  gt  B UNIQUEFIELD

User · Answer

CREATE TABLE car Id int identity 1 1   PersonId int  CarId int   INSERT INTO car PersonId CarId  VALUES 1 2   1 3   1 2   2 4   --SELECT   FROM car   WITH CTE as  SELECT ROW NUMBER   over  PARTITION BY personid carid order by personid carid  as rn Id PersonID CarId from car   DELETE FROM car where Id in SELECT Id FROM CTE WHERE rn gt 1

User · Answer

Create new blank table with the same structure Execute query like this  INSERT INTO tc category1 SELECT   FROM tc category GROUP BY category id  application id HAVING count     gt  1  Then execute this query  INSERT INTO tc category1 SELECT   FROM tc category GROUP BY category id  application id HAVING count      1

User · Answer

Now lets look elasticalsearch table which this tables has duplicated rows and Id is identical uniq field  We know if some id exist by a group criteria then we can delete other rows outscope of this group  My manner shows this criteria   So many case of this thread are in the like state of mine  Just change your target group criteria according your case for deleting repeated  duplicated  rows   DELETE  FROM elasticalsearch WHERE Id NOT IN                  SELECT min Id                       FROM elasticalsearch                      GROUP BY FirmId FilterSearchString                           cheers

User · Answer

I would mention this approach as well as it can be helpful  and works in all SQL servers  Pretty often there is only one - two duplicates  and  Ids and count of duplicates are known  In this case   SET ROWCOUNT 1 -- or set to number of rows to be deleted delete from myTable where RowId   DuplicatedID SET ROWCOUNT 0

User · Answer

If all the columns in duplicate rows are same then below query can be used to delete the duplicate records   SELECT DISTINCT   INTO  TemNewTable FROM  OriginalTable TRUNCATE TABLE  OriginalTable INSERT INTO  OriginalTable SELECT   FROM  TemNewTable DROP TABLE  TemNewTable

User · Answer

I prefer the subquery having count      1 solution to the inner join because I found it easier to read and it was very easy to turn into a SELECT statement to verify what would be deleted before you run it    --DELETE FROM table1  --WHERE id IN         SELECT MIN id  FROM table1       GROUP BY col1  col2  col3       -- could add a WHERE clause here to further filter      HAVING count     gt  1 --

User · Answer

Quick and Dirty to delete exact duplicated rows  for small tables    select  distinct   into t2 from t1  delete from t1  insert into t1 select    from t2  drop table t2

User · Answer

By useing below query we can able to delete duplicate records based on the single column or multiple column  below query is deleting based on two columns  table name is  testing and column names empno empname  DELETE FROM testing WHERE empno not IN  SELECT empno FROM  SELECT empno  ROW NUMBER   OVER  PARTITION BY empno ORDER BY empno   AS  ItemNumber  FROM testing  a WHERE ItemNumber  gt  1  or empname not in  select empname from  select empname row number   over PARTITION BY empno ORDER BY empno   AS  ItemNumber  FROM testing  a WHERE ItemNumber  gt  1

User · Answer

For the table structure MyTable RowID int not null identity 1 1  primary key  Col1 varchar 20  not null  Col2 varchar 2048  not null  Col3 tinyint not null  The query for removing duplicates  DELETE t1 FROM MyTable t1 INNER JOIN MyTable t2 WHERE t1 RowID  gt  t2 RowID   AND t1 Col1   t2 Col1   AND t1 Col2 t2 Col2   AND t1 Col3 t2 Col3    I am assuming that RowID is kind of auto-increment and rest of the columns have duplicate values

User · Answer

I would prefer CTE for deleting duplicate rows from sql server table  strongly recommend to follow this article   http   codaffection com sql-server-article delete-duplicate-rows-in-sql-server      by keeping original   WITH CTE AS   SELECT   ROW NUMBER   OVER  PARTITION BY col1 col2 col3 ORDER BY col1 col2 col3  AS RN FROM MyTable    DELETE FROM CTE WHERE RN lt  gt 1      without keeping original   WITH CTE AS  SELECT   R RANK   OVER  ORDER BY col1 col2 col3  FROM MyTable     DELETE CTE WHERE R IN  SELECT R FROM CTE GROUP BY R HAVING COUNT    gt 1

User · Answer

DELETE LU  FROM    SELECT                    Row number                     OVER                       partition BY col1  col1  col3                     ORDER BY rowid DESC   Row           FROM   mytable  LU  WHERE   row   gt  1

User · Answer

Sometimes a soft delete mechanism is used where a date is recorded to indicate the deleted date  In this case an UPDATE statement may be used to update this field based on duplicate entries   UPDATE MY TABLE    SET DELETED   getDate    WHERE TABLE ID IN       SELECT x TABLE ID       FROM MY TABLE x       JOIN  SELECT min TABLE ID  id  COL 1  COL 2  COL 3               FROM MY TABLE d              GROUP BY d COL 1  d COL 2  d COL 3             HAVING count     gt  1  AS d ON d COL 1   x COL 1                                      AND d COL 2   x COL 2                                      AND d COL 3   x COL 3                                      AND d TABLE ID  lt  gt  x TABLE ID                WHERE x COL 4  lt  gt   D  -- Additional filter      This method has served me well for fairly moderate tables containing  30 million rows with high and low amounts of duplications

User · Answer

I know that this question has been already answered  but I ve created pretty useful sp which will create a dynamic delete statement for a table duplicates       CREATE PROCEDURE sp DeleteDuplicate  tableName varchar 100    DebugMode int  1 AS  BEGIN SET NOCOUNT ON   IF OBJECT ID  tempdb   tableMatrix   is not null  DROP TABLE  tableMatrix   SELECT ROW NUMBER   OVER ORDER BY name  as rn name into  tableMatrix FROM sys columns where  object id    object id  tableName  ORDER BY name  DECLARE  MaxRow int    SELECT MAX rn  from  tableMatrix  IF  MaxRow is null      RAISERROR    I wasn  t able to find any columns for this table   16 1  ELSE      BEGIN DECLARE  i int  1  DECLARE  Columns Varchar max        WHILE   i  lt    MaxRow  BEGIN      SET  Columns  Columns  SELECT     name      from  tableMatrix where rn    i       SET  i    i 1  END  ---DELETE LAST comma SET  Columns   LEFT  Columns LEN  Columns -1   DECLARE  Sql nvarchar max      WITH cteRowsToDelte      AS   SELECT ROW NUMBER   OVER  PARTITION BY    Columns   ORDER BY   SELECT 0   as rowNumber   FROM    tableName      DELETE FROM cteRowsToDelte WHERE  rowNumber  gt  1    SET NOCOUNT OFF      IF  DebugMode   1         SELECT  Sql     ELSE        EXEC sp executesql  Sql     END END   So if you create table like that   IF OBJECT ID  MyLitleTable   is not null      DROP TABLE MyLitleTable    CREATE TABLE MyLitleTable       A Varchar 10       B money      C int   ---------------------------------------------------------      INSERT INTO MyLitleTable VALUES       ABC  100 1         ABC  100 1   -- only this row should be deleted       ABC  101 1         ABC  100 2         ABCD  100 1       -----------------------------------------------------------       exec sp DeleteDuplicate  MyLitleTable  0   It will delete all duplicates from your table  If you run it without the second parameter it will return a SQL statement to run   If you need to exclude any of the column just run it in the debug mode get the code and modify it whatever you like

User · Answer

DELETE  FROM MyTable WHERE NOT EXISTS                 SELECT min RowID                FROM Mytable               WHERE  SELECT RowID                       FROM Mytable                      GROUP BY Col1  Col2  Col3

User · Answer

I think this would be helpfull  Here  ROW NUMBER   OVER PARTITION BY res1 Title ORDER BY res1 Id as num has been used to differentiate  the duplicate rows   delete FROM  SELECT res1   ROW NUMBER   OVER PARTITION BY res1 Title ORDER BY res1 Id as num  FROM   select   from  dbo   tbl countries  as res1  as res2 WHERE res2 num  gt  1

User · Answer

SELECT  DISTINCT         INTO tempdb dbo tmpTable FROM myTable  TRUNCATE TABLE myTable INSERT INTO myTable SELECT   FROM tempdb dbo tmpTable DROP TABLE tempdb dbo tmpTable

User · Answer

Use this  WITH tblTemp as   SELECT ROW NUMBER   Over PARTITION BY Name Department ORDER BY Name     As RowNumber   FROM  lt table name gt    DELETE FROM tblTemp where RowNumber  gt 1

User · Answer

alter table MyTable add sno int identity 1 1      delete from MyTable where sno in           select sno from       select        RANK   OVER   PARTITION BY RowID Col3 ORDER BY sno DESC  rank     From MyTable      T     where rank gt 1            alter table MyTable      drop  column sno

User · Answer

Assuming no nulls  you GROUP BY the unique columns  and SELECT the MIN  or MAX  RowId as the row to keep  Then  just delete everything that didn t have a row id   DELETE FROM MyTable LEFT OUTER JOIN      SELECT MIN RowId  as RowId  Col1  Col2  Col3     FROM MyTable     GROUP BY Col1  Col2  Col3   as KeepRows ON    MyTable RowId   KeepRows RowId WHERE    KeepRows RowId IS NULL   In case you have a GUID instead of an integer  you can replace  MIN RowId    with  CONVERT uniqueidentifier  MIN CONVERT char 36   MyGuidColumn

User · Answer

The following query is useful to delete duplicate rows  The table in this example has ID as an identity column and the columns which have duplicate data are Column1  Column2 and Column3   DELETE FROM TableName WHERE  ID NOT IN  SELECT MAX ID                    FROM   TableName                   GROUP  BY Column1                              Column2                              Column3                     Even if ID is not null-able SQL Server treats MAX ID  as potentially                     nullable  Because of semantics of NOT IN  NULL  including the clause                     below can simplify the plan                     HAVING MAX ID  IS NOT NULL     The following script shows usage of GROUP BY  HAVING  ORDER BY in one query  and returns the results with duplicate column and its count   SELECT YourColumnName         COUNT    TotalCount FROM   YourTableName GROUP  BY YourColumnName HAVING COUNT     gt  1 ORDER  BY COUNT    DESC

User · Answer

I you want to preview the rows you are about to remove and keep control over which of the duplicate rows to keep  See http   developer azurewebsites net 2014 09 better-sql-group-by-find-duplicate-data   with MYCTE as     SELECT ROW NUMBER   OVER       PARTITION BY DuplicateKey1                  DuplicateKey2 -- optional     ORDER BY CreatedAt -- the first row among duplicates will be kept  other rows will be removed     RN   FROM MyTable   DELETE FROM MYCTE WHERE RN  gt  1

User · Answer

Another possible way of doing this is      --Ensure that any immediately preceding statement is terminated with a semicolon above WITH cte      AS  SELECT ROW NUMBER   OVER  PARTITION BY Col1  Col2  Col3                                         ORDER BY   SELECT 0   RN          FROM    MyTable  DELETE FROM cte WHERE  RN  gt  1    I am using ORDER BY  SELECT 0  above as it is arbitrary which row to preserve in the event of a tie    To preserve the latest one in RowID order for example you could use ORDER BY RowID DESC   Execution Plans  The execution plan for this is often simpler and more efficient than that in the accepted answer as it does not require the self join     This is not always the case however  One place where the GROUP BY solution might be preferred is situations where a hash aggregate would be chosen in preference to a stream aggregate    The ROW NUMBER solution will always give pretty much the same plan whereas the GROUP BY strategy is more flexible     Factors which might favour the hash aggregate approach would be    No useful index on the partitioning columns relatively fewer groups with relatively more duplicates in each group    In extreme versions of this second case  if there are very few groups with many duplicates in each  one could also consider simply inserting the rows to keep into a new table then TRUNCATE-ing the original and copying them back to minimise logging compared to deleting a very high proportion of the rows

User · Answer

There s a good article on removing duplicates on the Microsoft Support site  It s pretty conservative - they have you do everything in separate steps - but it should work well against large tables   I ve used self-joins to do this in the past  although it could probably be prettied up with a HAVING clause   DELETE dupes FROM MyTable dupes  MyTable fullTable WHERE dupes dupField   fullTable dupField  AND dupes secondDupField   fullTable secondDupField  AND dupes uniqueField  gt  fullTable uniqueField

User · Answer

This will delete duplicate rows  except the first row  DELETE FROM     Mytable WHERE     RowID NOT IN           SELECT             MIN RowID          FROM             Mytable         GROUP BY             Col1              Col2              Col3         Refer  http   www codeproject com Articles 157977 Remove-Duplicate-Rows-from-a-Table-in-SQL-Server

User · Answer

From the application level  unfortunately   I agree that the proper way to prevent duplication is at the database level through the use of a unique index  but in SQL Server 2005  an index is allowed to be only 900 bytes  and my varchar 2048  field blows that away    I dunno how well it would perform  but I think you could write a trigger to enforce this  even if you couldn t do it directly with an index   Something like   -- given a table stories story id int not null primary key  story varchar max  not null  CREATE TRIGGER prevent plagiarism  ON stories  after INSERT  UPDATE  AS      DECLARE  cnt AS INT       SELECT  cnt   Count         FROM   stories             INNER JOIN inserted                     ON   stories story   inserted story                          AND stories story id    inserted story id         IF  cnt  gt  0        BEGIN            RAISERROR  plagiarism detected  16 1              ROLLBACK TRANSACTION        END    Also  varchar 2048  sounds fishy to me  some things in life are 2048 bytes  but it s pretty uncommon   should it really not be varchar max

User · Answer

delete t1 from table t1  table t2 where t1 columnA   t2 columnA and t1 rowid gt t2 rowid   Postgres    delete from table t1 using table t2 where t1 columnA   t2 columnA and t1 rowid  gt  t2 rowid

User · Answer

Yet another easy solution can be found at the link pasted here  This one easy to grasp and seems to be effective for most of the similar problems  It is for SQL Server though but the concept used is more than acceptable   Here are the relevant portions from the linked page   Consider this data   EMPLOYEE ID ATTENDANCE DATE A001    2011-01-01 A001    2011-01-01 A002    2011-01-01 A002    2011-01-01 A002    2011-01-01 A003    2011-01-01   So how can we delete those duplicate data   First  insert an identity column in that table by using the following code   ALTER TABLE dbo ATTENDANCE ADD AUTOID INT IDENTITY 1 1      Use the following code to resolve it   DELETE FROM dbo ATTENDANCE WHERE AUTOID NOT IN  SELECT MIN AUTOID        FROM dbo ATTENDANCE GROUP BY EMPLOYEE ID ATTENDANCE DATE

User · Answer

Using CTE  The idea is to join on one or more columns that form a duplicate record and then remove whichever you like    with cte as       select          min PrimaryKey  as PrimaryKey         UniqueColumn1          UniqueColumn2     from dbo DuplicatesTable      group by         UniqueColumn1  UniqueColumn1     having count     gt  1   delete d from dbo DuplicatesTable d  inner join cte on      d PrimaryKey  gt  cte PrimaryKey and     d UniqueColumn1   cte UniqueColumn1 and      d UniqueColumn2   cte UniqueColumn2

User · Answer

DELETE FROM     table name T1 WHERE     rowid  gt            SELECT             min rowid          FROM             table name T2         WHERE             T1 column name   T2 column name

User · Answer

I thought I d share my solution since it works under special circumstances  I my case the table with duplicate values did not have a foreign key  because the values were duplicated from another db    begin transaction -- create temp table with identical structure as source table Select   Into  temp From tableName Where 1   2  -- insert distinct values into temp insert into  temp  select distinct    from  tableName  -- delete from source delete from tableName   -- insert into source from temp insert into tableName  select    from  temp  rollback transaction -- if this works  change rollback to commit and execute again to keep you changes     PS  when working on things like this I always use a transaction  this not only ensures everything is executed as a whole  but also allows me to test without risking anything  But off course you should take a backup anyway just to be sure

User · Answer

This query showed very good performance for me   DELETE tbl FROM     MyTable tbl WHERE     EXISTS           SELECT                       FROM             MyTable tbl2         WHERE             tbl2 SameValue   tbl SameValue         AND tbl IdUniqueValue  lt  tbl2 IdUniqueValue         it deleted 1M rows in little more than 30sec from a table of 2M  50  duplicates

[sql-server] How can I remove duplicate rows?

Examples related to sql-server

Examples related to tsql

Examples related to duplicates