[sql] How do I find duplicates across multiple columns?

So I want to do something like this sql code below:

select s.id, s.name,s.city 
from stuff s
group by s.name having count(where city and name are identical) > 1

To produce the following, (but ignore where only name or only city match, it has to be on both columns):

id      name  city   
904834  jim   London  
904835  jim   London  
90145   Fred  Paris   
90132   Fred  Paris
90133   Fred  Paris

This question is related to sql sql-server sql-server-2008 duplicates

The answer is


A little late to the game on this post, but I found this way to be pretty flexible / efficient

select 
    s1.id
    ,s1.name
    ,s1.city 
from 
    stuff s1
    ,stuff s2
Where
    s1.id <> s2.id
    and s1.name = s2.name
    and s1.city = s2.city

You have to self join stuff and match name and city. Then group by count.

select 
   s.id, s.name, s.city 
from stuff s join stuff p ON (
   s.name = p.city OR s.city = p.name
)
group by s.name having count(s.name) > 1

 SELECT name, city, count(*) as qty 
 FROM stuff 
 GROUP BY name, city HAVING count(*)> 1

Something like this will do the trick. Don't know about performance, so do make some tests.

select
  id, name, city
from
  [stuff] s
where
1 < (select count(*) from [stuff] i where i.city = s.city and i.name = s.name)

Using count(*) over(partition by...) provides a simple and efficient means to locate unwanted repetition, whilst also list all affected rows and all wanted columns:

SELECT
    t.*
FROM (
    SELECT
        s.*
      , COUNT(*) OVER (PARTITION BY s.name, s.city) AS qty
    FROM stuff s
    ) t
WHERE t.qty > 1
ORDER BY t.name, t.city

While most recent RDBMS versions support count(*) over(partition by...) MySQL V 8.0 introduced "window functions", as seen below (in MySQL 8.0)

CREATE TABLE stuff(
   id   INTEGER  NOT NULL
  ,name VARCHAR(60) NOT NULL
  ,city VARCHAR(60) NOT NULL
);
INSERT INTO stuff(id,name,city) VALUES 
  (904834,'jim','London')
, (904835,'jim','London')
, (90145,'Fred','Paris')
, (90132,'Fred','Paris')
, (90133,'Fred','Paris')

, (923457,'Barney','New York') # not expected in result
;
SELECT
    t.*
FROM (
    SELECT
        s.*
      , COUNT(*) OVER (PARTITION BY s.name, s.city) AS qty
    FROM stuff s
    ) t
WHERE t.qty > 1
ORDER BY t.name, t.city
    id | name | city   | qty
-----: | :--- | :----- | --:
 90145 | Fred | Paris  |   3
 90132 | Fred | Paris  |   3
 90133 | Fred | Paris  |   3
904834 | jim  | London |   2
904835 | jim  | London |   2

db<>fiddle here

Window functions. MySQL now supports window functions that, for each row from a query, perform a calculation using rows related to that row. These include functions such as RANK(), LAG(), and NTILE(). In addition, several existing aggregate functions now can be used as window functions; for example, SUM() and AVG(). For more information, see Section 12.21, “Window Functions”.


Given a staging table with 70 columns and only 4 representing duplicates, this code will return the offending columns:

SELECT 
    COUNT(*)
    ,LTRIM(RTRIM(S.TransactionDate)) 
    ,LTRIM(RTRIM(S.TransactionTime))
    ,LTRIM(RTRIM(S.TransactionTicketNumber)) 
    ,LTRIM(RTRIM(GrossCost)) 
FROM Staging.dbo.Stage S
GROUP BY 
    LTRIM(RTRIM(S.TransactionDate)) 
    ,LTRIM(RTRIM(S.TransactionTime))
    ,LTRIM(RTRIM(S.TransactionTicketNumber)) 
    ,LTRIM(RTRIM(GrossCost)) 
HAVING COUNT(*) > 1

.


Examples related to sql

Passing multiple values for same variable in stored procedure SQL permissions for roles Generic XSLT Search and Replace template Access And/Or exclusions Pyspark: Filter dataframe based on multiple conditions Subtracting 1 day from a timestamp date PYODBC--Data source name not found and no default driver specified select rows in sql with latest date for each ID repeated multiple times ALTER TABLE DROP COLUMN failed because one or more objects access this column Create Local SQL Server database

Examples related to sql-server

Passing multiple values for same variable in stored procedure SQL permissions for roles Count the Number of Tables in a SQL Server Database Visual Studio 2017 does not have Business Intelligence Integration Services/Projects ALTER TABLE DROP COLUMN failed because one or more objects access this column Create Local SQL Server database How to create temp table using Create statement in SQL Server? SQL Query Where Date = Today Minus 7 Days How do I pass a list as a parameter in a stored procedure? SQL Server date format yyyymmdd

Examples related to sql-server-2008

Violation of PRIMARY KEY constraint. Cannot insert duplicate key in object How to Use Multiple Columns in Partition By And Ensure No Duplicate Row is Returned SQL Server : How to test if a string has only digit characters Conversion of a varchar data type to a datetime data type resulted in an out-of-range value in SQL query Get last 30 day records from today date in SQL Server How to subtract 30 days from the current date using SQL Server Calculate time difference in minutes in SQL Server SQL Connection Error: System.Data.SqlClient.SqlException (0x80131904) SQL Server Service not available in service list after installation of SQL Server Management Studio How to delete large data of table in SQL without log?

Examples related to duplicates

Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C Remove duplicates from a dataframe in PySpark How to "select distinct" across multiple data frame columns in pandas? How to find duplicate records in PostgreSQL Drop all duplicate rows across multiple columns in Python Pandas Left Join without duplicate rows from left table Finding duplicate integers in an array and display how many times they occurred How do I use SELECT GROUP BY in DataTable.Select(Expression)? How to delete duplicate rows in SQL Server? Python copy files to a new directory and rename if file name already exists