[sql] How to find duplicate records in PostgreSQL

I have a PostgreSQL database table called "user_links" which currently allows the following duplicate fields:

year, user_id, sid, cid

The unique constraint is currently the first field called "id", however I am now looking to add a constraint to make sure the year, user_id, sid and cid are all unique but I cannot apply the constraint because duplicate values already exist which violate this constraint.

Is there a way to find all duplicates?

This question is related to sql postgresql duplicates

The answer is


The basic idea will be using a nested query with count aggregation:

select * from yourTable ou
where (select count(*) from yourTable inr
where inr.sid = ou.sid) > 1

You can adjust the where clause in the inner query to narrow the search.


There is another good solution for that mentioned in the comments, (but not everyone reads them):

select Column1, Column2, count(*)
from yourTable
group by Column1, Column2
HAVING count(*) > 1

Or shorter:

SELECT (yourTable.*)::text, count(*)
FROM yourTable
GROUP BY yourTable.*
HAVING count(*) > 1

In order to make it easier I assume that you wish to apply a unique constraint only for column year and the primary key is a column named id.

In order to find duplicate values you should run,

SELECT year, COUNT(id)
FROM YOUR_TABLE
GROUP BY year
HAVING COUNT(id) > 1
ORDER BY COUNT(id);

Using the sql statement above you get a table which contains all the duplicate years in your table. In order to delete all the duplicates except of the the latest duplicate entry you should use the above sql statement.

DELETE
FROM YOUR_TABLE A USING YOUR_TABLE_AGAIN B
WHERE A.year=B.year AND A.id<B.id;

In your case, because of the constraint you need to delete the duplicated records.

  1. Find the duplicated rows
  2. Organize them by created_at date - in this case I'm keeping the oldest
  3. Delete the records with USING to filter the right rows
WITH duplicated AS ( 
    SELECT id,
        count(*) 
    FROM products 
    GROUP BY id 
    HAVING count(*) > 1), 
ordered AS ( 
    SELECT p.id, 
        created_at, 
        rank() OVER (partition BY p.id ORDER BY p.created_at) AS rnk 
    FROM products o 
    JOIN     duplicated d ON d.id = p.id ), 
products_to_delete AS ( 
    SELECT id, 
        created_at 
    FROM   ordered 
    WHERE  rnk = 2
) 
DELETE 
FROM products 
USING products_to_delete 
WHERE products.id = products_to_delete.id 
    AND products.created_at = products_to_delete.created_at;

From "Find duplicate rows with PostgreSQL" here's smart solution:

select * from (
  SELECT id,
  ROW_NUMBER() OVER(PARTITION BY column1, column2 ORDER BY id asc) AS Row
  FROM tbl
) dups
where 
dups.Row > 1

You can join to the same table on the fields that would be duplicated and then anti-join on the id field. Select the id field from the first table alias (tn1) and then use the array_agg function on the id field of the second table alias. Finally, for the array_agg function to work properly, you will group the results by the tn1.id field. This will produce a result set that contains the the id of a record and an array of all the id's that fit the join conditions.

select tn1.id,
       array_agg(tn2.id) as duplicate_entries, 
from table_name tn1 join table_name tn2 on 
    tn1.year = tn2.year 
    and tn1.sid = tn2.sid 
    and tn1.user_id = tn2.user_id 
    and tn1.cid = tn2.cid
    and tn1.id <> tn2.id
group by tn1.id;

Obviously, id's that will be in the duplicate_entries array for one id, will also have their own entries in the result set. You will have to use this result set to decide which id you want to become the source of 'truth.' The one record that shouldn't get deleted. Maybe you could do something like this:

with dupe_set as (
select tn1.id,
       array_agg(tn2.id) as duplicate_entries, 
from table_name tn1 join table_name tn2 on 
    tn1.year = tn2.year 
    and tn1.sid = tn2.sid 
    and tn1.user_id = tn2.user_id 
    and tn1.cid = tn2.cid
    and tn1.id <> tn2.id
group by tn1.id
order by tn1.id asc)
select ds.id from dupe_set ds where not exists 
 (select de from unnest(ds.duplicate_entries) as de where de < ds.id)

Selects the lowest number ID's that have duplicates (assuming the ID is increasing int PK). These would be the ID's that you would keep around.


Examples related to sql

Passing multiple values for same variable in stored procedure SQL permissions for roles Generic XSLT Search and Replace template Access And/Or exclusions Pyspark: Filter dataframe based on multiple conditions Subtracting 1 day from a timestamp date PYODBC--Data source name not found and no default driver specified select rows in sql with latest date for each ID repeated multiple times ALTER TABLE DROP COLUMN failed because one or more objects access this column Create Local SQL Server database

Examples related to postgresql

Subtracting 1 day from a timestamp date pgadmin4 : postgresql application server could not be contacted. Psql could not connect to server: No such file or directory, 5432 error? How to persist data in a dockerized postgres database using volumes input file appears to be a text format dump. Please use psql Postgres: check if array field contains value? Add timestamp column with default NOW() for new rows only Can't connect to Postgresql on port 5432 How to insert current datetime in postgresql insert query Connecting to Postgresql in a docker container from outside

Examples related to duplicates

Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C Remove duplicates from a dataframe in PySpark How to "select distinct" across multiple data frame columns in pandas? How to find duplicate records in PostgreSQL Drop all duplicate rows across multiple columns in Python Pandas Left Join without duplicate rows from left table Finding duplicate integers in an array and display how many times they occurred How do I use SELECT GROUP BY in DataTable.Select(Expression)? How to delete duplicate rows in SQL Server? Python copy files to a new directory and rename if file name already exists