Why do multiple-table joins produce duplicate rows

Question

Let s say I have three tables A  B  and C  Each has two columns  a primary key and some other piece of data  They each have the same number of rows  If I JOIN A and B on the primary key  I should end up with the same number of rows as are in either of them  as opposed to A rows   B rows    Now  if I JOIN A JOIN B with C  why do I end up with duplicate rows  I have run into this problem on several occasions and I do not understand it  It seems like it should produce the same result as JOINing A and B since it has the same number of rows but  instead  duplicates are produced   Queries that produce results like this are of the format  SELECT   FROM M     INNER JOIN S         on M mIndex   S mIndex     INNER JOIN D         ON M platformId LIKE       D version           INNER JOIN H         ON D Name   H Name         AND D revision   H revision   Here are schemas for the tables  H contains is a historic table containing everything that was ever in D  There are many M rows for each D and one S for each M   Table M       mIndex   int  NOT NULL PRIMARY KEY       platformId   nvarchar  256  NULL       ip   nvarchar  64  NULL       complete   bit  NOT NULL       date   datetime  NOT NULL       DeployId   int  NOT NULL PRIMARY KEY REFERENCES D DeployId       source   nvarchar  64  NOT NULL PRIMARY KEY   Table S   order   int  NOT NULL PRIMARY KEY   name   nvarchar  64  NOT NULL   parameters   nvarchar  256  NOT NULL   Finished   bit  NOT NULL   mIndex   int  NOT NULL PRIMARY KEY   mDeployId   int  NOT NULL PRIMARY KEY   Date   datetime  NULL   status   nvarchar  10  NULL   output   nvarchar  max  NULL   config   nvarchar  64  NOT NULL PRIMARY KEY   Table D   Id   int  IDENTITY 1 1  NOT NULL PRIMARY KEY   branch   nvarchar  64  NOT NULL   revision   int  NOT NULL   version   nvarchar  64  NOT NULL   path   nvarchar  256  NOT NULL   Table H   IdDeploy   int  IDENTITY 1 1  NOT NULL   name   nvarchar  64  NOT NULL   version   nvarchar  64  NOT NULL   path   nvarchar  max  NOT NULL   StartDate   datetime  NOT NULL   EndDate   datetime  NULL   Revision   nvarchar  64  NULL    I didn t post the tables and query initially because I am more interested in understanding this problem for myself and avoiding it in the future

User · Answer

When you have related tables you often have one-to-many or many-to-many relationships. So when you join to TableB each record in TableA many have multiple records in TableB. This is normal and expected.

Now at times you only need certain columns and those are all the same for all the records, then you would need to do some sort of group by or distinct to remove the duplicates. Let's look at an example:

TableA
Id Field1
1  test
2  another test

TableB
ID Field2 field3
1  Test1  something
1  test1  More something
2  Test2  Anything

So when you join them and select all the files you get:

select * 
from tableA a 
join tableb b on a.id = b.id

a.Id a.Field1        b.id   b.field2  b.field3
1    test            1      Test1     something
1    test            1      Test1     More something
2    another test 2  2      Test2     Anything

These are not duplicates because the values of Field3 are different even though there are repeated values in the earlier fields. Now when you only select certain columns the same number of records are being joined together but since the columns with the different information is not being displayed they look like duplicates.

select a.Id, a.Field1,  b.field2
from tableA a 
join tableb b on a.id = b.id

a.Id a.Field1       b.field2  
1    test           Test1     
1    test           Test1 
2    another test   Test2

This appears to be duplicates but it is not because of the multiple records in TableB.

You normally fix this by using aggregates and group by, by using distinct or by filtering in the where clause to remove duplicates. How you solve this depends on exactly what your business rule is and how your database is designed and what kind of data is in there.

User · Answer

Ok in this example you are getting duplicates because you are joining both D and S onto M  I assume you should be joining D id onto S id like below   SELECT   FROM M INNER JOIN S     on M Id   S Id INNER JOIN D     ON S Id   D Id INNER JOIN H     ON D Id   H Id

User · Answer

This might sound like a really basic  DUH  answer  but make sure that the column you re using to Lookup from on the merging file is actually full of unique values   I noticed earlier today that PowerQuery won t throw you an error  like in PowerPivot  and will happily allow you to run a Many-Many merge   This will result in multiple rows being produced for each record that matches with a non-unique value

User · Answer

If one of the tables M  S  D  or H has more than one row for a given Id  if just the Id column is not the Primary Key   then the query would result in  duplicate  rows  If you have more than one row for an Id in a table  then the other columns  which would uniquely identify a row  also must be included in the JOIN condition s    References   Related Question on MSDN Forum

[sql] Why do multiple-table joins produce duplicate rows?

Examples related to sql

Examples related to join