[sql] Select rows having 2 columns equal value

For question 1:

SELECT DISTINCT a.*
  FROM [Table] a
  INNER JOIN
  [Table] b
  ON
  a.C1 <> b.C1 AND a.C2 = b.C2 AND a.C3 = b.C3 AND a.C4 = b.C4

Using an inner join is much more efficient than a subquery because it requires fewer operations, and maintains the use of indexes when comparing the values, allowing the SQL server to better optimize the query before its run. Using appropriate indexes with this query can bring your query down to only n * log(n) rows to compare.

Using a subquery with your where clause or only doing a standard join where C1 does not equal C2 results in a table that has roughly 2 to the power of n rows to compare, where n is the number of rows in the table.

So by using proper indexing with an Inner Join, which only returns records which met the join criteria, we're able to drastically improve the performance. Also note that we return DISTINCT a.*, because this will only return the columns for table a where the join criteria was met. Returning * would return the columns for both a and b where the criteria was met, and not including DISTINCT would result in a duplicate of each row for each time that row row matched another row more than once.

A similar approach could also be performed using CROSS APPLY, which still uses a subquery, but makes use of indexes more efficiently.

An implementation with the keyword USING instead of ON could also work, but the syntax is more complicated to make work because your want to match on rows where C1 does not match, so you would need an additional where clause to filter out matching each row with itself. Also, USING is not compatible/allowed in conjunction with table values in all implementations of SQL, so it's best to stick with ON.

Similarly, for question 2:

SELECT DISTINCT a.*
  FROM [Table] a
  INNER JOIN
  [Table] b
  ON
  a.C1 <> b.C1 AND a.C4 = b.C4

This is essentially the same query as for 1, but because it only wants to know which rows match for C4, we only compare on the rows for C4.