[sql] Counting DISTINCT over multiple columns

Is there a better way of doing a query like this:

SELECT COUNT(*) 
FROM (SELECT DISTINCT DocumentId, DocumentSessionId
      FROM DocumentOutputItems) AS internalQuery

I need to count the number of distinct items from this table but the distinct is over two columns.

My query works fine but I was wondering if I can get the final result using just one query (without using a sub-query)

This question is related to sql sql-server performance tsql query-optimization

The answer is


Edit: Altered from the less-than-reliable checksum-only query I've discovered a way to do this (in SQL Server 2005) that works pretty well for me and I can use as many columns as I need (by adding them to the CHECKSUM() function). The REVERSE() function turns the ints into varchars to make the distinct more reliable

SELECT COUNT(DISTINCT (CHECKSUM(DocumentId,DocumentSessionId)) + CHECKSUM(REVERSE(DocumentId),REVERSE(DocumentSessionId)) )
FROM DocumentOutPutItems

I wish MS SQL could also do something like COUNT(DISTINCT A, B). But it can't.

At first JayTee's answer seemed like a solution to me bu after some tests CHECKSUM() failed to create unique values. A quick example is, both CHECKSUM(31,467,519) and CHECKSUM(69,1120,823) gives the same answer which is 55.

Then I made some research and found that Microsoft does NOT recommend using CHECKSUM for change detection purposes. In some forums some suggested using

SELECT COUNT(DISTINCT CHECKSUM(value1, value2, ..., valueN) + CHECKSUM(valueN, value(N-1), ..., value1))

but this is also not conforting.

You can use HASHBYTES() function as suggested in TSQL CHECKSUM conundrum. However this also has a small chance of not returning unique results.

I would suggest using

SELECT COUNT(DISTINCT CAST(DocumentId AS VARCHAR)+'-'+CAST(DocumentSessionId AS VARCHAR)) FROM DocumentOutputItems

How about this,

Select DocumentId, DocumentSessionId, count(*) as c 
from DocumentOutputItems 
group by DocumentId, DocumentSessionId;

This will get us the count of all possible combinations of DocumentId, and DocumentSessionId


It works for me. In oracle:

SELECT SUM(DECODE(COUNT(*),1,1,1))
FROM DocumentOutputItems GROUP BY DocumentId, DocumentSessionId;

In jpql:

SELECT SUM(CASE WHEN COUNT(i)=1 THEN 1 ELSE 1 END)
FROM DocumentOutputItems i GROUP BY i.DocumentId, i.DocumentSessionId;

This code uses distinct on 2 parameters and provides count of number of rows specific to those distinct values row count. It worked for me in MySQL like a charm.

select DISTINCT DocumentId as i,  DocumentSessionId as s , count(*) 
from DocumentOutputItems   
group by i ,s;

Hope this works i am writing on prima vista

SELECT COUNT(*) 
FROM DocumentOutputItems 
GROUP BY DocumentId, DocumentSessionId

I have used this approach and it has worked for me.

SELECT COUNT(DISTINCT DocumentID || DocumentSessionId) 
FROM  DocumentOutputItems

For my case, it provides correct result.


How about something like:

select count(*)
from
  (select count(*) cnt
   from DocumentOutputItems
   group by DocumentId, DocumentSessionId) t1

Probably just does the same as you are already though but it avoids the DISTINCT.


I had a similar question but the query I had was a sub-query with the comparison data in the main query. something like:

Select code, id, title, name 
(select count(distinct col1) from mytable where code = a.code and length(title) >0)
from mytable a
group by code, id, title, name
--needs distinct over col2 as well as col1

ignoring the complexities of this, I realized I couldn't get the value of a.code into the subquery with the double sub query described in the original question

Select count(1) from (select distinct col1, col2 from mytable where code = a.code...)
--this doesn't work because the sub-query doesn't know what "a" is

So eventually I figured out I could cheat, and combine the columns:

Select count(distinct(col1 || col2)) from mytable where code = a.code...

This is what ended up working


Many (most?) SQL databases can work with tuples like values so you can just do: SELECT COUNT(DISTINCT (DocumentId, DocumentSessionId)) FROM DocumentOutputItems; If your database doesn't support this, it can be simulated as per @oncel-umut-turer's suggestion of CHECKSUM or other scalar function providing good uniqueness e.g. COUNT(DISTINCT CONCAT(DocumentId, ':', DocumentSessionId)).

A related use of tuples is performing IN queries such as: SELECT * FROM DocumentOutputItems WHERE (DocumentId, DocumentSessionId) in (('a', '1'), ('b', '2'));


There's nothing wrong with your query, but you could also do it this way:

WITH internalQuery (Amount)
AS
(
    SELECT (0)
      FROM DocumentOutputItems
  GROUP BY DocumentId, DocumentSessionId
)
SELECT COUNT(*) AS NumberOfDistinctRows
  FROM internalQuery

You can just use the Count Function Twice.

In this case, it would be:

SELECT COUNT (DISTINCT DocumentId), COUNT (DISTINCT DocumentSessionId) 
FROM DocumentOutputItems

If you're working with datatypes of fixed length, you can cast to binary to do this very easily and very quickly. Assuming DocumentId and DocumentSessionId are both ints, and are therefore 4 bytes long...

SELECT COUNT(DISTINCT CAST(DocumentId as binary(4)) + CAST(DocumentSessionId as binary(4)))
FROM DocumentOutputItems

My specific problem required me to divide a SUM by the COUNT of the distinct combination of various foreign keys and a date field, grouping by another foreign key and occasionally filtering by certain values or keys. The table is very large, and using a sub-query dramatically increased the query time. And due to the complexity, statistics simply wasn't a viable option. The CHECKSUM solution was also far too slow in its conversion, particularly as a result of the various data types, and I couldn't risk its unreliability.

However, using the above solution had virtually no increase on the query time (comparing with using simply the SUM), and should be completely reliable! It should be able to help others in a similar situation so I'm posting it here.


I found this when I Googled for my own issue, found that if you count DISTINCT objects, you get the correct number returned (I'm using MySQL)

SELECT COUNT(DISTINCT DocumentID) AS Count1, 
  COUNT(DISTINCT DocumentSessionId) AS Count2
  FROM DocumentOutputItems

What is it about your existing query that you don't like? If you are concerned that DISTINCT across two columns does not return just the unique permutations why not try it?

It certainly works as you might expect in Oracle.

SQL> select distinct deptno, job from emp
  2  order by deptno, job
  3  /

    DEPTNO JOB
---------- ---------
        10 CLERK
        10 MANAGER
        10 PRESIDENT
        20 ANALYST
        20 CLERK
        20 MANAGER
        30 CLERK
        30 MANAGER
        30 SALESMAN

9 rows selected.


SQL> select count(*) from (
  2  select distinct deptno, job from emp
  3  )
  4  /

  COUNT(*)
----------
         9

SQL>

edit

I went down a blind alley with analytics but the answer was depressingly obvious...

SQL> select count(distinct concat(deptno,job)) from emp
  2  /

COUNT(DISTINCTCONCAT(DEPTNO,JOB))
---------------------------------
                                9

SQL>

edit 2

Given the following data the concatenating solution provided above will miscount:

col1  col2
----  ----
A     AA
AA    A

So we to include a separator...

select col1 + '*' + col2 from t23
/

Obviously the chosen separator must be a character, or set of characters, which can never appear in either column.


if you had only one field to "DISTINCT", you could use:

SELECT COUNT(DISTINCT DocumentId) 
FROM DocumentOutputItems

and that does return the same query plan as the original, as tested with SET SHOWPLAN_ALL ON. However you are using two fields so you could try something crazy like:

    SELECT COUNT(DISTINCT convert(varchar(15),DocumentId)+'|~|'+convert(varchar(15), DocumentSessionId)) 
    FROM DocumentOutputItems

but you'll have issues if NULLs are involved. I'd just stick with the original query.


To run as a single query, concatenate the columns, then get the distinct count of instances of the concatenated string.

SELECT count(DISTINCT concat(DocumentId, DocumentSessionId)) FROM DocumentOutputItems;

In MySQL you can do the same thing without the concatenation step as follows:

SELECT count(DISTINCT DocumentId, DocumentSessionId) FROM DocumentOutputItems;

This feature is mentioned in the MySQL documentation:

http://dev.mysql.com/doc/refman/5.7/en/group-by-functions.html#function_count-distinct


Here's a shorter version without the subselect:

SELECT COUNT(DISTINCT DocumentId, DocumentSessionId) FROM DocumentOutputItems

It works fine in MySQL, and I think that the optimizer has an easier time understanding this one.

Edit: Apparently I misread MSSQL and MySQL - sorry about that, but maybe it helps anyway.


Examples related to sql

Passing multiple values for same variable in stored procedure SQL permissions for roles Generic XSLT Search and Replace template Access And/Or exclusions Pyspark: Filter dataframe based on multiple conditions Subtracting 1 day from a timestamp date PYODBC--Data source name not found and no default driver specified select rows in sql with latest date for each ID repeated multiple times ALTER TABLE DROP COLUMN failed because one or more objects access this column Create Local SQL Server database

Examples related to sql-server

Passing multiple values for same variable in stored procedure SQL permissions for roles Count the Number of Tables in a SQL Server Database Visual Studio 2017 does not have Business Intelligence Integration Services/Projects ALTER TABLE DROP COLUMN failed because one or more objects access this column Create Local SQL Server database How to create temp table using Create statement in SQL Server? SQL Query Where Date = Today Minus 7 Days How do I pass a list as a parameter in a stored procedure? SQL Server date format yyyymmdd

Examples related to performance

Why is 2 * (i * i) faster than 2 * i * i in Java? What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism? How to check if a key exists in Json Object and get its value Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly? Most efficient way to map function over numpy array The most efficient way to remove first N elements in a list? Fastest way to get the first n elements of a List into an Array Why is "1000000000000000 in range(1000000000000001)" so fast in Python 3? pandas loc vs. iloc vs. at vs. iat? Android Recyclerview vs ListView with Viewholder

Examples related to tsql

Passing multiple values for same variable in stored procedure Count the Number of Tables in a SQL Server Database Change Date Format(DD/MM/YYYY) in SQL SELECT Statement Stored procedure with default parameters Format number as percent in MS SQL Server EXEC sp_executesql with multiple parameters SQL Server after update trigger How to compare datetime with only date in SQL Server Text was truncated or one or more characters had no match in the target code page including the primary key in an unpivot Printing integer variable and string on same line in SQL

Examples related to query-optimization

How to do the Recursive SELECT query in MySQL? What is the optimal way to compare dates in Microsoft SQL server? How to find out what is locking my tables? Finding blocking/locking queries in MS SQL (mssql) SQL: How to properly check if a record exists Counting DISTINCT over multiple columns JOIN queries vs multiple queries PostgreSQL - fetch the row which has the Max value for a column mysql select from n last rows