Counting DISTINCT over multiple columns

Question

Is there a better way of doing a query like this:

SELECT COUNT(*)
FROM (SELECT DISTINCT DocumentId, DocumentSessionId
      FROM DocumentOutputItems) AS internalQuery

I need to count the number of distinct items from this table, but the distinct is over two columns.

My query works fine, but I was wondering if I can get the final result using just one query (without using a sub-query).

User · Answer

What is it about your existing query that you don't like? If you are concerned that DISTINCT across two columns does not return just the unique permutations why not try it?

It certainly works as you might expect in Oracle.

SQL> select distinct deptno, job from emp
  2  order by deptno, job
  3  /

    DEPTNO JOB
---------- ---------
        10 CLERK
        10 MANAGER
        10 PRESIDENT
        20 ANALYST
        20 CLERK
        20 MANAGER
        30 CLERK
        30 MANAGER
        30 SALESMAN

9 rows selected.


SQL> select count(*) from (
  2  select distinct deptno, job from emp
  3  )
  4  /

  COUNT(*)
----------
         9

SQL>

edit

I went down a blind alley with analytics but the answer was depressingly obvious...

SQL> select count(distinct concat(deptno,job)) from emp
  2  /

COUNT(DISTINCTCONCAT(DEPTNO,JOB))
---------------------------------
                                9

SQL>

edit 2

Given the following data the concatenating solution provided above will miscount:

col1  col2
----  ----
A     AA
AA    A

So we need to include a separator...

select count(distinct col1 || '*' || col2) from t23
/

Obviously the chosen separator must be a character, or set of characters, which can never appear in either column.
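
The miscount is easy to reproduce. The following is a minimal sketch using Python's built-in sqlite3 module rather than Oracle (an assumption made purely so the example is self-contained); the table name t23 and the two rows come from the example above.

```python
import sqlite3

# In-memory SQLite database standing in for the Oracle table t23.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t23 (col1 TEXT, col2 TEXT)")
conn.executemany("INSERT INTO t23 VALUES (?, ?)", [("A", "AA"), ("AA", "A")])

# Plain concatenation: both rows become 'AAA', so they look identical.
naive = conn.execute(
    "SELECT COUNT(DISTINCT col1 || col2) FROM t23"
).fetchone()[0]

# With a separator the keys are 'A*AA' and 'AA*A', which stay distinct.
separated = conn.execute(
    "SELECT COUNT(DISTINCT col1 || '*' || col2) FROM t23"
).fetchone()[0]

# Reference answer from the sub-query form used in the question.
reference = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT col1, col2 FROM t23)"
).fetchone()[0]

print(naive, separated, reference)  # 1 2 2
```

The separator only helps as long as '*' never occurs in the data, which is the caveat stated above.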

User · Answer

I had a similar question, but the query I had was a sub-query with the comparison data in the main query, something like:

Select code, id, title, name,
(select count(distinct col1) from mytable where code = a.code and length(title) > 0)
from mytable a
group by code, id, title, name
--needs distinct over col2 as well as col1

Ignoring the complexities of this, I realized I couldn't get the value of a.code into the sub-query with the double sub-query described in the original question:

Select count(1) from (select distinct col1, col2 from mytable where code = a.code)
--this doesn't work because the sub-query doesn't know what "a" is

So eventually I figured out I could cheat and combine the columns:

Select count(distinct(col1 || col2)) from mytable where code = a.code...

This is what ended up working.

User · Answer

How about something like:

select count(*)
from
  (select count(*) cnt
   from DocumentOutputItems
   group by DocumentId, DocumentSessionId) t1

It probably just does the same as what you already have, but it avoids the DISTINCT.
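
As a quick check that the GROUP BY form and the DISTINCT form agree, here is a small sketch using Python's sqlite3 module (SQLite is an assumption made for portability; the sample rows are invented for illustration).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentOutputItems (DocumentId INTEGER, DocumentSessionId INTEGER)"
)
# Hypothetical sample data: five rows, three distinct pairs.
conn.executemany(
    "INSERT INTO DocumentOutputItems VALUES (?, ?)",
    [(1, 10), (1, 10), (1, 11), (2, 10), (2, 10)],
)

# Count the groups produced by GROUP BY (one group per distinct pair).
group_by_count = conn.execute(
    """
    SELECT COUNT(*)
    FROM (SELECT COUNT(*) AS cnt
          FROM DocumentOutputItems
          GROUP BY DocumentId, DocumentSessionId) t1
    """
).fetchone()[0]

# The original query from the question, for comparison.
distinct_count = conn.execute(
    """
    SELECT COUNT(*)
    FROM (SELECT DISTINCT DocumentId, DocumentSessionId
          FROM DocumentOutputItems) AS internalQuery
    """
).fetchone()[0]

print(group_by_count, distinct_count)  # 3 3
```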

User · Answer

If you had only one field to "DISTINCT", you could use:

SELECT COUNT(DISTINCT DocumentId)
FROM DocumentOutputItems

and that does return the same query plan as the original, as tested with SET SHOWPLAN_ALL ON. However, you are using two fields, so you could try something crazy like:

SELECT COUNT(DISTINCT convert(varchar(15), DocumentId) + '|~|' + convert(varchar(15), DocumentSessionId))
FROM DocumentOutputItems

but you'll have issues if NULLs are involved. I'd just stick with the original query.
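
To illustrate the NULL issue, here is a hedged sketch in Python's sqlite3 module. SQLite syntax is an assumption here: it uses `||` and CAST where SQL Server uses `+` and convert, but in both a concatenation involving NULL yields NULL under default settings, and COUNT(DISTINCT ...) ignores NULLs. The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentOutputItems (DocumentId INTEGER, DocumentSessionId INTEGER)"
)
# Hypothetical rows; two of the three distinct pairs contain a NULL.
conn.executemany(
    "INSERT INTO DocumentOutputItems VALUES (?, ?)",
    [(1, 10), (1, 10), (1, None), (2, None)],
)

# Concatenated key: any row with a NULL column produces a NULL key,
# and COUNT(DISTINCT ...) skips NULL keys entirely.
concat_count = conn.execute(
    """
    SELECT COUNT(DISTINCT CAST(DocumentId AS TEXT) || '|~|' ||
                          CAST(DocumentSessionId AS TEXT))
    FROM DocumentOutputItems
    """
).fetchone()[0]

# The sub-query form treats (1, NULL) and (2, NULL) as rows of their own.
subquery_count = conn.execute(
    """
    SELECT COUNT(*)
    FROM (SELECT DISTINCT DocumentId, DocumentSessionId
          FROM DocumentOutputItems) AS internalQuery
    """
).fetchone()[0]

print(concat_count, subquery_count)  # 1 3
```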

User · Answer

This code applies DISTINCT to two columns and returns, for each distinct pair of values, the number of rows that have it. It worked like a charm for me in MySQL.

select DISTINCT DocumentId as i, DocumentSessionId as s, count(*)
from DocumentOutputItems
group by i, s

User · Answer

Here's a shorter version without the subselect:

SELECT COUNT(DISTINCT DocumentId, DocumentSessionId) FROM DocumentOutputItems

It works fine in MySQL, and I think that the optimizer has an easier time understanding this one.

Edit: Apparently I misread MSSQL and MySQL - sorry about that, but maybe it helps anyway.

User · Answer

If you're working with datatypes of fixed length, you can cast to binary to do this very easily and very quickly. Assuming DocumentId and DocumentSessionId are both ints, and are therefore 4 bytes long:

SELECT COUNT(DISTINCT CAST(DocumentId as binary(4)) + CAST(DocumentSessionId as binary(4)))
FROM DocumentOutputItems

My specific problem required me to divide a SUM by the COUNT of the distinct combination of various foreign keys and a date field, grouping by another foreign key and occasionally filtering by certain values or keys. The table is very large, and using a sub-query dramatically increased the query time. And due to the complexity, statistics simply weren't a viable option. The CHECKSUM solution was also far too slow in its conversion, particularly as a result of the various data types, and I couldn't risk its unreliability.

However, using the above solution caused virtually no increase in the query time (compared with using simply the SUM), and should be completely reliable. It should be able to help others in a similar situation, so I'm posting it here.
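
Why fixed-width binary keys avoid the ambiguity of text concatenation can be sketched outside the database. The snippet below is only an analogy in plain Python: `struct.pack` stands in for the CAST to binary(4), and the sample pairs are invented.

```python
import struct

# Hypothetical (DocumentId, DocumentSessionId) pairs; two distinct pairs.
pairs = [(1, 23), (12, 3), (1, 23)]

# Variable-width text keys: "1" + "23" and "12" + "3" both give "123".
text_keys = {f"{doc_id}{session_id}" for doc_id, session_id in pairs}

# Fixed-width keys: each int always occupies exactly 4 bytes, so the
# boundary between the two values is never in doubt.
binary_keys = {struct.pack(">ii", doc_id, session_id) for doc_id, session_id in pairs}

print(len(text_keys), len(binary_keys), len(set(pairs)))  # 1 2 2
```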

User · Answer

Edit: Altered from the less-than-reliable checksum-only query.

I've discovered a way to do this (in SQL Server 2005) that works pretty well for me, and I can use as many columns as I need (by adding them to the CHECKSUM() function). The REVERSE() function turns the ints into varchars to make the distinct more reliable.

SELECT COUNT(DISTINCT (CHECKSUM(DocumentId, DocumentSessionId)) + CHECKSUM(REVERSE(DocumentId), REVERSE(DocumentSessionId)))
FROM DocumentOutPutItems

User · Answer

I have used this approach and it has worked for me.

SELECT COUNT(DISTINCT DocumentID || DocumentSessionId)
FROM DocumentOutputItems

In my case, it gives the correct result.

User · Answer

Many (most?) SQL databases can work with tuples like values, so you can just do:

SELECT COUNT(DISTINCT (DocumentId, DocumentSessionId))
FROM DocumentOutputItems;

If your database doesn't support this, it can be simulated as per @oncel-umut-turer's suggestion of CHECKSUM or another scalar function providing good uniqueness, e.g. COUNT(DISTINCT CONCAT(DocumentId, ':', DocumentSessionId)).

A related use of tuples is performing IN queries such as:

SELECT * FROM DocumentOutputItems
WHERE (DocumentId, DocumentSessionId) in (('a', '1'), ('b', '2'));
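
Tuple support varies by engine. As one hedged data point, the sketch below runs the tuple IN query on SQLite through Python's sqlite3 module. Two assumptions apply: SQLite needs version 3.15 or later for row values, and it expects the right-hand side as a VALUES list or sub-query rather than the bare nested-parentheses form. The sample rows are invented, and the sketch covers only the IN query, not COUNT(DISTINCT (a, b)).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentOutputItems (DocumentId TEXT, DocumentSessionId TEXT)"
)
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO DocumentOutputItems VALUES (?, ?)",
    [("a", "1"), ("a", "2"), ("b", "2"), ("c", "3")],
)

# Row-value comparison: a row matches only if both columns match the
# same tuple on the right-hand side.
matches = conn.execute(
    """
    SELECT DocumentId, DocumentSessionId
    FROM DocumentOutputItems
    WHERE (DocumentId, DocumentSessionId) IN (VALUES ('a', '1'), ('b', '2'))
    ORDER BY DocumentId
    """
).fetchall()

print(matches)  # [('a', '1'), ('b', '2')]
```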

User · Answer

Hope this works; I am writing it at first sight:

SELECT COUNT(*)
FROM DocumentOutputItems
GROUP BY DocumentId, DocumentSessionId

User · Answer

You can just use the COUNT function twice.

In this case, it would be:

SELECT COUNT(DISTINCT DocumentId), COUNT(DISTINCT DocumentSessionId)
FROM DocumentOutputItems
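
It may help to compare what two separate counts return against the number of distinct pairs. This is a minimal sketch with Python's sqlite3 module and made-up data, not a statement about any particular production table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentOutputItems (DocumentId INTEGER, DocumentSessionId INTEGER)"
)
# Hypothetical data: 2 document ids x 2 session ids = 4 distinct pairs.
conn.executemany(
    "INSERT INTO DocumentOutputItems VALUES (?, ?)",
    [(1, 10), (1, 11), (2, 10), (2, 11)],
)

# Two independent per-column counts.
per_column = conn.execute(
    """
    SELECT COUNT(DISTINCT DocumentId), COUNT(DISTINCT DocumentSessionId)
    FROM DocumentOutputItems
    """
).fetchone()

# Number of distinct (DocumentId, DocumentSessionId) pairs.
pair_count = conn.execute(
    """
    SELECT COUNT(*)
    FROM (SELECT DISTINCT DocumentId, DocumentSessionId
          FROM DocumentOutputItems) AS internalQuery
    """
).fetchone()[0]

print(per_column, pair_count)  # (2, 2) 4
```

On this data the per-column counts are 2 and 2 while the pair count is 4, so the two queries answer different questions.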

User · Answer

I wish MS SQL could also do something like COUNT(DISTINCT A, B). But it can't.

At first JayTee's answer seemed like a solution to me, but after some tests CHECKSUM() failed to create unique values. A quick example: both CHECKSUM(31,467,519) and CHECKSUM(69,1120,823) give the same answer, which is 55.

Then I did some research and found that Microsoft does NOT recommend using CHECKSUM for change detection purposes. In some forums, some suggested using

SELECT COUNT(DISTINCT CHECKSUM(value1, value2, ..., valueN) + CHECKSUM(valueN, valueN-1, ..., value1))

but this is also not comforting.

You can use the HASHBYTES() function, as suggested in "TSQL CHECKSUM conundrum". However, this also has a small chance of not returning unique results.

I would suggest using

SELECT COUNT(DISTINCT CAST(DocumentId AS VARCHAR) + '-' + CAST(DocumentSessionId AS VARCHAR))
FROM DocumentOutputItems

User · Answer

How about this:

Select DocumentId, DocumentSessionId, count(*) as c
from DocumentOutputItems
group by DocumentId, DocumentSessionId;

This returns one row for each existing combination of DocumentId and DocumentSessionId, together with the number of rows that have it.
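
A small sketch of what this query returns, using Python's sqlite3 module with invented rows (SQLite is an assumption chosen so the example runs anywhere). The number of result rows is the number of distinct pairs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentOutputItems (DocumentId INTEGER, DocumentSessionId INTEGER)"
)
# Hypothetical sample data: five rows, three distinct pairs.
conn.executemany(
    "INSERT INTO DocumentOutputItems VALUES (?, ?)",
    [(1, 10), (1, 10), (1, 11), (2, 10), (2, 10)],
)

# One result row per combination, with its row count in column c.
per_pair = conn.execute(
    """
    SELECT DocumentId, DocumentSessionId, COUNT(*) AS c
    FROM DocumentOutputItems
    GROUP BY DocumentId, DocumentSessionId
    ORDER BY DocumentId, DocumentSessionId
    """
).fetchall()

print(per_pair)       # [(1, 10, 2), (1, 11, 1), (2, 10, 2)]
print(len(per_pair))  # 3, the number of distinct pairs
```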

User · Answer

I found this when I Googled for my own issue, and found that if you count DISTINCT objects, you get the correct number returned (I'm using MySQL):

SELECT COUNT(DISTINCT DocumentID) AS Count1,
  COUNT(DISTINCT DocumentSessionId) AS Count2
FROM DocumentOutputItems

User · Answer

There's nothing wrong with your query, but you could also do it this way:

WITH internalQuery (Amount)
AS
(
    SELECT (0)
      FROM DocumentOutputItems
  GROUP BY DocumentId, DocumentSessionId
)
SELECT COUNT(*) AS NumberOfDistinctRows
  FROM internalQuery
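
The CTE form can be tried out as follows. This is a sketch under the assumption that SQLite (via Python's sqlite3 module) is an acceptable stand-in for SQL Server here; the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentOutputItems (DocumentId INTEGER, DocumentSessionId INTEGER)"
)
# Hypothetical sample data: five rows, three distinct pairs.
conn.executemany(
    "INSERT INTO DocumentOutputItems VALUES (?, ?)",
    [(1, 10), (1, 10), (1, 11), (2, 10), (2, 10)],
)

# The CTE emits one dummy row (0) per group; counting those rows
# gives the number of distinct pairs.
number_of_distinct_rows = conn.execute(
    """
    WITH internalQuery (Amount) AS (
        SELECT 0
        FROM DocumentOutputItems
        GROUP BY DocumentId, DocumentSessionId
    )
    SELECT COUNT(*) AS NumberOfDistinctRows
    FROM internalQuery
    """
).fetchone()[0]

print(number_of_distinct_rows)  # 3
```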

User · Answer

To run as a single query, concatenate the columns, then get the distinct count of instances of the concatenated string.

SELECT count(DISTINCT concat(DocumentId, DocumentSessionId)) FROM DocumentOutputItems;

In MySQL you can do the same thing without the concatenation step as follows:

SELECT count(DISTINCT DocumentId, DocumentSessionId) FROM DocumentOutputItems;

This feature is mentioned in the MySQL documentation:

http://dev.mysql.com/doc/refman/5.7/en/group-by-functions.html#function_count-distinct

User · Answer

If you are trying to improve performance, you could try creating a persisted computed column on either a hash or concatenated value of the two columns.

Once it is persisted, provided the column is deterministic and you are using "sane" database settings, it can be indexed and/or statistics can be created on it.

I believe a distinct count of the computed column would be equivalent to your query.

User · Answer

It works for me. In Oracle:

SELECT SUM(DECODE(COUNT(*),1,1,1))
FROM DocumentOutputItems GROUP BY DocumentId, DocumentSessionId;

In JPQL:

SELECT SUM(CASE WHEN COUNT(i)=1 THEN 1 ELSE 1 END)
FROM DocumentOutputItems i GROUP BY i.DocumentId, i.DocumentSessionId;
