Huge performance difference when using group by vs distinct

Question

I am performing some tests on an HSQLDB server with a table containing 500,000 entries. The table has no indices. There are 5,000 distinct business keys, and I need a list of them. Naturally, I started with a DISTINCT query:

    SELECT DISTINCT business_key FROM memory WHERE
        concept <> 'case' or
        attrib <> 'status' or
        value <> 'closed'

It takes around 90 seconds.

Then I tried using GROUP BY:

    SELECT business_key FROM memory WHERE
        concept <> 'case' or
        attrib <> 'status' or
        value <> 'closed'
    GROUP BY business_key

And it takes 1 second.

Trying to figure out the difference, I ran EXPLAIN PLAN FOR, but it seems to give the same information for both queries.

EXPLAIN PLAN FOR the DISTINCT query:

    isAggregated=[false]
    columns=[
      COLUMN: PUBLIC.MEMORY.BUSINESS_KEY
    ]
    [range variable 1
      join type=INNER
      table=MEMORY
      alias=M
      access=FULL SCAN
      condition = [    index=SYS_IDX_SYS_PK_10057_10058
        other condition=[
        OR arg_left=[
         OR arg_left=[
          NOT_EQUAL arg_left=[
           COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[
           VALUE = case, TYPE = CHARACTER]] arg_right=[
          NOT_EQUAL arg_left=[
           COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[
           VALUE = status, TYPE = CHARACTER]]] arg_right=[
         NOT_EQUAL arg_left=[
          COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[
          VALUE = closed, TYPE = CHARACTER]]]
      ]
    ]]
    PARAMETERS=[]
    SUBQUERIES[]
    Object References
    PUBLIC.MEMORY
    PUBLIC.MEMORY.CONCEPT
    PUBLIC.MEMORY.ATTRIB
    PUBLIC.MEMORY.VALUE
    PUBLIC.MEMORY.BUSINESS_KEY
    Read Locks
    PUBLIC.MEMORY
    WriteLocks

EXPLAIN PLAN FOR the SELECT ... GROUP BY query:

    isDistinctSelect=[false]
    isGrouped=[true]
    isAggregated=[false]
    columns=[
      COLUMN: PUBLIC.MEMORY.BUSINESS_KEY
    ]
    [range variable 1
      join type=INNER
      table=MEMORY
      alias=M
      access=FULL SCAN
      condition = [    index=SYS_IDX_SYS_PK_10057_10058
        other condition=[
        OR arg_left=[
         OR arg_left=[
          NOT_EQUAL arg_left=[
           COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[
           VALUE = case, TYPE = CHARACTER]] arg_right=[
          NOT_EQUAL arg_left=[
           COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[
           VALUE = status, TYPE = CHARACTER]]] arg_right=[
         NOT_EQUAL arg_left=[
          COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[
          VALUE = closed, TYPE = CHARACTER]]]
      ]
    ]]
    groupColumns=[
    COLUMN: PUBLIC.MEMORY.BUSINESS_KEY]
    PARAMETERS=[]
    SUBQUERIES[]
    Object References
    PUBLIC.MEMORY
    PUBLIC.MEMORY.CONCEPT
    PUBLIC.MEMORY.ATTRIB
    PUBLIC.MEMORY.VALUE
    PUBLIC.MEMORY.BUSINESS_KEY
    Read Locks
    PUBLIC.MEMORY
    WriteLocks

EDIT: I did additional tests. With 500,000 records in HSQLDB, all with distinct business keys, DISTINCT now performs better: 3 seconds, vs GROUP BY, which took around 9 seconds.

In MySQL both queries perform the same:

MySQL, 500,000 rows, 5,000 distinct business keys:
    Both queries: 0.5 second

MySQL, 500,000 rows, all distinct business keys:
    SELECT DISTINCT ...: 11 seconds
    SELECT ... GROUP BY business_key: 13 seconds

So the problem is only related to HSQLDB.

I will be very grateful if someone can explain why there is such a drastic difference.

User · Accepted Answer

The two queries express the same question. Apparently the query optimizer chooses two different execution plans. My guess would be that the distinct approach is executed like:

1. Copy all business_key values to a temporary table
2. Sort the temporary table
3. Scan the temporary table, returning each item that is different from the one before it
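
The sort-based strategy guessed at above can be sketched in a few lines of Python. This is only an illustration of the general technique, not HSQLDB's actual implementation; the function name `distinct_by_sorting` and the sample keys are made up for the example.

```python
def distinct_by_sorting(values):
    """Return the distinct values by sorting, then dropping adjacent repeats."""
    # Copy the column into a temporary list and sort it (cost: O(n log n)).
    temp = sorted(values)
    result = []
    # Scan the sorted copy, keeping each item that differs from the one before it.
    for item in temp:
        if not result or item != result[-1]:
            result.append(item)
    return result


# Hypothetical sample data standing in for the business_key column.
keys = ["K3", "K1", "K3", "K2", "K1"]
print(distinct_by_sorting(keys))  # ['K1', 'K2', 'K3']
```

Note that the sort processes every row, duplicates included, so the cost grows with the total row count rather than with the number of distinct keys.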

The group by could be executed like:

1. Scan the full table, storing each value of business_key in a hash table
2. Return the keys of the hash table
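
A comparable sketch of the hash-based strategy, again as an illustration of the idea rather than what HSQLDB really does. The name `distinct_by_hashing` and the sample keys are assumptions made for the example; a Python `dict` stands in for the hash table.

```python
def distinct_by_hashing(values):
    """Return the distinct values using a hash table keyed by value."""
    seen = {}  # one entry per distinct value; dicts keep insertion order
    # Single pass over the table: duplicates overwrite the existing entry.
    for item in values:
        seen[item] = True
    # The keys of the hash table are the distinct values.
    return list(seen)


# Hypothetical sample data standing in for the business_key column.
keys = ["K3", "K1", "K3", "K2", "K1"]
print(distinct_by_hashing(keys))  # ['K3', 'K1', 'K2']
```

The table holds one entry per distinct key, so with 5,000 keys it stays small no matter how many rows are scanned, while with 500,000 distinct keys it grows as large as the input.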

The first method optimizes for memory usage: it would still perform reasonably well when part of the temporary table has to be swapped out. The second method optimizes for speed, but potentially requires a large amount of memory if there are a lot of different keys.

Since you either have enough memory or few different keys, the second method outperforms the first. It's not unusual to see performance differences of 10x or even 100x between two execution plans.
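
The starting premise, that both queries return the same set of keys, is easy to check on a small data set. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in engine (an assumption; it is not HSQLDB and says nothing about HSQLDB's timings), with a made-up 1,000-row `memory` table.

```python
import sqlite3

# In-memory SQLite database standing in for the HSQLDB table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memory (business_key TEXT, concept TEXT, attrib TEXT, value TEXT)"
)

# Hypothetical test data: 1,000 rows spread over 50 business keys.
rows = [
    (f"K{i % 50}", "case", "status", "open" if i % 3 else "closed")
    for i in range(1000)
]
conn.executemany("INSERT INTO memory VALUES (?, ?, ?, ?)", rows)

where = "concept <> 'case' OR attrib <> 'status' OR value <> 'closed'"

distinct_keys = {
    r[0]
    for r in conn.execute(f"SELECT DISTINCT business_key FROM memory WHERE {where}")
}
grouped_keys = {
    r[0]
    for r in conn.execute(
        f"SELECT business_key FROM memory WHERE {where} GROUP BY business_key"
    )
}

# Both formulations yield the same set of keys; only the execution plan may differ.
print(distinct_keys == grouped_keys)  # True
```

Because the results are identical, whichever form the engine executes faster can be used interchangeably.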
