[sql] Select first row in each GROUP BY group?

As the title suggests, I'd like to select the first row of each set of rows grouped with a GROUP BY.

Specifically, if I've got a purchases table that looks like this:

SELECT * FROM purchases;

My Output:

id | customer | total
---+----------+------
 1 | Joe      | 5
 2 | Sally    | 3
 3 | Joe      | 2
 4 | Sally    | 1

I'd like to query for the id of the largest purchase (total) made by each customer. Something like this:

SELECT FIRST(id), customer, FIRST(total)
FROM  purchases
GROUP BY customer
ORDER BY total DESC;

Expected Output:

FIRST(id) | customer | FIRST(total)
----------+----------+-------------
        1 | Joe      | 5
        2 | Sally    | 3

This question is related to sql sqlite postgresql group-by greatest-n-per-group

The answer is


I use this way (postgresql only): https://wiki.postgresql.org/wiki/First/last_%28aggregate%29

-- Create a function that always returns the first non-NULL item
CREATE OR REPLACE FUNCTION public.first_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
        SELECT $1;
$$;

-- And then wrap an aggregate around it
CREATE AGGREGATE public.first (
        sfunc    = public.first_agg,
        basetype = anyelement,
        stype    = anyelement
);

-- Create a function that always returns the last non-NULL item
CREATE OR REPLACE FUNCTION public.last_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
        SELECT $2;
$$;

-- And then wrap an aggregate around it
CREATE AGGREGATE public.last (
        sfunc    = public.last_agg,
        basetype = anyelement,
        stype    = anyelement
);

Then your example should work almost as is:

SELECT FIRST(id), customer, FIRST(total)
FROM  purchases
GROUP BY customer
ORDER BY FIRST(total) DESC;

CAVEAT: It ignore's NULL rows


Edit 1 - Use the postgres extension instead

Now I use this way: http://pgxn.org/dist/first_last_agg/

To install on ubuntu 14.04:

apt-get install postgresql-server-dev-9.3 git build-essential -y
git clone git://github.com/wulczer/first_last_agg.git
cd first_last_app
make && sudo make install
psql -c 'create extension first_last_agg'

It's a postgres extension that gives you first and last functions; apparently faster than the above way.


Edit 2 - Ordering and filtering

If you use aggregate functions (like these), you can order the results, without the need to have the data already ordered:

http://www.postgresql.org/docs/current/static/sql-expressions.html#SYNTAX-AGGREGATES

So the equivalent example, with ordering would be something like:

SELECT first(id order by id), customer, first(total order by id)
  FROM purchases
 GROUP BY customer
 ORDER BY first(total);

Of course you can order and filter as you deem fit within the aggregate; it's very powerful syntax.


This way it work for me:

SELECT article, dealer, price
FROM   shop s1
WHERE  price=(SELECT MAX(s2.price)
              FROM shop s2
              WHERE s1.article = s2.article
              GROUP BY s2.article)
ORDER BY article;

Select highest price on each article


In SQL Server you can do this:

SELECT *
FROM (
SELECT ROW_NUMBER()
OVER(PARTITION BY customer
ORDER BY total DESC) AS StRank, *
FROM Purchases) n
WHERE StRank = 1

Explaination:Here Group by is done on the basis of customer and then order it by total then each such group is given serial number as StRank and we are taking out first 1 customer whose StRank is 1


Benchmark

Testing the most interesting candidates with Postgres 9.4 and 9.5 with a halfway realistic table of 200k rows in purchases and 10k distinct customer_id (avg. 20 rows per customer).

For Postgres 9.5 I ran a 2nd test with effectively 86446 distinct customers. See below (avg. 2.3 rows per customer).

Setup

Main table

CREATE TABLE purchases (
  id          serial
, customer_id int  -- REFERENCES customer
, total       int  -- could be amount of money in Cent
, some_column text -- to make the row bigger, more realistic
);

I use a serial (PK constraint added below) and an integer customer_id since that's a more typical setup. Also added some_column to make up for typically more columns.

Dummy data, PK, index - a typical table also has some dead tuples:

INSERT INTO purchases (customer_id, total, some_column)    -- insert 200k rows
SELECT (random() * 10000)::int             AS customer_id  -- 10k customers
     , (random() * random() * 100000)::int AS total     
     , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM   generate_series(1,200000) g;

ALTER TABLE purchases ADD CONSTRAINT purchases_id_pkey PRIMARY KEY (id);

DELETE FROM purchases WHERE random() > 0.9; -- some dead rows

INSERT INTO purchases (customer_id, total, some_column)
SELECT (random() * 10000)::int             AS customer_id  -- 10k customers
     , (random() * random() * 100000)::int AS total     
     , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM   generate_series(1,20000) g;  -- add 20k to make it ~ 200k

CREATE INDEX purchases_3c_idx ON purchases (customer_id, total DESC, id);

VACUUM ANALYZE purchases;

customer table - for superior query:

CREATE TABLE customer AS
SELECT customer_id, 'customer_' || customer_id AS customer
FROM   purchases
GROUP  BY 1
ORDER  BY 1;

ALTER TABLE customer ADD CONSTRAINT customer_customer_id_pkey PRIMARY KEY (customer_id);

VACUUM ANALYZE customer;

In my second test for 9.5 I used the same setup, but with random() * 100000 to generate customer_id to get only few rows per customer_id.

Object sizes for table purchases

Generated with a query taken from this related answer:

               what                | bytes/ct | bytes_pretty | bytes_per_row
-----------------------------------+----------+--------------+---------------
 core_relation_size                | 20496384 | 20 MB        |           102
 visibility_map                    |        0 | 0 bytes      |             0
 free_space_map                    |    24576 | 24 kB        |             0
 table_size_incl_toast             | 20529152 | 20 MB        |           102
 indexes_size                      | 10977280 | 10 MB        |            54
 total_size_incl_toast_and_indexes | 31506432 | 30 MB        |           157
 live_rows_in_text_representation  | 13729802 | 13 MB        |            68
 ------------------------------    |          |              |
 row_count                         |   200045 |              |
 live_tuples                       |   200045 |              |
 dead_tuples                       |    19955 |              |

Queries

1. row_number() in CTE, (see other answer)

WITH cte AS (
   SELECT id, customer_id, total
        , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   )
SELECT id, customer_id, total
FROM   cte
WHERE  rn = 1;

  1. row_number() in subquery (my optimization)
SELECT id, customer_id, total
FROM   (
   SELECT id, customer_id, total
        , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   ) sub
WHERE  rn = 1;

3. DISTINCT ON (see other answer)

SELECT DISTINCT ON (customer_id)
       id, customer_id, total
FROM   purchases
ORDER  BY customer_id, total DESC, id;

4. rCTE with LATERAL subquery (see here)

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT id, customer_id, total
   FROM   purchases
   ORDER  BY customer_id, total DESC
   LIMIT  1
   )
   UNION ALL
   SELECT u.*
   FROM   cte c
   ,      LATERAL (
      SELECT id, customer_id, total
      FROM   purchases
      WHERE  customer_id > c.customer_id  -- lateral reference
      ORDER  BY customer_id, total DESC
      LIMIT  1
      ) u
   )
SELECT id, customer_id, total
FROM   cte
ORDER  BY customer_id;

5. customer table with LATERAL (see here)

SELECT l.*
FROM   customer c
,      LATERAL (
   SELECT id, customer_id, total
   FROM   purchases
   WHERE  customer_id = c.customer_id  -- lateral reference
   ORDER  BY total DESC
   LIMIT  1
   ) l;

6. array_agg() with ORDER BY (see other answer)

SELECT (array_agg(id ORDER BY total DESC))[1] AS id
     , customer_id
     , max(total) AS total
FROM   purchases
GROUP  BY customer_id;

Results

Execution time for above queries with EXPLAIN ANALYZE (and all options off), best of 5 runs.

All queries used an Index Only Scan on purchases2_3c_idx (among other steps). Some of them just for the smaller size of the index, others more effectively.

A. Postgres 9.4 with 200k rows and ~ 20 per customer_id

1. 273.274 ms  
2. 194.572 ms  
3. 111.067 ms  
4.  92.922 ms  
5.  37.679 ms  -- winner
6. 189.495 ms

B. The same with Postgres 9.5

1. 288.006 ms
2. 223.032 ms  
3. 107.074 ms  
4.  78.032 ms  
5.  33.944 ms  -- winner
6. 211.540 ms  

C. Same as B., but with ~ 2.3 rows per customer_id

1. 381.573 ms
2. 311.976 ms
3. 124.074 ms  -- winner
4. 710.631 ms
5. 311.976 ms
6. 421.679 ms

Related benchmarks

Here is a new one by "ogr" testing with 10M rows and 60k unique "customers" on Postgres 11.5 (current as of Sep. 2019). Results are still in line with what we have seen so far:

Original (outdated) benchmark from 2011

I ran three tests with PostgreSQL 9.1 on a real life table of 65579 rows and single-column btree indexes on each of the three columns involved and took the best execution time of 5 runs.
Comparing @OMGPonies' first query (A) to the above DISTINCT ON solution (B):

  1. Select the whole table, results in 5958 rows in this case.
A: 567.218 ms
B: 386.673 ms
  1. Use condition WHERE customer BETWEEN x AND y resulting in 1000 rows.
A: 249.136 ms
B:  55.111 ms
  1. Select a single customer with WHERE customer = x.
A:   0.143 ms
B:   0.072 ms

Same test repeated with the index described in the other answer

CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);

1A: 277.953 ms  
1B: 193.547 ms

2A: 249.796 ms -- special index not used  
2B:  28.679 ms

3A:   0.120 ms  
3B:   0.048 ms

In Postgres you can use array_agg like this:

SELECT  customer,
        (array_agg(id ORDER BY total DESC))[1],
        max(total)
FROM purchases
GROUP BY customer

This will give you the id of each customer's largest purchase.

Some things to note:

  • array_agg is an aggregate function, so it works with GROUP BY.
  • array_agg lets you specify an ordering scoped to just itself, so it doesn't constrain the structure of the whole query. There is also syntax for how you sort NULLs, if you need to do something different from the default.
  • Once we build the array, we take the first element. (Postgres arrays are 1-indexed, not 0-indexed).
  • You could use array_agg in a similar way for your third output column, but max(total) is simpler.
  • Unlike DISTINCT ON, using array_agg lets you keep your GROUP BY, in case you want that for other reasons.

Very fast solution

SELECT a.* 
FROM
    purchases a 
    JOIN ( 
        SELECT customer, min( id ) as id 
        FROM purchases 
        GROUP BY customer 
    ) b USING ( id );

and really very fast if table is indexed by id:

create index purchases_id on purchases (id);

In PostgreSQL this is typically simpler and faster (more performance optimization below):

SELECT DISTINCT ON (customer)
       id, customer, total
FROM   purchases
ORDER  BY customer, total DESC, id;

Or shorter (if not as clear) with ordinal numbers of output columns:

SELECT DISTINCT ON (2)
       id, customer, total
FROM   purchases
ORDER  BY 2, 3 DESC, 1;

If total can be NULL (won't hurt either way, but you'll want to match existing indexes):

...
ORDER  BY customer, total DESC NULLS LAST, id;

Major points

DISTINCT ON is a PostgreSQL extension of the standard (where only DISTINCT on the whole SELECT list is defined).

List any number of expressions in the DISTINCT ON clause, the combined row value defines duplicates. The manual:

Obviously, two rows are considered distinct if they differ in at least one column value. Null values are considered equal in this comparison.

Bold emphasis mine.

DISTINCT ON can be combined with ORDER BY. Leading expressions in ORDER BY must be in the set of expressions in DISTINCT ON, but you can rearrange order among those freely. Example.
You can add additional expressions to ORDER BY to pick a particular row from each group of peers. Or, as the manual puts it:

The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.

I added id as last item to break ties:
"Pick the row with the smallest id from each group sharing the highest total."

To order results in a way that disagrees with the sort order determining the first per group, you can nest above query in an outer query with another ORDER BY. Example.

If total can be NULL, you most probably want the row with the greatest non-null value. Add NULLS LAST like demonstrated. See:

The SELECT list is not constrained by expressions in DISTINCT ON or ORDER BY in any way. (Not needed in the simple case above):

  • You don't have to include any of the expressions in DISTINCT ON or ORDER BY.

  • You can include any other expression in the SELECT list. This is instrumental for replacing much more complex queries with subqueries and aggregate / window functions.

I tested with Postgres versions 8.3 – 13. But the feature has been there at least since version 7.1, so basically always.

Index

The perfect index for the above query would be a multi-column index spanning all three columns in matching sequence and with matching sort order:

CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);

May be too specialized. But use it if read performance for the particular query is crucial. If you have DESC NULLS LAST in the query, use the same in the index so that sort order matches and the index is applicable.

Effectiveness / Performance optimization

Weigh cost and benefit before creating tailored indexes for each query. The potential of above index largely depends on data distribution.

The index is used because it delivers pre-sorted data. In Postgres 9.2 or later the query can also benefit from an index only scan if the index is smaller than the underlying table. The index has to be scanned in its entirety, though.

For few rows per customer (high cardinality in column customer), this is very efficient. Even more so if you need sorted output anyway. The benefit shrinks with a growing number of rows per customer.
Ideally, you have enough work_mem to process the involved sort step in RAM and not spill to disk. But generally setting work_mem too high can have adverse effects. Consider SET LOCAL for exceptionally big queries. Find how much you need with EXPLAIN ANALYZE. Mention of "Disk:" in the sort step indicates the need for more:

For many rows per customer (low cardinality in column customer), a loose index scan (a.k.a. "skip scan") would be (much) more efficient, but that's not implemented up to Postgres 13. (An implementation for index-only scans is in development for Postgres 14. See here and here.)
For now, there are faster query techniques to substitute for this. In particular if you have a separate table holding unique customers, which is the typical use case. But also if you don't:

Benchmark

I had a simple benchmark here which is outdated by now. I replaced it with a detailed benchmark in this separate answer.


Use ARRAY_AGG function for PostgreSQL, U-SQL, IBM DB2, and Google BigQuery SQL:

SELECT customer, (ARRAY_AGG(id ORDER BY total DESC))[1], MAX(total)
FROM purchases
GROUP BY customer

For SQl Server the most efficient way is:

with
ids as ( --condition for split table into groups
    select i from (values (9),(12),(17),(18),(19),(20),(22),(21),(23),(10)) as v(i) 
) 
,src as ( 
    select * from yourTable where  <condition> --use this as filter for other conditions
)
,joined as (
    select tops.* from ids 
    cross apply --it`s like for each rows
    (
        select top(1) * 
        from src
        where CommodityId = ids.i 
    ) as tops
)
select * from joined

and don't forget to create clustered index for used columns


  • If you want to select any (by your some specific condition) row from the set of aggregated rows.

  • If you want to use another (sum/avg) aggregation function in addition to max/min. Thus you can not use clue with DISTINCT ON

You can use next subquery:

SELECT  
    (  
       SELECT **id** FROM t2   
       WHERE id = ANY ( ARRAY_AGG( tf.id ) ) AND amount = MAX( tf.amount )   
    ) id,  
    name,   
    MAX(amount) ma,  
    SUM( ratio )  
FROM t2  tf  
GROUP BY name

You can replace amount = MAX( tf.amount ) with any condition you want with one restriction: This subquery must not return more than one row

But if you wanna to do such things you probably looking for window functions


The solution is not very efficient as pointed by Erwin, because of presence of SubQs

select * from purchases p1 where total in
(select max(total) from purchases where p1.customer=customer) order by total desc;

In PostgreSQL, another possibility is to use the first_value window function in combination with SELECT DISTINCT:

select distinct customer_id,
                first_value(row(id, total)) over(partition by customer_id order by total desc, id)
from            purchases;

I created a composite (id, total), so both values are returned by the same aggregate. You can of course always apply first_value() twice.


Snowflake/Teradata supports QUALIFY clause which works like HAVING for windowed functions:

SELECT id, customer, total
FROM PURCHASES
QUALIFY ROW_NUMBER() OVER(PARTITION BY p.customer ORDER BY p.total DESC) = 1

The Query:

SELECT purchases.*
FROM purchases
LEFT JOIN purchases as p 
ON 
  p.customer = purchases.customer 
  AND 
  purchases.total < p.total
WHERE p.total IS NULL

HOW DOES THAT WORK! (I've been there)

We want to make sure that we only have the highest total for each purchase.


Some Theoretical Stuff (skip this part if you only want to understand the query)

Let Total be a function T(customer,id) where it returns a value given the name and id To prove that the given total (T(customer,id)) is the highest we have to prove that We want to prove either

  • ?x T(customer,id) > T(customer,x) (this total is higher than all other total for that customer)

OR

  • ¬?x T(customer, id) < T(customer, x) (there exists no higher total for that customer)

The first approach will need us to get all the records for that name which I do not really like.

The second one will need a smart way to say there can be no record higher than this one.


Back to SQL

If we left joins the table on the name and total being less than the joined table:

LEFT JOIN purchases as p 
ON 
p.customer = purchases.customer 
AND 
purchases.total < p.total

we make sure that all records that have another record with the higher total for the same user to be joined:

+--------------+---------------------+-----------------+------+------------+---------+
| purchases.id |  purchases.customer | purchases.total | p.id | p.customer | p.total |
+--------------+---------------------+-----------------+------+------------+---------+
|            1 | Tom                 |             200 |    2 | Tom        |     300 |
|            2 | Tom                 |             300 |      |            |         |
|            3 | Bob                 |             400 |    4 | Bob        |     500 |
|            4 | Bob                 |             500 |      |            |         |
|            5 | Alice               |             600 |    6 | Alice      |     700 |
|            6 | Alice               |             700 |      |            |         |
+--------------+---------------------+-----------------+------+------------+---------+

That will help us filter for the highest total for each purchase with no grouping needed:

WHERE p.total IS NULL
    
+--------------+----------------+-----------------+------+--------+---------+
| purchases.id | purchases.name | purchases.total | p.id | p.name | p.total |
+--------------+----------------+-----------------+------+--------+---------+
|            2 | Tom            |             300 |      |        |         |
|            4 | Bob            |             500 |      |        |         |
|            6 | Alice          |             700 |      |        |         |
+--------------+----------------+-----------------+------+--------+---------+

And that's the answer we need.


This is common problem, which already has well tested and highly optimized solutions. Personally I prefer the left join solution by Bill Karwin (the original post with lots of other solutions).

Note that bunch of solutions to this common problem can surprisingly be found in the one of most official sources, MySQL manual! See Examples of Common Queries :: The Rows Holding the Group-wise Maximum of a Certain Column.


The accepted OMG Ponies' "Supported by any database" solution has good speed from my test.

Here I provide a same-approach, but more complete and clean any-database solution. Ties are considered (assume desire to get only one row for each customer, even multiple records for max total per customer), and other purchase fields (e.g. purchase_payment_id) will be selected for the real matching rows in the purchase table.

Supported by any database:

select * from purchase
join (
    select min(id) as id from purchase
    join (
        select customer, max(total) as total from purchase
        group by customer
    ) t1 using (customer, total)
    group by customer
) t2 using (id)
order by customer

This query is reasonably fast especially when there is a composite index like (customer, total) on the purchase table.

Remark:

  1. t1, t2 are subquery alias which could be removed depending on database.

  2. Caveat: the using (...) clause is currently not supported in MS-SQL and Oracle db as of this edit on Jan 2017. You have to expand it yourself to e.g. on t2.id = purchase.id etc. The USING syntax works in SQLite, MySQL and PostgreSQL.


Examples related to sql

Passing multiple values for same variable in stored procedure SQL permissions for roles Generic XSLT Search and Replace template Access And/Or exclusions Pyspark: Filter dataframe based on multiple conditions Subtracting 1 day from a timestamp date PYODBC--Data source name not found and no default driver specified select rows in sql with latest date for each ID repeated multiple times ALTER TABLE DROP COLUMN failed because one or more objects access this column Create Local SQL Server database

Examples related to sqlite

getting " (1) no such column: _id10 " error Laravel: PDOException: could not find driver auto create database in Entity Framework Core How to open .SQLite files Accessing an SQLite Database in Swift When does SQLiteOpenHelper onCreate() / onUpgrade() run? Attempt to write a readonly database - Django w/ SELinux error Android sqlite how to check if a record exists How can I add the sqlite3 module to Python? "Insert if not exists" statement in SQLite

Examples related to postgresql

Subtracting 1 day from a timestamp date pgadmin4 : postgresql application server could not be contacted. Psql could not connect to server: No such file or directory, 5432 error? How to persist data in a dockerized postgres database using volumes input file appears to be a text format dump. Please use psql Postgres: check if array field contains value? Add timestamp column with default NOW() for new rows only Can't connect to Postgresql on port 5432 How to insert current datetime in postgresql insert query Connecting to Postgresql in a docker container from outside

Examples related to group-by

SELECT list is not in GROUP BY clause and contains nonaggregated column .... incompatible with sql_mode=only_full_group_by Count unique values using pandas groupby Pandas group-by and sum Count unique values with pandas per groups Group dataframe and get sum AND count? Error related to only_full_group_by when executing a query in MySql Pandas sum by groupby, but exclude certain columns Using DISTINCT along with GROUP BY in SQL Server Python Pandas : group by in group by and average? How do I create a new column from the output of pandas groupby().sum()?

Examples related to greatest-n-per-group

How to select the rows with maximum values in each group with dplyr? MAX function in where clause mysql Pandas get topmost n records within each group SQL Left Join first match only Select info from table where row has max date FORCE INDEX in MySQL - where do I put it? GROUP BY having MAX date How can I select rows with most recent timestamp for each key value? Select row with most recent date per user How to select id with max date group by category in PostgreSQL?