PostgreSQL - fetch the row which has the Max value for a column

Question

I m dealing with a Postgres table  called  lives   that contains records with columns for time stamp  usr id  transaction id  and lives remaining  I need a query that will give me the most recent lives remaining total for each usr id   There are multiple users  distinct usr id s  time stamp is not a unique identifier  sometimes user events  one by row in the table  will occur with the same time stamp  trans id is unique only for very small time ranges  over time it repeats remaining lives  for a given user  can both increase and decrease over time   example   time stamp lives remaining usr id trans id -----------------------------------------   07 00          1           1      1       09 00          4           2      2       10 00          2           3      3       10 00          1           2      4       11 00          4           1      5       11 00          3           1      6       13 00          3           3      1       As I will need to access other columns of the row with the latest data for each given usr id  I need a query that gives a result like this   time stamp lives remaining usr id trans id -----------------------------------------   11 00          3           1      6       10 00          1           2      4       13 00          3           3      1       As mentioned  each usr id can gain or lose lives  and sometimes these timestamped events occur so close together that they have the same timestamp  Therefore this query won t work   SELECT b time stamp b lives remaining b usr id b trans id FROM         SELECT usr id  max time stamp  AS max timestamp         FROM lives GROUP BY usr id ORDER BY usr id  a  JOIN lives b ON a max timestamp   b time stamp   Instead  I need to use both time stamp  first  and trans id  second  to identify the correct row  I also then need to pass that information from the subquery to the main query that will provide the data for the other columns of the appropriate rows  This is the hacked up query that I ve gotten to work   SELECT b time stamp b lives remaining b usr id b trans id FROM         SELECT usr id  max time stamp           trans id          AS max timestamp transid        FROM lives GROUP BY usr id ORDER BY usr id  a  JOIN lives b ON a max timestamp transid   b time stamp           b trans id  ORDER BY b usr id   Okay  so this works  but I don t like it  It requires a query within a query  a self join  and it seems to me that it could be much simpler by grabbing the row that MAX found to have the largest timestamp and trans id  The table  lives  has tens of millions of rows to parse  so I d like this query to be as fast and efficient as possible  I m new to RDBM and Postgres in particular  so I know that I need to make effective use of the proper indexes  I m a bit lost on how to optimize   I found a similar discussion here  Can I perform some type of Postgres equivalent to an Oracle analytic function   Any advice on accessing related column information used by an aggregate function  like MAX   creating indexes  and creating better queries would be much appreciated   P S  You can use the following to create my example case   create TABLE lives  time stamp timestamp  lives remaining integer                       usr id integer  trans id integer   insert into lives values   2000-01-01 07 00   1  1  1   insert into lives values   2000-01-01 09 00   4  2  2   insert into lives values   2000-01-01 10 00   2  3  3   insert into lives values   2000-01-01 10 00   1  2  4   insert into lives values   2000-01-01 11 00   4  1  5   insert into lives values   2000-01-01 11 00   3  1  6   insert into lives values   2000-01-01 13 00   3  3  1

User · Answer

I like the style of Mike Woodhouse s answer on the other page you mentioned   It s especially concise when the thing being maximised over is just a single column  in which case the subquery can just use MAX some col  and GROUP BY the other columns  but in your case you have a 2-part quantity to be maximised  you can still do so by using ORDER BY plus LIMIT 1 instead  as done by Quassnoi    SELECT    FROM lives outer WHERE  usr id  time stamp  trans id  IN       SELECT usr id  time stamp  trans id     FROM lives sq     WHERE sq usr id   outer usr id     ORDER BY trans id  time stamp     LIMIT 1     I find using the row-constructor syntax WHERE  a  b  c  IN  subquery  nice because it cuts down on the amount of verbiage needed

User · Answer

There is a new option in Postgressql 9 5 called DISTINCT ON  SELECT DISTINCT ON  location  location  time  report     FROM weather reports     ORDER BY location  time DESC    It eliminates duplicate rows an leaves only the first row as defined my the ORDER BY clause   see the official documentation

User · Answer

On a table with 158k pseudo-random rows  usr id uniformly distributed between 0 and 10k  trans id uniformly distributed between 0 and 30    By query cost  below  I am referring to Postgres  cost based optimizer s cost estimate  with Postgres  default xxx cost values   which is a weighed function estimate of required I O and CPU resources  you can obtain this by firing up PgAdminIII and running  Query Explain  F7   on the query with  Query Explain options  set to  Analyze    Quassnoy s query has a cost estimate of 745k      and completes in 1 3 seconds  given a compound index on  usr id  trans id  time stamp   Bill s query has a cost estimate of 93k  and completes in 2 9 seconds  given a compound index on  usr id  trans id   Query  1 below has a cost estimate of 16k  and completes in 800ms  given a compound index on  usr id  trans id  time stamp   Query  2 below has a cost estimate of 14k  and completes in 800ms  given a compound function index on  usr id  EXTRACT EPOCH FROM time stamp   trans id     this is Postgres-specific  Query  3 below  Postgres 8 4   has a cost estimate and completion time comparable to  or better than  query  2  given a compound index on  usr id  time stamp  trans id    it has the advantage of scanning the lives table only once and  should you temporarily increase  if needed  work mem to accommodate the sort in memory  it will be by far the fastest of all queries    All times above include retrieval of the full 10k rows result-set   Your goal is minimal cost estimate and minimal query execution time  with an emphasis on estimated cost   Query execution can dependent significantly on runtime conditions  e g  whether relevant rows are already fully cached in memory or not   whereas the cost estimate is not   On the other hand  keep in mind that cost estimate is exactly that  an estimate   The best query execution time is obtained when running on a dedicated database without load  e g  playing with pgAdminIII on a development PC    Query time will vary in production based on actual machine load data access spread   When one query appears slightly faster   lt 20   than the other but has a much higher cost  it will generally be wiser to choose the one with higher execution time but lower cost   When you expect that there will be no competition for memory on your production machine at the time the query is run  e g  the RDBMS cache and filesystem cache won t be thrashed by concurrent queries and or filesystem activity  then the query time you obtained in standalone  e g  pgAdminIII on a development PC  mode will be representative   If there is contention on the production system  query time will degrade proportionally to the estimated cost ratio  as the query with the lower cost does not rely as much on cache whereas the query with higher cost will revisit the same data over and over  triggering additional I O in the absence of a stable cache   e g                  cost   time  dedicated machine        time  under load    ------------------- -------------------------- -----------------------  some query A    5k    all data cached   900ms    less i o      1000ms   some query B   50k    all data cached   900ms    lots of i o  10000ms     Do not forget to run ANALYZE lives once after creating the necessary indices     Query  1  -- incrementally narrow down the result set via inner joins --  the CBO may elect to perform one full index scan combined --  with cascading index lookups  or as hash aggregates terminated --  by one nested index lookup into lives - on my machine --  the latter query plan was selected given my memory settings and --  histogram SELECT   l1    FROM   lives AS l1  INNER JOIN       SELECT       usr id        MAX time stamp  AS time stamp max      FROM       lives      GROUP BY       usr id     AS l2  ON   l1 usr id       l2 usr id AND   l1 time stamp   l2 time stamp max  INNER JOIN       SELECT       usr id        time stamp        MAX trans id  AS trans max      FROM       lives      GROUP BY       usr id  time stamp     AS l3  ON   l1 usr id       l3 usr id AND   l1 time stamp   l3 time stamp AND   l1 trans id     l3 trans max   Query  2  -- cheat to obtain a max of the  time stamp  trans id  tuple in one pass -- this results in a single table scan and one nested index lookup into lives  --  by far the least I O intensive operation even in case of great scarcity --  of memory  least reliant on cache for the best performance  SELECT   l1    FROM   lives AS l1  INNER JOIN      SELECT      usr id       MAX ARRAY EXTRACT EPOCH FROM time stamp  trans id          AS compound time stamp     FROM      lives     GROUP BY      usr id     AS l2 ON   l1 usr id   l2 usr id AND   EXTRACT EPOCH FROM l1 time stamp    l2 compound time stamp 1  AND   l1 trans id   l2 compound time stamp 2    2013 01 29 update  Finally  as of version 8 4  Postgres supports Window Function meaning you can write something as simple and efficient as   Query  3  -- use Window Functions -- performs a SINGLE scan of the table SELECT DISTINCT ON  usr id    last value time stamp  OVER wnd    last value lives remaining  OVER wnd    usr id    last value trans id  OVER wnd  FROM lives  WINDOW wnd AS      PARTITION BY usr id ORDER BY time stamp  trans id    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

User · Answer

SELECT  l   FROM              SELECT DISTINCT usr id         FROM   lives           lo  lives l WHERE   l ctid             SELECT ctid         FROM   lives li         WHERE  li usr id   lo usr id         ORDER BY           time stamp DESC  trans id DESC         LIMIT 1             Creating an index on  usr id  time stamp  trans id  will greatly improve this query   You should always  always have some kind of PRIMARY KEY in your tables

User · Answer

Actaully there s a hacky solution for this problem  Let s say you want to select the biggest tree of each forest in a region   SELECT  array agg tree id ORDER BY tree size size    1  FROM tree JOIN forest ON  tree forest   forest id  GROUP BY forest id   When you group trees by forests there will be an unsorted list of trees and you need to find the biggest one  First thing you should do is to sort the rows by their sizes and select the first one of your list  It may seems inefficient but if you have millions of rows it will be quite faster than the solutions that includes JOIN s and WHERE conditions   BTW  note that ORDER BY for array agg is introduced in Postgresql 9 0

User · Answer

I think you ve got one major problem here  there s no monotonically increasing  counter  to guarantee that a given row has happened later in time than another  Take this example   timestamp   lives remaining   user id   trans id 10 00       4                 3         5 10 00       5                 3         6 10 00       3                 3         1 10 00       2                 3         2   You cannot determine from this data which is the most recent entry  Is it the second one or the last one  There is no sort or max   function you can apply to any of this data to give you the correct answer   Increasing the resolution of the timestamp would be a huge help  Since the database engine serializes requests  with sufficient resolution you can guarantee that no two timestamps will be the same   Alternatively  use a trans id that won t roll over for a very  very long time  Having a trans id that rolls over means you can t tell  for the same timestamp  whether trans id 6 is more recent than trans id 1 unless you do some complicated math

User · Answer

I would propose a clean version based on DISTINCT ON  see docs    SELECT DISTINCT ON  usr id      time stamp      lives remaining      usr id      trans id FROM lives ORDER BY usr id  time stamp DESC  trans id DESC

User · Answer

You can do it with window functions SELECT t   FROM      SELECT                    ROW NUMBER   OVER PARTITION BY usr id ORDER BY time stamp DESC  as r     FROM lives  as t WHERE t r   1

User · Answer

Here s another method  which happens to use no correlated subqueries or GROUP BY   I m not expert in PostgreSQL performance tuning  so I suggest you try both this and the solutions given by other folks to see which works better for you   SELECT l1   FROM lives l1 LEFT OUTER JOIN lives l2   ON  l1 usr id   l2 usr id AND  l1 time stamp  lt  l2 time stamp     OR  l1 time stamp   l2 time stamp AND l1 trans id  lt  l2 trans id    WHERE l2 usr id IS NULL ORDER BY l1 usr id    I am assuming that trans id is unique at least over any given value of time stamp

[sql] PostgreSQL - fetch the row which has the Max value for a column

The answer is

Examples related to sql

Examples related to postgresql

Examples related to query-optimization

Examples related to cbo

Examples related to cost-based-optimizer

Tags