I'm dealing with a Postgres table (called "lives") that contains records with columns for time_stamp, usr_id, transaction_id, and lives_remaining. I need a query that will give me the most recent lives_remaining total for each usr_id
example:
time_stamp|lives_remaining|usr_id|trans_id ----------------------------------------- 07:00 | 1 | 1 | 1 09:00 | 4 | 2 | 2 10:00 | 2 | 3 | 3 10:00 | 1 | 2 | 4 11:00 | 4 | 1 | 5 11:00 | 3 | 1 | 6 13:00 | 3 | 3 | 1
As I will need to access other columns of the row with the latest data for each given usr_id, I need a query that gives a result like this:
time_stamp|lives_remaining|usr_id|trans_id ----------------------------------------- 11:00 | 3 | 1 | 6 10:00 | 1 | 2 | 4 13:00 | 3 | 3 | 1
As mentioned, each usr_id can gain or lose lives, and sometimes these timestamped events occur so close together that they have the same timestamp! Therefore this query won't work:
SELECT b.time_stamp,b.lives_remaining,b.usr_id,b.trans_id FROM
(SELECT usr_id, max(time_stamp) AS max_timestamp
FROM lives GROUP BY usr_id ORDER BY usr_id) a
JOIN lives b ON a.max_timestamp = b.time_stamp
Instead, I need to use both time_stamp (first) and trans_id (second) to identify the correct row. I also then need to pass that information from the subquery to the main query that will provide the data for the other columns of the appropriate rows. This is the hacked up query that I've gotten to work:
SELECT b.time_stamp,b.lives_remaining,b.usr_id,b.trans_id FROM
(SELECT usr_id, max(time_stamp || '*' || trans_id)
AS max_timestamp_transid
FROM lives GROUP BY usr_id ORDER BY usr_id) a
JOIN lives b ON a.max_timestamp_transid = b.time_stamp || '*' || b.trans_id
ORDER BY b.usr_id
Okay, so this works, but I don't like it. It requires a query within a query, a self join, and it seems to me that it could be much simpler by grabbing the row that MAX found to have the largest timestamp and trans_id. The table "lives" has tens of millions of rows to parse, so I'd like this query to be as fast and efficient as possible. I'm new to RDBM and Postgres in particular, so I know that I need to make effective use of the proper indexes. I'm a bit lost on how to optimize.
I found a similar discussion here. Can I perform some type of Postgres equivalent to an Oracle analytic function?
Any advice on accessing related column information used by an aggregate function (like MAX), creating indexes, and creating better queries would be much appreciated!
P.S. You can use the following to create my example case:
create TABLE lives (time_stamp timestamp, lives_remaining integer,
usr_id integer, trans_id integer);
insert into lives values ('2000-01-01 07:00', 1, 1, 1);
insert into lives values ('2000-01-01 09:00', 4, 2, 2);
insert into lives values ('2000-01-01 10:00', 2, 3, 3);
insert into lives values ('2000-01-01 10:00', 1, 2, 4);
insert into lives values ('2000-01-01 11:00', 4, 1, 5);
insert into lives values ('2000-01-01 11:00', 3, 1, 6);
insert into lives values ('2000-01-01 13:00', 3, 3, 1);
This question is related to
sql
postgresql
query-optimization
cbo
cost-based-optimizer
I like the style of Mike Woodhouse's answer on the other page you mentioned. It's especially concise when the thing being maximised over is just a single column, in which case the subquery can just use MAX(some_col)
and GROUP BY
the other columns, but in your case you have a 2-part quantity to be maximised, you can still do so by using ORDER BY
plus LIMIT 1
instead (as done by Quassnoi):
SELECT *
FROM lives outer
WHERE (usr_id, time_stamp, trans_id) IN (
SELECT usr_id, time_stamp, trans_id
FROM lives sq
WHERE sq.usr_id = outer.usr_id
ORDER BY trans_id, time_stamp
LIMIT 1
)
I find using the row-constructor syntax WHERE (a, b, c) IN (subquery)
nice because it cuts down on the amount of verbiage needed.
There is a new option in Postgressql 9.5 called DISTINCT ON
SELECT DISTINCT ON (location) location, time, report
FROM weather_reports
ORDER BY location, time DESC;
It eliminates duplicate rows an leaves only the first row as defined my the ORDER BY clause.
see the official documentation
SELECT l.*
FROM (
SELECT DISTINCT usr_id
FROM lives
) lo, lives l
WHERE l.ctid = (
SELECT ctid
FROM lives li
WHERE li.usr_id = lo.usr_id
ORDER BY
time_stamp DESC, trans_id DESC
LIMIT 1
)
Creating an index on (usr_id, time_stamp, trans_id)
will greatly improve this query.
You should always, always have some kind of PRIMARY KEY
in your tables.
Actaully there's a hacky solution for this problem. Let's say you want to select the biggest tree of each forest in a region.
SELECT (array_agg(tree.id ORDER BY tree_size.size)))[1]
FROM tree JOIN forest ON (tree.forest = forest.id)
GROUP BY forest.id
When you group trees by forests there will be an unsorted list of trees and you need to find the biggest one. First thing you should do is to sort the rows by their sizes and select the first one of your list. It may seems inefficient but if you have millions of rows it will be quite faster than the solutions that includes JOIN
's and WHERE
conditions.
BTW, note that ORDER_BY
for array_agg
is introduced in Postgresql 9.0
I think you've got one major problem here: there's no monotonically increasing "counter" to guarantee that a given row has happened later in time than another. Take this example:
timestamp lives_remaining user_id trans_id
10:00 4 3 5
10:00 5 3 6
10:00 3 3 1
10:00 2 3 2
You cannot determine from this data which is the most recent entry. Is it the second one or the last one? There is no sort or max() function you can apply to any of this data to give you the correct answer.
Increasing the resolution of the timestamp would be a huge help. Since the database engine serializes requests, with sufficient resolution you can guarantee that no two timestamps will be the same.
Alternatively, use a trans_id that won't roll over for a very, very long time. Having a trans_id that rolls over means you can't tell (for the same timestamp) whether trans_id 6 is more recent than trans_id 1 unless you do some complicated math.
I would propose a clean version based on DISTINCT ON
(see docs):
SELECT DISTINCT ON (usr_id)
time_stamp,
lives_remaining,
usr_id,
trans_id
FROM lives
ORDER BY usr_id, time_stamp DESC, trans_id DESC;
You can do it with window functions
SELECT t.*
FROM
(SELECT
*,
ROW_NUMBER() OVER(PARTITION BY usr_id ORDER BY time_stamp DESC) as r
FROM lives) as t
WHERE t.r = 1
Here's another method, which happens to use no correlated subqueries or GROUP BY. I'm not expert in PostgreSQL performance tuning, so I suggest you try both this and the solutions given by other folks to see which works better for you.
SELECT l1.*
FROM lives l1 LEFT OUTER JOIN lives l2
ON (l1.usr_id = l2.usr_id AND (l1.time_stamp < l2.time_stamp
OR (l1.time_stamp = l2.time_stamp AND l1.trans_id < l2.trans_id)))
WHERE l2.usr_id IS NULL
ORDER BY l1.usr_id;
I am assuming that trans_id
is unique at least over any given value of time_stamp
.
Source: Stackoverflow.com