Get top 1 row of each group

Question

I have a table from which I want to get the latest entry for each group. Here's the table:

DocumentStatusLogs table

    | ID | DocumentID | Status | DateCreated |
    | 2  | 1          | S1     | 7/29/2011   |
    | 3  | 1          | S2     | 7/30/2011   |
    | 6  | 1          | S1     | 8/02/2011   |
    | 1  | 2          | S1     | 7/28/2011   |
    | 4  | 2          | S2     | 7/30/2011   |
    | 5  | 2          | S3     | 8/01/2011   |
    | 6  | 3          | S1     | 8/02/2011   |

The table will be grouped by DocumentID and sorted by DateCreated in descending order. For each DocumentID, I want to get the latest status.

My preferred output:

    | DocumentID | Status | DateCreated |
    | 1          | S1     | 8/02/2011   |
    | 2          | S3     | 8/01/2011   |
    | 3          | S1     | 8/02/2011   |

Is there any aggregate function to get only the top row from each group? See the pseudo-code GetOnlyTheTop below:

    SELECT
      DocumentID,
      GetOnlyTheTop(Status),
      GetOnlyTheTop(DateCreated)
    FROM DocumentStatusLogs
    GROUP BY DocumentID
    ORDER BY DateCreated DESC

If such a function doesn't exist, is there any way I can achieve the output I want? Or, in the first place, could this be caused by an unnormalized database? I'm thinking that, since what I'm looking for is just one row, should that status also be located in the parent table?

Please see the parent table for more information.

Current Documents table

    | DocumentID | Title  | Content | DateCreated |
    | 1          | TitleA |         |             |
    | 2          | TitleB |         |             |
    | 3          | TitleC |         |             |

Should the parent table be like this so that I can easily access its status?

    | DocumentID | Title  | Content | DateCreated | CurrentStatus |
    | 1          | TitleA |         |             | s1            |
    | 2          | TitleB |         |             | s3            |
    | 3          | TitleC |         |             | s1            |

UPDATE: I just learned how to use "apply", which makes it easier to address such problems.

User · Answer

I know this is an old thread, but the TOP 1 WITH TIES solution is quite nice and might be helpful to some reading through the solutions.

    select top 1 with ties
       DocumentID,
       Status,
       DateCreated
    from DocumentStatusLogs
    order by row_number() over (partition by DocumentID order by DateCreated desc)

More about the TOP clause can be found in the SQL Server documentation.

User · Answer

I checked in SQLite that you can use the following simple query with GROUP BY:

    SELECT MAX(DateCreated), *
    FROM DocumentStatusLogs
    GROUP BY DocumentID

Here MAX helps to get the maximum DateCreated from each group.

But it seems that MySQL doesn't associate the *-columns with the row that holds the max DateCreated.
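The SQLite behaviour described above can be tried with a short script. This is only a sketch: it uses Python's standard `sqlite3` module, the function name `latest_via_bare_columns` is made up for illustration, and the dates are stored as ISO-8601 text (an assumption, so that `MAX` compares them chronologically). It relies on SQLite's documented rule that, in a query with a single `MAX()` aggregate, bare columns take their values from the row that holds the maximum. Other engines generally do not promise this.

```python
import sqlite3

# Sample data from the question; dates rewritten as ISO text (assumption)
# so that MAX() on the text column orders them chronologically.
SAMPLE_ROWS = [
    (2, 1, "S1", "2011-07-29"),
    (3, 1, "S2", "2011-07-30"),
    (6, 1, "S1", "2011-08-02"),
    (1, 2, "S1", "2011-07-28"),
    (4, 2, "S2", "2011-07-30"),
    (5, 2, "S3", "2011-08-01"),
    (6, 3, "S1", "2011-08-02"),
]


def latest_via_bare_columns():
    """Return (DocumentID, Status, DateCreated) of the latest row per document."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE DocumentStatusLogs "
        "(ID INTEGER, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
    )
    conn.executemany("INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)", SAMPLE_ROWS)
    # Status is a bare (non-aggregated) column; SQLite fills it from the
    # row that supplied MAX(DateCreated) within each group.
    rows = conn.execute(
        "SELECT DocumentID, Status, MAX(DateCreated) "
        "FROM DocumentStatusLogs GROUP BY DocumentID ORDER BY DocumentID"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in latest_via_bare_columns():
        print(row)
```

Because this depends on SQLite-specific semantics, it should not be expected to port to SQL Server or MySQL unchanged.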

User · Answer

Verifying Clint's awesome and correct answer from above:

The performance difference between the two queries below is interesting: the first accounts for 52% of the estimated batch cost and the second for 48%, a 4% improvement from using DISTINCT instead of ORDER BY. But ORDER BY has the advantage of being able to sort by multiple columns.

    IF (OBJECT_ID('tempdb..#DocumentStatusLogs') IS NOT NULL) BEGIN DROP TABLE #DocumentStatusLogs END

    CREATE TABLE #DocumentStatusLogs (
        [ID] int NOT NULL,
        [DocumentID] int NOT NULL,
        [Status] varchar(20),
        [DateCreated] datetime
    )

    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (2, 1, 'S1', '7/29/2011 1:00:00')
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (3, 1, 'S2', '7/30/2011 2:00:00')
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 1, 'S1', '8/02/2011 3:00:00')
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (1, 2, 'S1', '7/28/2011 4:00:00')
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (4, 2, 'S2', '7/30/2011 5:00:00')
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (5, 2, 'S3', '8/01/2011 6:00:00')
    INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 3, 'S1', '8/02/2011 7:00:00')

Option 1:

    SELECT
        [Extent1].[ID],
        [Extent1].[DocumentID],
        [Extent1].[Status],
        [Extent1].[DateCreated]
    FROM #DocumentStatusLogs AS [Extent1]
        OUTER APPLY (
            SELECT TOP 1
                [Extent2].[ID],
                [Extent2].[DocumentID],
                [Extent2].[Status],
                [Extent2].[DateCreated]
            FROM #DocumentStatusLogs AS [Extent2]
            WHERE [Extent1].[DocumentID] = [Extent2].[DocumentID]
            ORDER BY [Extent2].[DateCreated] DESC, [Extent2].[ID] DESC
        ) AS [Project2]
    WHERE ([Project2].[ID] IS NULL OR [Project2].[ID] = [Extent1].[ID])

Option 2:

    SELECT
        [Limit1].[ID] AS [ID],
        [Limit1].[DocumentID] AS [DocumentID],
        [Limit1].[Status] AS [Status],
        [Limit1].[DateCreated] AS [DateCreated]
    FROM (
        SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM #DocumentStatusLogs AS [Extent1]
    ) AS [Distinct1]
        OUTER APPLY (
            SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
            FROM (
                SELECT
                    [Extent2].[ID] AS [ID],
                    [Extent2].[DocumentID] AS [DocumentID],
                    [Extent2].[Status] AS [Status],
                    [Extent2].[DateCreated] AS [DateCreated]
                FROM #DocumentStatusLogs AS [Extent2]
                WHERE [Distinct1].[DocumentID] = [Extent2].[DocumentID]
            ) AS [Project2]
            ORDER BY [Project2].[ID] DESC
        ) AS [Limit1]

In Microsoft's Management Studio: after highlighting and running the first block, highlight both Option 1 and Option 2, then right click -> [Display Estimated Execution Plan]. Then run the entire thing to see the results.

Option 1 Results:

    ID  DocumentID  Status  DateCreated
    6   1           S1      8/2/11 3:00
    5   2           S3      8/1/11 6:00
    6   3           S1      8/2/11 7:00

Option 2 Results:

    ID  DocumentID  Status  DateCreated
    6   1           S1      8/2/11 3:00
    5   2           S3      8/1/11 6:00
    6   3           S1      8/2/11 7:00

Note:

I tend to use APPLY when I want a join to be 1-to-(1 of many).

I use a JOIN if I want the join to be 1-to-many, or many-to-many.

I avoid a CTE with ROW_NUMBER() unless I need to do something advanced and am OK with the windowing performance penalty.

I also avoid EXISTS / IN subqueries in the WHERE or ON clause, as I have experienced this causing some terrible execution plans. But mileage varies. Review the execution plan and profile performance where and when needed!

User · Answer

This is the most vanilla TSQL I can come up with:

    SELECT * FROM DocumentStatusLogs D1 JOIN
    (
      SELECT
        DocumentID, MAX(DateCreated) AS MaxDate
      FROM
        DocumentStatusLogs
      GROUP BY
        DocumentID
    ) D2
    ON
      D2.DocumentID = D1.DocumentID
    AND
      D2.MaxDate = D1.DateCreated
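To see how the join-on-MAX pattern behaves, including when two rows in a group share the same latest date, here is a small sketch run against an in-memory SQLite database from Python. The helper name `latest_via_max_join` and the ISO-formatted date strings are assumptions made for the example; the SQL itself mirrors the query above.

```python
import sqlite3

# Sample data from the question, with dates as ISO text (assumption).
SAMPLE_ROWS = [
    (2, 1, "S1", "2011-07-29"),
    (3, 1, "S2", "2011-07-30"),
    (6, 1, "S1", "2011-08-02"),
    (1, 2, "S1", "2011-07-28"),
    (4, 2, "S2", "2011-07-30"),
    (5, 2, "S3", "2011-08-01"),
    (6, 3, "S1", "2011-08-02"),
]

QUERY = """
SELECT d1.DocumentID, d1.Status, d1.DateCreated
FROM DocumentStatusLogs AS d1
JOIN (
    SELECT DocumentID, MAX(DateCreated) AS MaxDate
    FROM DocumentStatusLogs
    GROUP BY DocumentID
) AS d2
  ON d2.DocumentID = d1.DocumentID
 AND d2.MaxDate = d1.DateCreated
ORDER BY d1.DocumentID, d1.Status
"""


def latest_via_max_join(rows):
    """Join each row to its group's MAX(DateCreated) and keep the matches."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE DocumentStatusLogs "
        "(ID INTEGER, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
    )
    conn.executemany("INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(latest_via_max_join(SAMPLE_ROWS))
    # Add a second row for document 1 with the same latest date.
    print(latest_via_max_join(SAMPLE_ROWS + [(7, 1, "S2", "2011-08-02")]))
```

With the question's data this returns one row per document. If two rows in a group share the maximum DateCreated, both come back, so the pattern only guarantees a single row per group when the date is unique within each DocumentID.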

User · Answer

I've done some timings over the various recommendations here, and the results really depend on the size of the table involved, but the most consistent solution is using CROSS APPLY. These tests were run against SQL Server 2008-R2, using a table with 6,500 records, and another (identical schema) with 137 million records. The columns being queried are part of the primary key on the table, and the table width is very small (about 30 bytes). The times are reported by SQL Server from the actual execution plan.

    Query                                  Time for 6,500 (ms)   Time for 137M (ms)

    CROSS APPLY                                    17.9                17.9
    SELECT WHERE col = (SELECT MAX(COL)...)         6.6               854.4
    DENSE_RANK() OVER PARTITION                     6.6               907.1

I think the really amazing thing was how consistent the time was for the CROSS APPLY regardless of the number of rows involved.

User · Answer

I just learned how to use cross apply. Here's how to use it in this scenario:

    select d.DocumentID, ds.Status, ds.DateCreated
    from Documents as d
    cross apply
        (select top 1 Status, DateCreated
         from DocumentStatusLogs
         where DocumentID = d.DocumentId
         order by DateCreated desc) as ds

User · Answer

    SELECT o.*
    FROM DocumentStatusLogs o
    LEFT JOIN DocumentStatusLogs b
        ON o.DocumentID = b.DocumentID AND o.DateCreated < b.DateCreated
    WHERE b.DocumentID IS NULL;

If you want only the most recent document status by DateCreated, this returns just the top 1 row per DocumentID.

User · Answer

Try this:

    SELECT [DocumentID]
        ,[tmpRez].value('/x[2]', 'varchar(20)') AS [Status]
        ,[tmpRez].value('/x[3]', 'datetime') AS [DateCreated]
    FROM (
        SELECT [DocumentID]
            ,cast('<x>' + max(cast([ID] AS VARCHAR(10)) + '</x><x>' + [Status] + '</x><x>' + cast([DateCreated] AS VARCHAR(20))) + '</x>' AS XML) AS [tmpRez]
        FROM DocumentStatusLogs
        GROUP BY DocumentID
        ) AS [tmpQry]

User · Answer

This is one of the most easily found questions on the topic, so I wanted to give a modern answer to it (both for my reference and to help others out). By using FIRST_VALUE and OVER you can make short work of the above query:

    Select distinct DocumentID
      , first_value(status) over (partition by DocumentID order by DateCreated Desc) as Status
      , first_value(DateCreated) over (partition by DocumentID order by DateCreated Desc) as DateCreated
    From DocumentStatusLogs

This should work in SQL Server 2012 and up, which is the version that introduced FIRST_VALUE. FIRST_VALUE can be thought of as a way to accomplish SELECT TOP 1 when using an OVER clause. OVER allows grouping in the select list, so instead of writing nested subqueries (like many of the existing answers do), this does it in a more readable fashion. Hope this helps.
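For readers without a SQL Server instance at hand, the same FIRST_VALUE plus DISTINCT idea can be sketched against SQLite from Python. The assumptions here are that the bundled SQLite is version 3.25 or newer (the first release with window functions), that dates are stored as ISO text, and that the function name `latest_via_first_value` is purely illustrative.

```python
import sqlite3

# Sample data from the question, with dates as ISO text (assumption).
SAMPLE_ROWS = [
    (2, 1, "S1", "2011-07-29"),
    (3, 1, "S2", "2011-07-30"),
    (6, 1, "S1", "2011-08-02"),
    (1, 2, "S1", "2011-07-28"),
    (4, 2, "S2", "2011-07-30"),
    (5, 2, "S3", "2011-08-01"),
    (6, 3, "S1", "2011-08-02"),
]

QUERY = """
SELECT DISTINCT
    DocumentID,
    FIRST_VALUE(Status) OVER (
        PARTITION BY DocumentID ORDER BY DateCreated DESC) AS Status,
    FIRST_VALUE(DateCreated) OVER (
        PARTITION BY DocumentID ORDER BY DateCreated DESC) AS DateCreated
FROM DocumentStatusLogs
ORDER BY DocumentID
"""


def latest_via_first_value():
    """Every row gets its partition's first values; DISTINCT collapses them."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE DocumentStatusLogs "
        "(ID INTEGER, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
    )
    conn.executemany("INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)", SAMPLE_ROWS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in latest_via_first_value():
        print(row)
```

The window function is evaluated for every input row, and DISTINCT then removes the duplicates, which is why the result has one row per DocumentID.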

User · Answer

If you're worried about performance, you can also do this with MAX():

    SELECT *
    FROM DocumentStatusLogs D
    WHERE DateCreated = (SELECT MAX(DateCreated) FROM DocumentStatusLogs WHERE DocumentID = D.DocumentID)

ROW_NUMBER() requires a sort of all the rows in your SELECT statement, whereas MAX does not. This should drastically speed up your query.

User · Answer

In scenarios where you want to avoid using ROW_NUMBER(), you can also use a left join:

    select ds.DocumentID, ds.Status, ds.DateCreated
    from DocumentStatusLogs ds
    left join DocumentStatusLogs filter
        ON ds.DocumentID = filter.DocumentID
        -- Match any row that has another row that was created after it.
        AND ds.DateCreated < filter.DateCreated
    -- then filter out any rows that matched
    where filter.DocumentID is null

For the example schema, you could also use a "not in subquery", which generally compiles to the same output as the left join:

    select ds.DocumentID, ds.Status, ds.DateCreated
    from DocumentStatusLogs ds
    WHERE ds.ID NOT IN (
        -- IDs of rows that have a later row for the same document
        SELECT older.ID
        FROM DocumentStatusLogs older
        JOIN DocumentStatusLogs filter
            ON older.DocumentID = filter.DocumentID
            AND older.DateCreated < filter.DateCreated)

Note that the subquery pattern wouldn't work if the table didn't have at least one single-column unique key/constraint/index, in this case the primary key "Id".

Both of these queries tend to be more "expensive" than the ROW_NUMBER() query (as measured by Query Analyzer). However, you might encounter scenarios where they return results faster or enable other optimizations.
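The left-join anti-join can be checked quickly with an in-memory SQLite database. This is a sketch rather than a benchmark: `latest_via_anti_join` is a hypothetical helper name, the dates are stored as ISO text so that `<` compares them chronologically, and nothing here says anything about SQL Server execution plans.

```python
import sqlite3

# Sample data from the question, with dates as ISO text (assumption).
SAMPLE_ROWS = [
    (2, 1, "S1", "2011-07-29"),
    (3, 1, "S2", "2011-07-30"),
    (6, 1, "S1", "2011-08-02"),
    (1, 2, "S1", "2011-07-28"),
    (4, 2, "S2", "2011-07-30"),
    (5, 2, "S3", "2011-08-01"),
    (6, 3, "S1", "2011-08-02"),
]

QUERY = """
SELECT ds.DocumentID, ds.Status, ds.DateCreated
FROM DocumentStatusLogs AS ds
LEFT JOIN DocumentStatusLogs AS newer
    ON ds.DocumentID = newer.DocumentID
   AND ds.DateCreated < newer.DateCreated   -- pair each row with any later row
WHERE newer.DocumentID IS NULL              -- keep rows that had no later row
ORDER BY ds.DocumentID
"""


def latest_via_anti_join():
    """Return the rows for which no newer row exists in the same document."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE DocumentStatusLogs "
        "(ID INTEGER, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
    )
    conn.executemany("INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)", SAMPLE_ROWS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in latest_via_anti_join():
        print(row)
```

A row survives the WHERE clause only when the join found nothing newer for its DocumentID, which is exactly the definition of the latest row in the group.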

User · Answer

CROSS APPLY was the method I used for my solution, as it worked for me and for my clients' needs. From what I've read, it should also provide the best overall performance should their database grow substantially.

User · Answer

This solution can be used to get the TOP N most recent rows for each partition (in the example, N is 1 in the WHERE clause and the partition is doc_id):

    SELECT T.doc_id, T.status, T.date_created FROM
    (
        SELECT a.*, ROW_NUMBER() OVER (PARTITION BY doc_id ORDER BY date_created DESC) AS rnk FROM doc a
    ) T
    WHERE T.rnk = 1;

User · Answer

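I believe this can be done just like this. This might need some tweaking, but you can just select the max from the group. These answers are overkill.

    SELECT
      d.DocumentID,
      MAX(d.Status),
      MAX(d1.DateCreated)
    FROM DocumentStatusLogs d
    JOIN DocumentStatusLogs d1 USING (DocumentID)
    GROUP BY d.DocumentID
    ORDER BY MAX(d1.DateCreated) DESC

Note that MAX(d.Status) returns the highest status value in each group, which is not necessarily the status of the latest row, and that USING is not available in SQL Server.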

User · Answer

    ;WITH cte AS
    (
       SELECT *,
             ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
       FROM DocumentStatusLogs
    )
    SELECT *
    FROM cte
    WHERE rn = 1

If you expect 2 entries per day, then this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead.

As for normalised or not, it depends on whether you want to:

- maintain status in 2 places
- preserve status history

As it stands, you preserve status history. If you want the latest status in the parent table too (which is denormalisation), you'd need a trigger to maintain "status" in the parent, or drop this status history table.
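The difference between ROW_NUMBER and DENSE_RANK on tied dates is easy to see on a tiny data set. The sketch below runs against SQLite from Python (assuming SQLite 3.25 or newer for window functions). The rows are invented for the example so that document 1 has two entries on its latest day, and `top_rows` is a hypothetical helper.

```python
import sqlite3

# Invented rows: document 1 has two entries on its latest date.
TIED_ROWS = [
    (3, 1, "S2", "2011-07-30"),
    (6, 1, "S1", "2011-08-02"),
    (7, 1, "S2", "2011-08-02"),
    (4, 2, "S2", "2011-07-30"),
    (5, 2, "S3", "2011-08-01"),
    (8, 3, "S1", "2011-08-02"),
]

ALLOWED = {"ROW_NUMBER", "DENSE_RANK"}


def top_rows(ranking_function):
    """Return the rows ranked 1 within each DocumentID by the given function."""
    if ranking_function not in ALLOWED:
        raise ValueError("unsupported ranking function")
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE DocumentStatusLogs "
        "(ID INTEGER, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
    )
    conn.executemany("INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)", TIED_ROWS)
    # The function name is interpolated only after the whitelist check above.
    query = f"""
        WITH cte AS (
            SELECT *,
                   {ranking_function}() OVER (
                       PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
            FROM DocumentStatusLogs
        )
        SELECT DocumentID, Status, DateCreated
        FROM cte
        WHERE rn = 1
        ORDER BY DocumentID, Status
    """
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(top_rows("ROW_NUMBER"))  # one row per document; the tie is broken arbitrarily
    print(top_rows("DENSE_RANK"))  # both tied rows for document 1 are kept
```

ROW_NUMBER always yields exactly one row per document, while DENSE_RANK keeps every row that shares the latest date. Which of the tied rows ROW_NUMBER keeps is not defined unless the ORDER BY includes a tie-breaker such as ID.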

User · Answer

My code to select the top 1 from each group:

    select a.* from DocumentStatusLogs a
    where datecreated in (
        select top 1 datecreated
        from DocumentStatusLogs b
        where a.documentid = b.documentid
        order by datecreated desc
    )

User · Answer

Here are 3 separate approaches to the problem in hand, along with the best choices of indexing for each of those queries (please try out the indexes yourselves and look at the logical reads, elapsed time and execution plan; I have provided the suggestions from my experience with such queries, without executing them for this specific problem).

Approach 1: Using ROW_NUMBER(). If a rowstore index is not able to improve performance, you can try a nonclustered/clustered columnstore index: for queries with aggregation and grouping, and for tables that are ordered by different columns all the time, a columnstore index is usually the best choice.

    ;WITH CTE AS
        (
           SELECT   *,
                    RN = ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
           FROM     DocumentStatusLogs
        )
        SELECT  ID
            ,DocumentID
            ,Status
            ,DateCreated
        FROM    CTE
        WHERE   RN = 1;

Approach 2: Using FIRST_VALUE. The same indexing advice as for Approach 1 applies: if a rowstore index does not help, try a nonclustered/clustered columnstore index.

    SELECT  DISTINCT
        ID      = FIRST_VALUE(ID) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
        ,DocumentID
        ,Status     = FIRST_VALUE(Status) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
        ,DateCreated    = FIRST_VALUE(DateCreated) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
    FROM    DocumentStatusLogs;

Approach 3: Using CROSS APPLY. Creating a rowstore index on the DocumentStatusLogs table covering the columns used in the query should be enough to cover the query without the need for a columnstore index.

    SELECT  DISTINCT
        ID      = CA.ID
        ,DocumentID = D.DocumentID
        ,Status     = CA.Status
        ,DateCreated    = CA.DateCreated
    FROM    DocumentStatusLogs D
        CROSS APPLY (
                SELECT  TOP 1 I.*
                FROM    DocumentStatusLogs I
                WHERE   I.DocumentID = D.DocumentID
                ORDER   BY I.DateCreated DESC
            ) CA;

User · Answer

This is quite an old thread, but I thought I'd throw my two cents in just the same, as the accepted answer didn't work particularly well for me. I tried gbn's solution on a large dataset and found it to be terribly slow (45 seconds on 5 million plus records in SQL Server 2012). Looking at the execution plan, it's obvious that the issue is that it requires a SORT operation, which slows things down significantly.

Here's an alternative that I lifted from the Entity Framework that needs no SORT operation and does a non-clustered index search. This reduces the execution time down to < 2 seconds on the aforementioned record set.

    SELECT
    [Limit1].[DocumentID] AS [DocumentID],
    [Limit1].[Status] AS [Status],
    [Limit1].[DateCreated] AS [DateCreated]
    FROM   (SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM [dbo].[DocumentStatusLogs] AS [Extent1]) AS [Distinct1]
    OUTER APPLY  (SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
        FROM (SELECT
            [Extent2].[ID] AS [ID],
            [Extent2].[DocumentID] AS [DocumentID],
            [Extent2].[Status] AS [Status],
            [Extent2].[DateCreated] AS [DateCreated]
            FROM [dbo].[DocumentStatusLogs] AS [Extent2]
            WHERE ([Distinct1].[DocumentID] = [Extent2].[DocumentID])
        )  AS [Project2]
        ORDER BY [Project2].[ID] DESC) AS [Limit1]

Now I'm assuming something that isn't entirely specified in the original question, but if your table design is such that your ID column is an auto-increment ID, and DateCreated is set to the current date with each insert, then even without running my query above you could get a sizable performance boost to gbn's solution (about half the execution time) just from ordering on ID instead of ordering on DateCreated, as this will provide an identical sort order and it's a faster sort.

User · Answer

    SELECT documentid,
           status,
           datecreated
    FROM   documentstatuslogs dlogs
    WHERE  status = (SELECT status
                     FROM   documentstatuslogs
                     WHERE  documentid = dlogs.documentid
                     ORDER  BY datecreated DESC
                     LIMIT  1)

User · Answer

    SELECT * FROM
    DocumentStatusLogs JOIN (
      SELECT DocumentID, MAX(DateCreated) DateCreated
      FROM DocumentStatusLogs
      GROUP BY DocumentID
      ) max_date USING (DocumentID, DateCreated)

What database server? This code doesn't work on all of them.

Regarding the second half of your question, it seems reasonable to me to include the status as a column. You can leave DocumentStatusLogs as a log, but still store the latest info in the main table.

BTW, if you already have the DateCreated column in the Documents table, you can just join DocumentStatusLogs using that (as long as DateCreated is unique in DocumentStatusLogs).

Edit: MsSQL does not support USING, so change it to:

    ON DocumentStatusLogs.DocumentID = max_date.DocumentID AND DocumentStatusLogs.DateCreated = max_date.DateCreated
