Select the top N values by group

59

This is in response to a question asked on the r-help mailing list.

Here are lots of examples of how to find top values by group using sql, so I imagine it's easy to convert that knowledge over using the R sqldf package.

An example: when mtcars is grouped by cyl, here are the top three records for each distinct value of cyl. Note that ties are excluded in this case, but it'd be nice to show some different ways to treat ties.

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb ranks
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1   2.0
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2   1.0
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1   2.0
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4   3.0
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4   1.0
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4   1.5
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4   1.5
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4   3.0

How to find the top or bottom (maximum or minimum) N records per group?

This question is tagged with r aggregate

~ Asked on 2013-02-10 16:59:27

The Best Answer is


51

This seems more straightforward using data.table as it performs the sort while setting the key.

So, if I were to get the top 3 records in sort (ascending order), then,

require(data.table)
d <- data.table(mtcars, key="cyl")
d[, head(.SD, 3), by=cyl]

does it.

And if you want the descending order

d[, tail(.SD, 3), by=cyl] # Thanks @MatthewDowle

Edit: To sort out ties using mpg column:

d <- data.table(mtcars, key="cyl")
d.out <- d[, .SD[mpg %in% head(sort(unique(mpg)), 3)], by=cyl]

#     cyl  mpg  disp  hp drat    wt  qsec vs am gear carb rank
#  1:   4 22.8 108.0  93 3.85 2.320 18.61  1  1    4    1   11
#  2:   4 22.8 140.8  95 3.92 3.150 22.90  1  0    4    2    1
#  3:   4 21.5 120.1  97 3.70 2.465 20.01  1  0    3    1    8
#  4:   4 21.4 121.0 109 4.11 2.780 18.60  1  1    4    2    6
#  5:   6 18.1 225.0 105 2.76 3.460 20.22  1  0    3    1    7
#  6:   6 19.2 167.6 123 3.92 3.440 18.30  1  0    4    4    1
#  7:   6 17.8 167.6 123 3.92 3.440 18.90  1  0    4    4    2
#  8:   8 14.3 360.0 245 3.21 3.570 15.84  0  0    3    4    7
#  9:   8 10.4 472.0 205 2.93 5.250 17.98  0  0    3    4   14
# 10:   8 10.4 460.0 215 3.00 5.424 17.82  0  0    3    4    5
# 11:   8 13.3 350.0 245 3.73 3.840 15.41  0  0    3    4    3

# and for last N elements, of course it is straightforward
d.out <- d[, .SD[mpg %in% tail(sort(unique(mpg)), 3)], by=cyl]

~ Answered on 2013-02-10 17:09:59


21

Just sort by whatever (mpg for example, question is not clear on this)

mt <- mtcars[order(mtcars$mpg), ]

then use the by function to get the top n rows in each group

d <- by(mt, mt["cyl"], head, n=4)

If you want the result to be a data.frame:

Reduce(rbind, d)

Edit: Handling ties is more difficult, but if all ties are desired:

by(mt, mt["cyl"], function(x) x[rank(x$mpg) %in% sort(unique(rank(x$mpg)))[1:4], ])

Another approach is to break ties based on some other information, e.g.,

mt <- mtcars[order(mtcars$mpg, mtcars$hp), ]
by(mt, mt["cyl"], head, n=4)

~ Answered on 2013-02-10 17:45:13


Most Viewed Questions: