[r] Select the top N values by group

This is in response to a question asked on the r-help mailing list.

Here are lots of examples of how to find top values by group using sql, so I imagine it's easy to convert that knowledge over using the R sqldf package.

An example: when mtcars is grouped by cyl, here are the top three records for each distinct value of cyl. Note that ties are excluded in this case, but it'd be nice to show some different ways to treat ties.

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb ranks
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1   2.0
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2   1.0
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1   2.0
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4   3.0
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4   1.0
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4   1.5
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4   1.5
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4   3.0

How to find the top or bottom (maximum or minimum) N records per group?

This question is related to r aggregate

The answer is


# start with the mtcars data frame (included with your installation of R)
mtcars

# pick your 'group by' variable
gbv <- 'cyl'
# IMPORTANT NOTE: you can only include one group by variable here
# ..if you need more, the `order` function below will need
# one per inputted parameter: order( x$cyl , x$am )

# choose whether you want to find the minimum or maximum
find.maximum <- FALSE

# create a simple data frame with only two columns
x <- mtcars

# order it based on 
x <- x[ order( x[ , gbv ] , decreasing = find.maximum ) , ]

# figure out the ranks of each miles-per-gallon, within cyl columns
if ( find.maximum ){
    # note the negative sign (which changes the order of mpg)
    # *and* the `rev` function, which flips the order of the `tapply` result
    x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank ) ) )
} else {
    x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank ) )
}
# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]

# look at your results
result

# done!

# but note only *two* values where cyl == 4 were kept,
# because there was a tie for third smallest, and the `rank` function gave both '3.5'
x[ x$ranks == 3.5 , ]

# ..if you instead wanted to keep all ties, you could change the
# tie-breaking behavior of the `rank` function.
# using the `min` *includes* all ties.  using `max` would *exclude* all ties
if ( find.maximum ){
    # note the negative sign (which changes the order of mpg)
    # *and* the `rev` function, which flips the order of the `tapply` result
    x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) ) )
} else {
    x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) )
}
# and there are even more options..
# see ?rank for more methods

# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]

# look at your results
result
# and notice *both* cyl == 4 and ranks == 3 were included in your results
# because of the tie-breaking behavior chosen.

I prefer @Ista solution, cause needs no extra package and is simple.
A modification of the data.table solution also solve my problem, and is more general.
My data.frame is

> str(df)
'data.frame':   579 obs. of  11 variables:
 $ trees     : num  2000 5000 1000 2000 1000 1000 2000 5000 5000 1000 ...
 $ interDepth: num  2 3 5 2 3 4 4 2 3 5 ...
 $ minObs    : num  6 4 1 4 10 6 10 10 6 6 ...
 $ shrinkage : num  0.01 0.001 0.01 0.005 0.01 0.01 0.001 0.005 0.005 0.001     ...
 $ G1        : num  0 2 2 2 2 2 8 8 8 8 ...
 $ G2        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ qx        : num  0.44 0.43 0.419 0.439 0.43 ...
 $ efet      : num  43.1 40.6 39.9 39.2 38.6 ...
 $ prec      : num  0.606 0.593 0.587 0.582 0.574 0.578 0.576 0.579 0.588 0.585 ...
 $ sens      : num  0.575 0.57 0.573 0.575 0.587 0.574 0.576 0.566 0.542 0.545 ...
 $ acu       : num  0.631 0.645 0.647 0.648 0.655 0.647 0.619 0.611 0.591 0.594 ...

The data.table solution needs order on i to do the job:

> require(data.table)
> dt1 <- data.table(df)
> dt2 = dt1[order(-efet, G1, G2), head(.SD, 3), by = .(G1, G2)]
> dt2
    G1    G2 trees interDepth minObs shrinkage        qx   efet  prec  sens   acu
 1:  0 FALSE  2000          2      6     0.010 0.4395953 43.066 0.606 0.575 0.631
 2:  0 FALSE  2000          5      1     0.005 0.4294718 37.554 0.583 0.548 0.607
 3:  0 FALSE  5000          2      6     0.005 0.4395753 36.981 0.575 0.559 0.616
 4:  2 FALSE  5000          3      4     0.001 0.4296346 40.624 0.593 0.570 0.645
 5:  2 FALSE  1000          5      1     0.010 0.4186802 39.915 0.587 0.573 0.647
 6:  2 FALSE  2000          2      4     0.005 0.4390503 39.164 0.582 0.575 0.648
 7:  8 FALSE  2000          4     10     0.001 0.4511349 38.240 0.576 0.576 0.619
 8:  8 FALSE  5000          2     10     0.005 0.4469665 38.064 0.579 0.566 0.611
 9:  8 FALSE  5000          3      6     0.005 0.4426952 37.888 0.588 0.542 0.591
10:  2  TRUE  5000          3      4     0.001 0.3812878 21.057 0.510 0.479 0.615
11:  2  TRUE  2000          3     10     0.005 0.3790536 20.127 0.507 0.470 0.608
12:  2  TRUE  1000          5      4     0.001 0.3690911 18.981 0.500 0.475 0.611
13:  8  TRUE  5000          6     10     0.010 0.2865042 16.870 0.497 0.435 0.635
14:  0  TRUE  2000          6      4     0.010 0.3192862  9.779 0.460 0.433 0.621  

By some reason, it does not order the way pointed (probably because ordering by the groups). So, another ordering is done.

> dt2[order(G1, G2)]
    G1    G2 trees interDepth minObs shrinkage        qx   efet  prec  sens   acu
 1:  0 FALSE  2000          2      6     0.010 0.4395953 43.066 0.606 0.575 0.631
 2:  0 FALSE  2000          5      1     0.005 0.4294718 37.554 0.583 0.548 0.607
 3:  0 FALSE  5000          2      6     0.005 0.4395753 36.981 0.575 0.559 0.616
 4:  0  TRUE  2000          6      4     0.010 0.3192862  9.779 0.460 0.433 0.621
 5:  2 FALSE  5000          3      4     0.001 0.4296346 40.624 0.593 0.570 0.645
 6:  2 FALSE  1000          5      1     0.010 0.4186802 39.915 0.587 0.573 0.647
 7:  2 FALSE  2000          2      4     0.005 0.4390503 39.164 0.582 0.575 0.648
 8:  2  TRUE  5000          3      4     0.001 0.3812878 21.057 0.510 0.479 0.615
 9:  2  TRUE  2000          3     10     0.005 0.3790536 20.127 0.507 0.470 0.608
10:  2  TRUE  1000          5      4     0.001 0.3690911 18.981 0.500 0.475 0.611
11:  8 FALSE  2000          4     10     0.001 0.4511349 38.240 0.576 0.576 0.619
12:  8 FALSE  5000          2     10     0.005 0.4469665 38.064 0.579 0.566 0.611
13:  8 FALSE  5000          3      6     0.005 0.4426952 37.888 0.588 0.542 0.591
14:  8  TRUE  5000          6     10     0.010 0.2865042 16.870 0.497 0.435 0.635

Just sort by whatever (mpg for example, question is not clear on this)

mt <- mtcars[order(mtcars$mpg), ]

then use the by function to get the top n rows in each group

d <- by(mt, mt["cyl"], head, n=4)

If you want the result to be a data.frame:

Reduce(rbind, d)

Edit: Handling ties is more difficult, but if all ties are desired:

by(mt, mt["cyl"], function(x) x[rank(x$mpg) %in% sort(unique(rank(x$mpg)))[1:4], ])

Another approach is to break ties based on some other information, e.g.,

mt <- mtcars[order(mtcars$mpg, mtcars$hp), ]
by(mt, mt["cyl"], head, n=4)

This seems more straightforward using data.table as it performs the sort while setting the key.

So, if I were to get the top 3 records in sort (ascending order), then,

require(data.table)
d <- data.table(mtcars, key="cyl")
d[, head(.SD, 3), by=cyl]

does it.

And if you want the descending order

d[, tail(.SD, 3), by=cyl] # Thanks @MatthewDowle

Edit: To sort out ties using mpg column:

d <- data.table(mtcars, key="cyl")
d.out <- d[, .SD[mpg %in% head(sort(unique(mpg)), 3)], by=cyl]

#     cyl  mpg  disp  hp drat    wt  qsec vs am gear carb rank
#  1:   4 22.8 108.0  93 3.85 2.320 18.61  1  1    4    1   11
#  2:   4 22.8 140.8  95 3.92 3.150 22.90  1  0    4    2    1
#  3:   4 21.5 120.1  97 3.70 2.465 20.01  1  0    3    1    8
#  4:   4 21.4 121.0 109 4.11 2.780 18.60  1  1    4    2    6
#  5:   6 18.1 225.0 105 2.76 3.460 20.22  1  0    3    1    7
#  6:   6 19.2 167.6 123 3.92 3.440 18.30  1  0    4    4    1
#  7:   6 17.8 167.6 123 3.92 3.440 18.90  1  0    4    4    2
#  8:   8 14.3 360.0 245 3.21 3.570 15.84  0  0    3    4    7
#  9:   8 10.4 472.0 205 2.93 5.250 17.98  0  0    3    4   14
# 10:   8 10.4 460.0 215 3.00 5.424 17.82  0  0    3    4    5
# 11:   8 13.3 350.0 245 3.73 3.840 15.41  0  0    3    4    3

# and for last N elements, of course it is straightforward
d.out <- d[, .SD[mpg %in% tail(sort(unique(mpg)), 3)], by=cyl]

dplyr does the trick

mtcars %>% 
arrange(desc(mpg)) %>% 
group_by(cyl) %>% slice(1:2)


 mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
2  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
3  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
4  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
5  19.2     8 400.0   175  3.08 3.845 17.05     0     0     3     2
6  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2

If there were a tie at the fourth position for mtcars$mpg then this should return all the ties:

top_mpg <- mtcars[ mtcars$mpg >= mtcars$mpg[order(mtcars$mpg, decreasing=TRUE)][4] , ]

> top_mpg
                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

Since there is a tie at the 3-4 position you can test it by changing 4 to a 3, and it still returns 4 items. This is logical indexing and you might need to add a clause that removes the NA's or wrap which() around the logical expression. It's not much more difficult to do this "by" cyl:

 Reduce(rbind,  by(mtcars, mtcars$cyl, 
        function(d) d[ d$mpg >= d$mpg[order(d$mpg, decreasing=TRUE)][4] , ]) )
#-------------
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Fiat 128          32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic       30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla    33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Lotus Europa      30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Ferrari Dino      19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Merc 450SE        16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL        17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Pontiac Firebird  19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2

Incorporating my suggestion to @Ista:

Reduce(rbind,  by(mtcars, mtcars$cyl, function(d) d[ d$mpg <= sort( d$mpg )[3] , ]) )

Since dplyr 1.0.0, the slice_max()/slice_min() functions were implemented:

mtcars %>%
 group_by(cyl) %>%
 slice_max(mpg, n = 2, with_ties = FALSE)

    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
2  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
3  21.4     6 258     110  3.08  3.22  19.4     1     0     3     1
4  21       6 160     110  3.9   2.62  16.5     0     1     4     4
5  19.2     8 400     175  3.08  3.84  17.0     0     0     3     2
6  18.7     8 360     175  3.15  3.44  17.0     0     0     3     2

The documentation on with_ties parameter:

Should ties be kept together? The default, TRUE, may return more rows than you request. Use FALSE to ignore ties, and return the first n rows.


You can write a function that splits the database by a factor, orders by another desired variable, extract the number of rows you want in each factor (category) and combine these into a database.

top<-function(x, num, c1,c2){
sorted<-x[with(x,order(x[,c1],x[,c2],decreasing=T)),]
splits<-split(sorted,sorted[,c1])
df<-lapply(splits,head,num)
do.call(rbind.data.frame,df)}

x is the dataframe;

num is the number of number of rows you would like to see;

c1 is the column number of the variable you would like to split by;

c2 is the column number of the variable you would like to rank by or handle ties.

Using the mtcars data, the function extracts the 3 heaviest cars (mtcars$wt is the 6th column) in each cylinder class (mtcars$cyl is the 2nd column)

 top(mtcars,3,2,6)
                         mpg cyl  disp  hp drat    wt  qsec vs am gear carb
 4.Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
 4.Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
 4.Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
 6.Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
 6.Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
 6.Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
 8.Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
 8.Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
 8.Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4

You can also easily get the lightest in a class by changing head in the lapply function to tail OR by removing the decreasing=T argument in the order function which will return it to its default, decreasing=F.


There are at least 4 ways to do this thing, however,each has some difference. We using u_id to group and using lift value to order/sort

1 dplyr traditional way

library(dplyr)
top10_final_subset1 = final_subset %>% arrange(desc(lift)) %>% group_by(u_id) %>% slice(1:10)

and if you switch the order of arrange(desc(lift)) and group_by(u_id) the result is essential the same.And if there is tie for equal lift value,it will slice to make sure each group has no more than 10 values, if you only have 5 lift value in the group, it will only gives you 5 results for that group.

2 dplyr topN way

library(dplyr)
top10_final_subset2 = final_subset %>% group_by(u_id) %>% top_n(10,lift)

this one if you have tie in lift value, say 15 same lift for the same u_id, you will got all 15 observations

3 data.table tail way

library(data.table)
final_subset = data.table(final_subset,key = "lift")
top10_final_subset3 = final_subset[,tail(.SD,10),,by = c("u_id")]

It has the same row numbers as the first way, however, there are some rows are different, I guess they are using diff random algorithm dealing with tie.

4 data.table .SD way

library(data.table)
top10_final_subset4 = final_subset[,.SD[order(lift,decreasing = TRUE),][1:10],by = "u_id"]

This way is the most "uniform" way,if in a group there are only 5 observation it will repeat value to make it to 10 observations and if there are ties it will still slice and only hold for 10 observations.