Select the top N values by group

Question

This is in response to a question asked on the r-help mailing list   Here are lots of examples of how to find top values by group using sql  so I imagine it s easy to convert that knowledge over using the R sqldf package   An example  when mtcars is grouped by cyl  here are the top three records for each distinct value of cyl  Note that ties are excluded in this case  but it d be nice to show some different ways to treat ties                        mpg cyl  disp  hp drat    wt  qsec vs am gear carb ranks Toyota Corona       21 5   4 120 1  97 3 70 2 465 20 01  1  0    3    1   2 0 Volvo 142E          21 4   4 121 0 109 4 11 2 780 18 60  1  1    4    2   1 0 Valiant             18 1   6 225 0 105 2 76 3 460 20 22  1  0    3    1   2 0 Merc 280            19 2   6 167 6 123 3 92 3 440 18 30  1  0    4    4   3 0 Merc 280C           17 8   6 167 6 123 3 92 3 440 18 90  1  0    4    4   1 0 Cadillac Fleetwood  10 4   8 472 0 205 2 93 5 250 17 98  0  0    3    4   1 5 Lincoln Continental 10 4   8 460 0 215 3 00 5 424 17 82  0  0    3    4   1 5 Camaro Z28          13 3   8 350 0 245 3 73 3 840 15 41  0  0    3    4   3 0   How to find the top or bottom  maximum or minimum  N records per group

User · Answer

You can write a function that splits the database by a factor, orders by another desired variable, extract the number of rows you want in each factor (category) and combine these into a database.

top<-function(x, num, c1,c2){
sorted<-x[with(x,order(x[,c1],x[,c2],decreasing=T)),]
splits<-split(sorted,sorted[,c1])
df<-lapply(splits,head,num)
do.call(rbind.data.frame,df)}

x is the dataframe;

num is the number of number of rows you would like to see;

c1 is the column number of the variable you would like to split by;

c2 is the column number of the variable you would like to rank by or handle ties.

Using the mtcars data, the function extracts the 3 heaviest cars (mtcars$wt is the 6th column) in each cylinder class (mtcars$cyl is the 2nd column)

 top(mtcars,3,2,6)
                         mpg cyl  disp  hp drat    wt  qsec vs am gear carb
 4.Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
 4.Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
 4.Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
 6.Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
 6.Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
 6.Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
 8.Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
 8.Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
 8.Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4

You can also easily get the lightest in a class by changing head in the lapply function to tail OR by removing the decreasing=T argument in the order function which will return it to its default, decreasing=F.

User · Answer

There are at least 4 ways to do this thing  however each has some difference  We using u id to group and using lift value to order sort  1 dplyr traditional way  library dplyr  top10 final subset1   final subset   gt   arrange desc lift     gt   group by u id    gt   slice 1 10    and if you switch the order of arrange desc lift   and group by u id  the result is essential the same And if there is tie for equal lift value it will slice to make sure each group has no more than 10 values  if you only have 5 lift value in the group  it will only gives you 5 results for that group   2 dplyr topN way  library dplyr  top10 final subset2   final subset   gt   group by u id    gt   top n 10 lift    this one if you have tie in lift value  say 15 same lift for the same u id  you will got all 15 observations  3 data table tail way  library data table  final subset   data table final subset key    lift   top10 final subset3   final subset  tail  SD 10   by   c  u id      It has the same row numbers as the first way  however  there are some rows are different  I guess they are using diff random algorithm dealing with tie   4 data table  SD way  library data table  top10 final subset4   final subset   SD order lift decreasing   TRUE    1 10  by    u id     This way is the most  uniform  way if in a group there are only 5 observation it will repeat value to make it to 10 observations and if there are ties it will still slice and only hold for 10 observations

User · Answer

If there were a tie at the fourth position for mtcars mpg then this should return all the ties   top mpg  lt - mtcars  mtcars mpg  gt   mtcars mpg order mtcars mpg  decreasing TRUE   4        gt  top mpg                 mpg cyl disp  hp drat    wt  qsec vs am gear carb Fiat 128       32 4   4 78 7  66 4 08 2 200 19 47  1  1    4    1 Honda Civic    30 4   4 75 7  52 4 93 1 615 18 52  1  1    4    2 Toyota Corolla 33 9   4 71 1  65 4 22 1 835 19 90  1  1    4    1 Lotus Europa   30 4   4 95 1 113 3 77 1 513 16 90  1  1    5    2   Since there is a tie at the 3-4 position you can test it by changing 4 to a 3  and it still returns 4 items  This is logical indexing and you might need to add a clause that removes the NA s or wrap which   around the logical expression  It s not much more difficult to do this  by  cyl    Reduce rbind   by mtcars  mtcars cyl           function d  d  d mpg  gt   d mpg order d mpg  decreasing TRUE   4          -------------                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb Fiat 128          32 4   4  78 7  66 4 08 2 200 19 47  1  1    4    1 Honda Civic       30 4   4  75 7  52 4 93 1 615 18 52  1  1    4    2 Toyota Corolla    33 9   4  71 1  65 4 22 1 835 19 90  1  1    4    1 Lotus Europa      30 4   4  95 1 113 3 77 1 513 16 90  1  1    5    2 Mazda RX4         21 0   6 160 0 110 3 90 2 620 16 46  0  1    4    4 Mazda RX4 Wag     21 0   6 160 0 110 3 90 2 875 17 02  0  1    4    4 Hornet 4 Drive    21 4   6 258 0 110 3 08 3 215 19 44  1  0    3    1 Ferrari Dino      19 7   6 145 0 175 3 62 2 770 15 50  0  1    5    6 Hornet Sportabout 18 7   8 360 0 175 3 15 3 440 17 02  0  0    3    2 Merc 450SE        16 4   8 275 8 180 3 07 4 070 17 40  0  0    3    3 Merc 450SL        17 3   8 275 8 180 3 07 3 730 17 60  0  0    3    3 Pontiac Firebird  19 2   8 400 0 175 3 08 3 845 17 05  0  0    3    2   Incorporating my suggestion to  Ista   Reduce rbind   by mtcars  mtcars cyl  function d  d  d mpg  lt   sort  d mpg   3

User · Answer

I prefer  Ista solution  cause needs no extra package and is simple  A modification of the data table solution also solve my problem  and is more general  My data frame is     gt  str df   data frame     579 obs  of  11 variables     trees       num  2000 5000 1000 2000 1000 1000 2000 5000 5000 1000        interDepth  num  2 3 5 2 3 4 4 2 3 5        minObs      num  6 4 1 4 10 6 10 10 6 6        shrinkage   num  0 01 0 001 0 01 0 005 0 01 0 01 0 001 0 005 0 005 0 001            G1          num  0 2 2 2 2 2 8 8 8 8        G2          logi  FALSE FALSE FALSE FALSE FALSE FALSE        qx          num  0 44 0 43 0 419 0 439 0 43        efet        num  43 1 40 6 39 9 39 2 38 6        prec        num  0 606 0 593 0 587 0 582 0 574 0 578 0 576 0 579 0 588 0 585        sens        num  0 575 0 57 0 573 0 575 0 587 0 574 0 576 0 566 0 542 0 545        acu         num  0 631 0 645 0 647 0 648 0 655 0 647 0 619 0 611 0 591 0 594       The data table solution needs order on i to do the job      gt  require data table   gt  dt1  lt - data table df   gt  dt2   dt1 order -efet  G1  G2   head  SD  3   by     G1  G2    gt  dt2     G1    G2 trees interDepth minObs shrinkage        qx   efet  prec  sens   acu  1   0 FALSE  2000          2      6     0 010 0 4395953 43 066 0 606 0 575 0 631  2   0 FALSE  2000          5      1     0 005 0 4294718 37 554 0 583 0 548 0 607  3   0 FALSE  5000          2      6     0 005 0 4395753 36 981 0 575 0 559 0 616  4   2 FALSE  5000          3      4     0 001 0 4296346 40 624 0 593 0 570 0 645  5   2 FALSE  1000          5      1     0 010 0 4186802 39 915 0 587 0 573 0 647  6   2 FALSE  2000          2      4     0 005 0 4390503 39 164 0 582 0 575 0 648  7   8 FALSE  2000          4     10     0 001 0 4511349 38 240 0 576 0 576 0 619  8   8 FALSE  5000          2     10     0 005 0 4469665 38 064 0 579 0 566 0 611  9   8 FALSE  5000          3      6     0 005 0 4426952 37 888 0 588 0 542 0 591 10   2  TRUE  5000          3      4     0 001 0 3812878 21 057 0 510 0 479 0 615 11   2  TRUE  2000          3     10     0 005 0 3790536 20 127 0 507 0 470 0 608 12   2  TRUE  1000          5      4     0 001 0 3690911 18 981 0 500 0 475 0 611 13   8  TRUE  5000          6     10     0 010 0 2865042 16 870 0 497 0 435 0 635 14   0  TRUE  2000          6      4     0 010 0 3192862  9 779 0 460 0 433 0 621     By some reason  it does not order the way pointed  probably because ordering by the groups   So  another ordering is done      gt  dt2 order G1  G2       G1    G2 trees interDepth minObs shrinkage        qx   efet  prec  sens   acu  1   0 FALSE  2000          2      6     0 010 0 4395953 43 066 0 606 0 575 0 631  2   0 FALSE  2000          5      1     0 005 0 4294718 37 554 0 583 0 548 0 607  3   0 FALSE  5000          2      6     0 005 0 4395753 36 981 0 575 0 559 0 616  4   0  TRUE  2000          6      4     0 010 0 3192862  9 779 0 460 0 433 0 621  5   2 FALSE  5000          3      4     0 001 0 4296346 40 624 0 593 0 570 0 645  6   2 FALSE  1000          5      1     0 010 0 4186802 39 915 0 587 0 573 0 647  7   2 FALSE  2000          2      4     0 005 0 4390503 39 164 0 582 0 575 0 648  8   2  TRUE  5000          3      4     0 001 0 3812878 21 057 0 510 0 479 0 615  9   2  TRUE  2000          3     10     0 005 0 3790536 20 127 0 507 0 470 0 608 10   2  TRUE  1000          5      4     0 001 0 3690911 18 981 0 500 0 475 0 611 11   8 FALSE  2000          4     10     0 001 0 4511349 38 240 0 576 0 576 0 619 12   8 FALSE  5000          2     10     0 005 0 4469665 38 064 0 579 0 566 0 611 13   8 FALSE  5000          3      6     0 005 0 4426952 37 888 0 588 0 542 0 591 14   8  TRUE  5000          6     10     0 010 0 2865042 16 870 0 497 0 435 0 635

User · Answer

Since dplyr 1 0 0  the slice max   slice min   functions were implemented   mtcars   gt    group by cyl    gt    slice max mpg  n   2  with ties   FALSE       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb    lt dbl gt   lt dbl gt   lt dbl gt   lt dbl gt   lt dbl gt   lt dbl gt   lt dbl gt   lt dbl gt   lt dbl gt   lt dbl gt   lt dbl gt  1  33 9     4  71 1    65  4 22  1 84  19 9     1     1     4     1 2  32 4     4  78 7    66  4 08  2 2   19 5     1     1     4     1 3  21 4     6 258     110  3 08  3 22  19 4     1     0     3     1 4  21       6 160     110  3 9   2 62  16 5     0     1     4     4 5  19 2     8 400     175  3 08  3 84  17 0     0     0     3     2 6  18 7     8 360     175  3 15  3 44  17 0     0     0     3     2   The documentation on with ties parameter      Should ties be kept together  The default  TRUE  may return more rows   than you request  Use FALSE to ignore ties  and return the first n   rows

User · Answer

dplyr does the trick  mtcars   gt    arrange desc mpg     gt    group by cyl    gt   slice 1 2     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb    lt dbl gt   lt dbl gt   lt dbl gt   lt dbl gt   lt dbl gt   lt dbl gt   lt dbl gt   lt dbl gt   lt dbl gt   lt dbl gt   lt dbl gt  1  33 9     4  71 1    65  4 22 1 835 19 90     1     1     4     1 2  32 4     4  78 7    66  4 08 2 200 19 47     1     1     4     1 3  21 4     6 258 0   110  3 08 3 215 19 44     1     0     3     1 4  21 0     6 160 0   110  3 90 2 620 16 46     0     1     4     4 5  19 2     8 400 0   175  3 08 3 845 17 05     0     0     3     2 6  18 7     8 360 0   175  3 15 3 440 17 02     0     0     3     2

User · Answer

Just sort by whatever  mpg for example  question is not clear on this   mt  lt - mtcars order mtcars mpg       then use the by function to get the top n rows in each group  d  lt - by mt  mt  cyl    head  n 4    If you want the result to be a data frame   Reduce rbind  d    Edit  Handling ties is more difficult  but if all ties are desired   by mt  mt  cyl    function x  x rank x mpg   in  sort unique rank x mpg    1 4        Another approach is to break ties based on some other information  e g     mt  lt - mtcars order mtcars mpg  mtcars hp     by mt  mt  cyl    head  n 4

User · Answer

start with the mtcars data frame  included with your installation of R  mtcars    pick your  group by  variable gbv  lt -  cyl    IMPORTANT NOTE  you can only include one group by variable here     if you need more  the  order  function below will need   one per inputted parameter  order  x cyl   x am      choose whether you want to find the minimum or maximum find maximum  lt - FALSE    create a simple data frame with only two columns x  lt - mtcars    order it based on  x  lt - x  order  x    gbv     decreasing   find maximum          figure out the ranks of each miles-per-gallon  within cyl columns if   find maximum          note the negative sign  which changes the order of mpg         and  the  rev  function  which flips the order of the  tapply  result     x ranks  lt - unlist  rev  tapply  -x mpg   x    gbv     rank         else       x ranks  lt - unlist  tapply  x mpg   x    gbv     rank         now just subset it based on the rank column result  lt - x  x ranks  lt   3        look at your results result    done     but note only  two  values where cyl    4 were kept    because there was a tie for third smallest  and the  rank  function gave both  3 5  x  x ranks    3 5          if you instead wanted to keep all ties  you could change the   tie-breaking behavior of the  rank  function    using the  min   includes  all ties   using  max  would  exclude  all ties if   find maximum          note the negative sign  which changes the order of mpg         and  the  rev  function  which flips the order of the  tapply  result     x ranks  lt - unlist  rev  tapply  -x mpg   x    gbv     rank   ties method    min          else       x ranks  lt - unlist  tapply  x mpg   x    gbv     rank   ties method    min          and there are even more options     see  rank for more methods    now just subset it based on the rank column result  lt - x  x ranks  lt   3        look at your results result   and notice  both  cyl    4 and ranks    3 were included in your results   because of the tie-breaking behavior chosen

User · Answer

This seems more straightforward using data table as it performs the sort while setting the key   So  if I were to get the top 3 records in sort  ascending order   then   require data table  d  lt - data table mtcars  key  cyl   d   head  SD  3   by cyl    does it   And if you want the descending order  d   tail  SD  3   by cyl    Thanks  MatthewDowle   Edit  To sort out ties using mpg column    d  lt - data table mtcars  key  cyl   d out  lt - d    SD mpg  in  head sort unique mpg    3    by cyl         cyl  mpg  disp  hp drat    wt  qsec vs am gear carb rank    1    4 22 8 108 0  93 3 85 2 320 18 61  1  1    4    1   11    2    4 22 8 140 8  95 3 92 3 150 22 90  1  0    4    2    1    3    4 21 5 120 1  97 3 70 2 465 20 01  1  0    3    1    8    4    4 21 4 121 0 109 4 11 2 780 18 60  1  1    4    2    6    5    6 18 1 225 0 105 2 76 3 460 20 22  1  0    3    1    7    6    6 19 2 167 6 123 3 92 3 440 18 30  1  0    4    4    1    7    6 17 8 167 6 123 3 92 3 440 18 90  1  0    4    4    2    8    8 14 3 360 0 245 3 21 3 570 15 84  0  0    3    4    7    9    8 10 4 472 0 205 2 93 5 250 17 98  0  0    3    4   14   10    8 10 4 460 0 215 3 00 5 424 17 82  0  0    3    4    5   11    8 13 3 350 0 245 3 73 3 840 15 41  0  0    3    4    3    and for last N elements  of course it is straightforward d out  lt - d    SD mpg  in  tail sort unique mpg    3    by cyl

[r] Select the top N values by group

Examples related to r

Examples related to aggregate