[r] How to select the row with the maximum value in each group

In a dataset with multiple observations for each subject I want to take a subset with only the maximum data value for each record. For example, with a following dataset:

ID    <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
Event <- c(1,1,2,1,2,1,2,2,2)

group <- data.frame(Subject=ID, pt=Value, Event=Event)

Subject 1, 2, and 3 have the biggest pt value of 5, 17, and 5 respectively.

How could I first find the biggest pt value for each subject, and then, put this observation in another data frame? The resulting data frame should only have the biggest pt values for each subject.

This question is related to r dataframe r-faq

The answer is


Another data.table solution:

library(data.table)
setDT(group)[, head(.SD[order(-pt)], 1), by = .(Subject)]

Since {dplyr} v1.0.0 (May 2020) there is the new slice_* syntax which supersedes top_n().

See also https://dplyr.tidyverse.org/reference/slice.html.

library(tidyverse)

ID    <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
Event <- c(1,1,2,1,2,1,2,2,2)

group <- data.frame(Subject=ID, pt=Value, Event=Event)

group %>% 
  group_by(Subject) %>% 
  slice_max(pt)
#> # A tibble: 3 x 3
#> # Groups:   Subject [3]
#>   Subject    pt Event
#>     <dbl> <dbl> <dbl>
#> 1       1     5     2
#> 2       2    17     2
#> 3       3     5     2

Created on 2020-08-18 by the reprex package (v0.3.0.9001)

Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                                      
#>  version  R version 4.0.2 Patched (2020-06-30 r78761)
#>  os       macOS Catalina 10.15.6                     
#>  system   x86_64, darwin17.0                         
#>  ui       X11                                        
#>  language (EN)                                       
#>  collate  en_US.UTF-8                                
#>  ctype    en_US.UTF-8                                
#>  tz       Europe/Berlin                              
#>  date     2020-08-18                                 
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version    date       lib source                            
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.0.0)                    
#>  backports     1.1.8      2020-06-17 [1] CRAN (R 4.0.1)                    
#>  blob          1.2.1      2020-01-20 [1] CRAN (R 4.0.0)                    
#>  broom         0.7.0      2020-07-09 [1] CRAN (R 4.0.2)                    
#>  cellranger    1.1.0      2016-07-27 [1] CRAN (R 4.0.0)                    
#>  cli           2.0.2      2020-02-28 [1] CRAN (R 4.0.0)                    
#>  colorspace    1.4-1      2019-03-18 [1] CRAN (R 4.0.0)                    
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 4.0.0)                    
#>  DBI           1.1.0      2019-12-15 [1] CRAN (R 4.0.0)                    
#>  dbplyr        1.4.4      2020-05-27 [1] CRAN (R 4.0.0)                    
#>  digest        0.6.25     2020-02-23 [1] CRAN (R 4.0.0)                    
#>  dplyr       * 1.0.1      2020-07-31 [1] CRAN (R 4.0.2)                    
#>  ellipsis      0.3.1      2020-05-15 [1] CRAN (R 4.0.0)                    
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.0.0)                    
#>  fansi         0.4.1      2020-01-08 [1] CRAN (R 4.0.0)                    
#>  forcats     * 0.5.0      2020-03-01 [1] CRAN (R 4.0.0)                    
#>  fs            1.5.0      2020-07-31 [1] CRAN (R 4.0.2)                    
#>  generics      0.0.2      2018-11-29 [1] CRAN (R 4.0.0)                    
#>  ggplot2     * 3.3.2      2020-06-19 [1] CRAN (R 4.0.1)                    
#>  glue          1.4.1      2020-05-13 [1] CRAN (R 4.0.0)                    
#>  gtable        0.3.0      2019-03-25 [1] CRAN (R 4.0.0)                    
#>  haven         2.3.1      2020-06-01 [1] CRAN (R 4.0.0)                    
#>  highr         0.8        2019-03-20 [1] CRAN (R 4.0.0)                    
#>  hms           0.5.3      2020-01-08 [1] CRAN (R 4.0.0)                    
#>  htmltools     0.5.0      2020-06-16 [1] CRAN (R 4.0.1)                    
#>  httr          1.4.2      2020-07-20 [1] CRAN (R 4.0.2)                    
#>  jsonlite      1.7.0      2020-06-25 [1] CRAN (R 4.0.2)                    
#>  knitr         1.29       2020-06-23 [1] CRAN (R 4.0.2)                    
#>  lifecycle     0.2.0      2020-03-06 [1] CRAN (R 4.0.0)                    
#>  lubridate     1.7.9      2020-06-08 [1] CRAN (R 4.0.1)                    
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 4.0.0)                    
#>  modelr        0.1.8      2020-05-19 [1] CRAN (R 4.0.0)                    
#>  munsell       0.5.0      2018-06-12 [1] CRAN (R 4.0.0)                    
#>  pillar        1.4.6      2020-07-10 [1] CRAN (R 4.0.2)                    
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.0.0)                    
#>  purrr       * 0.3.4      2020-04-17 [1] CRAN (R 4.0.0)                    
#>  R6            2.4.1      2019-11-12 [1] CRAN (R 4.0.0)                    
#>  Rcpp          1.0.5      2020-07-06 [1] CRAN (R 4.0.2)                    
#>  readr       * 1.3.1      2018-12-21 [1] CRAN (R 4.0.0)                    
#>  readxl        1.3.1      2019-03-13 [1] CRAN (R 4.0.0)                    
#>  reprex        0.3.0.9001 2020-08-13 [1] Github (tidyverse/reprex@23a3462) 
#>  rlang         0.4.7      2020-07-09 [1] CRAN (R 4.0.2)                    
#>  rmarkdown     2.3.3      2020-07-26 [1] Github (rstudio/rmarkdown@204aa41)
#>  rstudioapi    0.11       2020-02-07 [1] CRAN (R 4.0.0)                    
#>  rvest         0.3.6      2020-07-25 [1] CRAN (R 4.0.2)                    
#>  scales        1.1.1      2020-05-11 [1] CRAN (R 4.0.0)                    
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.0.2)                    
#>  stringi       1.4.6      2020-02-17 [1] CRAN (R 4.0.0)                    
#>  stringr     * 1.4.0      2019-02-10 [1] CRAN (R 4.0.0)                    
#>  styler        1.3.2.9000 2020-07-05 [1] Github (pat-s/styler@51d5200)     
#>  tibble      * 3.0.3      2020-07-10 [1] CRAN (R 4.0.2)                    
#>  tidyr       * 1.1.1      2020-07-31 [1] CRAN (R 4.0.2)                    
#>  tidyselect    1.1.0      2020-05-11 [1] CRAN (R 4.0.0)                    
#>  tidyverse   * 1.3.0      2019-11-21 [1] CRAN (R 4.0.0)                    
#>  utf8          1.1.4      2018-05-24 [1] CRAN (R 4.0.0)                    
#>  vctrs         0.3.2      2020-07-15 [1] CRAN (R 4.0.2)                    
#>  withr         2.2.0      2020-04-20 [1] CRAN (R 4.0.0)                    
#>  xfun          0.16       2020-07-24 [1] CRAN (R 4.0.2)                    
#>  xml2          1.3.2      2020-04-23 [1] CRAN (R 4.0.0)                    
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.0.0)                    
#> 
#> [1] /Users/pjs/Library/R/4.0/library
#> [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

One more base R solution:

merge(aggregate(pt ~ Subject, max, data = group), group)

  Subject pt Event
1       1  5     2
2       2 17     2
3       3  5     2

A dplyr solution:

library(dplyr)
ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
Event <- c(1,1,2,1,2,1,2,2,2)
group <- data.frame(Subject=ID, pt=Value, Event=Event)

group %>%
    group_by(Subject) %>%
    summarize(max.pt = max(pt))

This yields the following data frame:

  Subject max.pt
1       1      5
2       2     17
3       3      5

A shorter solution using data.table:

setDT(group)[, .SD[which.max(pt)], by=Subject]
#    Subject pt Event
# 1:       1  5     2
# 2:       2 17     2
# 3:       3  5     2

In base you can use ave to get max per group and compare this with pt and get a logical vector to subset the data.frame.

group[group$pt == ave(group$pt, group$Subject, FUN=max),]
#  Subject pt Event
#3       1  5     2
#7       2 17     2
#9       3  5     2

Or compare it already in the function.

group[as.logical(ave(group$pt, group$Subject, FUN=function(x) x==max(x))),]
#group[ave(group$pt, group$Subject, FUN=function(x) x==max(x))==1,] #Variant
#  Subject pt Event
#3       1  5     2
#7       2 17     2
#9       3  5     2

The most intuitive method is to use group_by and top_n function in dplyr

    group %>% group_by(Subject) %>% top_n(1, pt)

The result you get is

    Source: local data frame [3 x 3]
    Groups: Subject [3]

      Subject    pt Event
        (dbl) (dbl) (dbl)
    1       1     5     2
    2       2    17     2
    3       3     5     2

Another option is slice

library(dplyr)
group %>%
     group_by(Subject) %>%
     slice(which.max(pt))
#    Subject    pt Event
#    <dbl> <dbl> <dbl>
#1       1     5     2
#2       2    17     2
#3       3     5     2

If you want the biggest pt value for a subject, you could simply use:

   pt_max = as.data.frame(aggregate(pt~Subject, group, max))

Another data.table option:

library(data.table)
setDT(group)
group[group[order(-pt), .I[1L], Subject]$V1]

Or another (less readable but slightly faster):

group[group[, rn := .I][order(Subject, -pt), {
    rn[c(1L, 1L + which(diff(Subject)>0L))]
}]]

timing code:

library(data.table)
nr <- 1e7L
ng <- nr/4L
set.seed(0L)
DT <- data.table(Subject=sample(ng, nr, TRUE), pt=1:nr)#rnorm(nr))
DT2 <- copy(DT)


microbenchmark::microbenchmark(times=3L,
    mtd0 = {a0 <- DT[DT[, .I[which.max(pt)], by=Subject]$V1]},
    mtd1 = {a1 <- DT[DT[order(-pt), .I[1L], Subject]$V1]},
    mtd2 = {a2 <- DT2[DT2[, rn := .I][
        order(Subject, -pt), rn[c(TRUE, diff(Subject)>0L)]
    ]]},
    mtd3 = {a3 <- unique(DT[order(Subject, -pt)], by="Subject")}
)
fsetequal(a0[order(Subject)], a1[order(Subject)])
#[1] TRUE
fsetequal(a0[order(Subject)], a2[, rn := NULL][order(Subject)])
#[1] TRUE
fsetequal(a0[order(Subject)], a3[order(Subject)])
#[1] TRUE

timings:

Unit: seconds
 expr      min       lq     mean   median       uq      max neval
 mtd0 3.256322 3.335412 3.371439 3.414502 3.428998 3.443493     3
 mtd1 1.733162 1.748538 1.786033 1.763915 1.812468 1.861022     3
 mtd2 1.136307 1.159606 1.207009 1.182905 1.242359 1.301814     3
 mtd3 1.123064 1.166161 1.228058 1.209257 1.280554 1.351851     3

Another base solution

group_sorted <- group[order(group$Subject, -group$pt),]
group_sorted[!duplicated(group_sorted$Subject),]

# Subject pt Event
#       1  5     2
#       2 17     2
#       3  5     2

Order the data frame by pt (descending) and then remove rows duplicated in Subject


by is a version of tapply for data frames:

res <- by(group, group$Subject, FUN=function(df) df[which.max(df$pt),])

It returns an object of class by so we convert it to data frame:

do.call(rbind, b)
  Subject pt Event
1       1  5     2
2       2 17     2
3       3  5     2

Using dplyr 1.0.2 there are now two ways to do this, one is long hand and the other is using the verb across():

      # create data
      ID    <- c(1,1,1,2,2,2,2,3,3)
      Value <- c(2,3,5,2,5,8,17,3,5)
      Event <- c(1,1,2,1,2,1,2,2,2)
      
      group <- data.frame(Subject=ID, pt=Value, Event=Event)

Long hand the verb is max() but note the na.rm = TRUE which is useful for examples where there are NAs as in the closed question: Merge rows in a dataframe where the rows are disjoint and contain NAs:

       group %>% 
        group_by(Subject) %>% 
        summarise(pt = max(pt, na.rm = TRUE),
                  Event = max(Event, na.rm = TRUE))

This is ok if there are only a few columns but if the table has many columns across() is useful. The examples for this verb are often with summarise(across(start_with... but in this example the columns don't start with the same characters. Either they could be changed or the positions listed:

    group %>% 
        group_by(Subject) %>% 
        summarise(across(1:ncol(group)-1, max, na.rm = TRUE, .names = "{.col}"))

Note for the verb across() 1 refers to the first column after the first actual column so using ncol(group) won't work as that is too many columns (makes it position 4 rather than 3).


do.call(rbind, lapply(split(group,as.factor(group$Subject)), function(x) {return(x[which.max(x$pt),])}))

Using Base R


I wasn't sure what you wanted to do about the Event column, but if you want to keep that as well, how about

isIDmax <- with(dd, ave(Value, ID, FUN=function(x) seq_along(x)==which.max(x)))==1
group[isIDmax, ]

#   ID Value Event
# 3  1     5     2
# 7  2    17     2
# 9  3     5     2

Here we use ave to look at the "Value" column for each "ID". Then we determine which value is the maximal and then turn that into a logical vector we can use to subset the original data.frame.


Here's another data.table solution, since which.max does not work on characters

library(data.table)
group <- data.table(Subject=ID, pt=Value, Event=Event)

group[, .SD[order(pt, decreasing = TRUE) == 1], by = Subject]

Examples related to r

How to get AIC from Conway–Maxwell-Poisson regression via COM-poisson package in R? R : how to simply repeat a command? session not created: This version of ChromeDriver only supports Chrome version 74 error with ChromeDriver Chrome using Selenium How to show code but hide output in RMarkdown? remove kernel on jupyter notebook Function to calculate R2 (R-squared) in R Center Plot title in ggplot2 R ggplot2: stat_count() must not be used with a y aesthetic error in Bar graph R multiple conditions in if statement What does "The following object is masked from 'package:xxx'" mean?

Examples related to dataframe

Trying to merge 2 dataframes but get ValueError How to show all of columns name on pandas dataframe? Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Display all dataframe columns in a Jupyter Python Notebook How to convert column with string type to int form in pyspark data frame? Display/Print one column from a DataFrame of Series in Pandas Binning column with python pandas Selection with .loc in python Set value to an entire column of a pandas dataframe

Examples related to r-faq

What does "The following object is masked from 'package:xxx'" mean? What does "Error: object '<myvariable>' not found" mean? How do I deal with special characters like \^$.?*|+()[{ in my regex? What does %>% function mean in R? How to plot a function curve in R Use dynamic variable names in `dplyr` Error: unexpected symbol/input/string constant/numeric constant/SPECIAL in my code How should I deal with "package 'xxx' is not available (for R version x.y.z)" warning? How to select the row with the maximum value in each group R data formats: RData, Rda, Rds etc