In a dataset with multiple observations for each subject, I want to take a subset containing only the row with the maximum data value for each subject. For example, with the following dataset:
ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
Event <- c(1,1,2,1,2,1,2,2,2)
group <- data.frame(Subject=ID, pt=Value, Event=Event)
Subjects 1, 2, and 3 have maximum pt values of 5, 17, and 5, respectively.
How can I first find the largest pt value for each subject and then put that observation into another data frame? The resulting data frame should contain only the rows with the largest pt value for each subject.
Another data.table solution:
library(data.table)
setDT(group)[, head(.SD[order(-pt)], 1), by = .(Subject)]
Since {dplyr} v1.0.0 (May 2020) there is the new slice_* syntax, which supersedes top_n().
See also https://dplyr.tidyverse.org/reference/slice.html.
library(tidyverse)
ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
Event <- c(1,1,2,1,2,1,2,2,2)
group <- data.frame(Subject=ID, pt=Value, Event=Event)
group %>%
  group_by(Subject) %>%
  slice_max(pt)
#> # A tibble: 3 x 3
#> # Groups: Subject [3]
#> Subject pt Event
#> <dbl> <dbl> <dbl>
#> 1 1 5 2
#> 2 2 17 2
#> 3 3 5 2
Created on 2020-08-18 by the reprex package (v0.3.0.9001)
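One detail worth knowing about slice_max() is how it treats ties. The following is a minimal sketch, assuming dplyr 1.0.0 or later; the data frame tied is hypothetical and not part of the question. By default every row tied for the maximum is returned, and with_ties = FALSE keeps a single row per group.

```r
library(dplyr)

# Hypothetical data: subject 1 has two rows tied at the maximum pt of 5.
tied <- data.frame(Subject = c(1, 1, 1, 2, 2),
                   pt      = c(2, 5, 5, 3, 8),
                   Event   = c(1, 1, 2, 1, 2))

# Default behaviour: every row tied for the group maximum is kept.
all_ties <- tied %>%
  group_by(Subject) %>%
  slice_max(pt, n = 1) %>%
  ungroup()

# with_ties = FALSE keeps only one of the tied rows per group.
one_each <- tied %>%
  group_by(Subject) %>%
  slice_max(pt, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(all_ties)  # 3 rows: both tied rows for subject 1, one row for subject 2
nrow(one_each)  # 2 rows: exactly one per subject
```

Which behaviour is appropriate depends on whether tied observations should all be carried into the result.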
One more base R solution:
merge(aggregate(pt ~ Subject, max, data = group), group)
Subject pt Event
1 1 5 2
2 2 17 2
3 3 5 2
A dplyr solution:
library(dplyr)
ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
Event <- c(1,1,2,1,2,1,2,2,2)
group <- data.frame(Subject=ID, pt=Value, Event=Event)
group %>%
  group_by(Subject) %>%
  summarize(max.pt = max(pt))
This yields the following data frame:
Subject max.pt
1 1 5
2 2 17
3 3 5
A shorter solution using data.table:
setDT(group)[, .SD[which.max(pt)], by=Subject]
# Subject pt Event
# 1: 1 5 2
# 2: 2 17 2
# 3: 3 5 2
In base R you can use ave to get the max per group, compare it with pt, and use the resulting logical vector to subset the data.frame.
group[group$pt == ave(group$pt, group$Subject, FUN=max),]
# Subject pt Event
#3 1 5 2
#7 2 17 2
#9 3 5 2
Or do the comparison inside the function:
group[as.logical(ave(group$pt, group$Subject, FUN=function(x) x==max(x))),]
#group[ave(group$pt, group$Subject, FUN=function(x) x==max(x))==1,] #Variant
# Subject pt Event
#3 1 5 2
#7 2 17 2
#9 3 5 2
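Because this approach tests for equality with the group maximum, it keeps every row that ties for the maximum. A minimal base R sketch of that behaviour follows; the data frame tied is hypothetical, and the duplicated() step is just one possible way to reduce the result to one row per subject.

```r
# Hypothetical data: subject 1 has two rows tied at the maximum pt of 5.
tied <- data.frame(Subject = c(1, 1, 1, 2, 2),
                   pt      = c(2, 5, 5, 3, 8),
                   Event   = c(1, 1, 2, 1, 2))

# The equality test is TRUE for every row that matches its group maximum.
is_max <- tied$pt == ave(tied$pt, tied$Subject, FUN = max)
all_ties <- tied[is_max, ]

# To keep a single row per subject, drop repeated subjects afterwards.
one_each <- all_ties[!duplicated(all_ties$Subject), ]

nrow(all_ties)  # 3
nrow(one_each)  # 2
```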
An intuitive method is to use the group_by and top_n functions in dplyr:
group %>% group_by(Subject) %>% top_n(1, pt)
The result you get is
Source: local data frame [3 x 3]
Groups: Subject [3]
Subject pt Event
(dbl) (dbl) (dbl)
1 1 5 2
2 2 17 2
3 3 5 2
Another option is slice:
library(dplyr)
group %>%
  group_by(Subject) %>%
  slice(which.max(pt))
# Subject pt Event
# <dbl> <dbl> <dbl>
#1 1 5 2
#2 2 17 2
#3 3 5 2
If you only want the largest pt value for each subject, you could simply use:
pt_max <- aggregate(pt ~ Subject, group, max)
Another data.table option:
library(data.table)
setDT(group)
group[group[order(-pt), .I[1L], Subject]$V1]
Or another (less readable but slightly faster):
group[group[, rn := .I][order(Subject, -pt), {
  rn[c(1L, 1L + which(diff(Subject)>0L))]
}]]
timing code:
library(data.table)
nr <- 1e7L
ng <- nr/4L
set.seed(0L)
DT <- data.table(Subject=sample(ng, nr, TRUE), pt=1:nr)#rnorm(nr))
DT2 <- copy(DT)
microbenchmark::microbenchmark(times=3L,
  mtd0 = {a0 <- DT[DT[, .I[which.max(pt)], by=Subject]$V1]},
  mtd1 = {a1 <- DT[DT[order(-pt), .I[1L], Subject]$V1]},
  mtd2 = {a2 <- DT2[DT2[, rn := .I][
    order(Subject, -pt), rn[c(TRUE, diff(Subject)>0L)]
  ]]},
  mtd3 = {a3 <- unique(DT[order(Subject, -pt)], by="Subject")}
)
fsetequal(a0[order(Subject)], a1[order(Subject)])
#[1] TRUE
fsetequal(a0[order(Subject)], a2[, rn := NULL][order(Subject)])
#[1] TRUE
fsetequal(a0[order(Subject)], a3[order(Subject)])
#[1] TRUE
timings:
Unit: seconds
expr min lq mean median uq max neval
mtd0 3.256322 3.335412 3.371439 3.414502 3.428998 3.443493 3
mtd1 1.733162 1.748538 1.786033 1.763915 1.812468 1.861022 3
mtd2 1.136307 1.159606 1.207009 1.182905 1.242359 1.301814 3
mtd3 1.123064 1.166161 1.228058 1.209257 1.280554 1.351851 3
Another base solution
group_sorted <- group[order(group$Subject, -group$pt),]
group_sorted[!duplicated(group_sorted$Subject),]
# Subject pt Event
# 1 5 2
# 2 17 2
# 3 5 2
This orders the data frame by Subject and then by pt (descending), and then removes the rows with a duplicated Subject, so only the first (largest) row per subject remains.
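The sort-then-deduplicate idea can be wrapped in a small reusable function. This is only a sketch: max_row_per_group and its arguments are hypothetical names introduced for illustration, and negating the value column assumes that column is numeric.

```r
# Hypothetical helper: return the row with the largest `value` for each `id`.
max_row_per_group <- function(df, id, value) {
  # Sort by group, then by value in descending order within each group.
  sorted <- df[order(df[[id]], -df[[value]]), ]
  # duplicated() flags every row after the first of each group,
  # so negating it keeps only the first (largest) row per group.
  sorted[!duplicated(sorted[[id]]), ]
}

# The dataset from the question.
group <- data.frame(Subject = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
                    pt      = c(2, 3, 5, 2, 5, 8, 17, 3, 5),
                    Event   = c(1, 1, 2, 1, 2, 1, 2, 2, 2))

result <- max_row_per_group(group, "Subject", "pt")
result$pt  # 5 17 5
```

When several rows tie for the maximum, this keeps whichever tied row comes first after sorting.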
by is a version of tapply for data frames:
res <- by(group, group$Subject, FUN=function(df) df[which.max(df$pt),])
It returns an object of class by, so we convert it to a data frame:
do.call(rbind, res)
Subject pt Event
1 1 5 2
2 2 17 2
3 3 5 2
With dplyr 1.0.2 there are two ways to do this: one written out longhand, the other using across():
# create data
ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
Event <- c(1,1,2,1,2,1,2,2,2)
group <- data.frame(Subject=ID, pt=Value, Event=Event)
Written out longhand, the function to apply is max(). Note the na.rm = TRUE, which is useful when the data contain NAs, as in the closed question "Merge rows in a dataframe where the rows are disjoint and contain NAs":
group %>%
  group_by(Subject) %>%
  summarise(pt = max(pt, na.rm = TRUE),
            Event = max(Event, na.rm = TRUE))
This is fine if there are only a few columns, but if the table has many columns, across() is useful. Examples of across() often use summarise(across(starts_with..., but in this example the columns don't start with the same characters. Either the names could be changed or the positions listed:
group %>%
  group_by(Subject) %>%
  summarise(across(1:(ncol(group) - 1), max, na.rm = TRUE, .names = "{.col}"))
Note that across() excludes the grouping column, so position 1 refers to pt, the first column after Subject. That is why the range ends at ncol(group) - 1: using ncol(group) would ask for one more column than is available. The parentheses matter, because 1:ncol(group)-1 parses as (1:ncol(group)) - 1.
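One caveat with summarising each column separately is that the result is a set of per-column maxima, which need not correspond to any single row of the original data. The sketch below illustrates the difference, assuming dplyr 1.0.0 or later; the data frame dat is hypothetical, and everything() is used here instead of positions.

```r
library(dplyr)

# Hypothetical data: within each subject, the largest Event (2)
# is not on the row that holds the largest pt.
dat <- data.frame(Subject = c(1, 1, 2, 2),
                  pt      = c(2, 5, 8, 3),
                  Event   = c(2, 1, 1, 2))

# Column-wise maxima: each column is summarised independently.
col_max <- dat %>%
  group_by(Subject) %>%
  summarise(across(everything(), function(x) max(x, na.rm = TRUE)))

# Row-wise selection: the whole row with the largest pt is kept.
row_max <- dat %>%
  group_by(Subject) %>%
  slice_max(pt, n = 1, with_ties = FALSE) %>%
  ungroup()

col_max$Event  # 2 2
row_max$Event  # 1 1
```

If the goal is to keep the actual observation with the largest pt, the row-wise form is the one that matches the question.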
Using base R:
do.call(rbind, lapply(split(group, as.factor(group$Subject)), function(x) {return(x[which.max(x$pt),])}))
I wasn't sure what you wanted to do about the Event column, but if you want to keep that as well, how about:
isIDmax <- with(group, ave(pt, Subject, FUN=function(x) seq_along(x)==which.max(x)))==1
group[isIDmax, ]
# Subject pt Event
# 3 1 5 2
# 7 2 17 2
# 9 3 5 2
Here we use ave to look at the pt column for each Subject. We then determine which value is the maximum and turn that into a logical vector we can use to subset the original data.frame.
Here's another data.table solution, since which.max does not work on characters:
library(data.table)
group <- data.table(Subject=ID, pt=Value, Event=Event)
group[, .SD[order(pt, decreasing = TRUE)[1L]], by = Subject]
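The point about character columns can be sketched in base R as well, since order() sorts strings as well as numbers. The data frame grades and its grade column are hypothetical, introduced only to illustrate selecting the largest string per subject.

```r
# Hypothetical data with a character column to rank by.
grades <- data.frame(Subject = c(1, 1, 2, 2, 2),
                     grade   = c("b", "c", "a", "c", "b"),
                     stringsAsFactors = FALSE)

# For each subject, order() sorts the strings in decreasing order
# and [1] picks the position of the largest one.
pieces <- lapply(split(grades, grades$Subject), function(d) {
  d[order(d$grade, decreasing = TRUE)[1], ]
})
top <- do.call(rbind, pieces)

top$grade  # "c" "c"
```

String ordering depends on the locale in general, so results for mixed-case or accented values may differ between systems.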
Source: Stackoverflow.com