[r] How to select the rows with maximum values in each group with dplyr?

I would like to select a row with maximum value in each group with dplyr.

First, I generate some random data to illustrate the question:

set.seed(1)
df <- expand.grid(list(A = 1:5, B = 1:5, C = 1:5))
df$value <- runif(nrow(df))

In plyr, I could use a custom function to select this row.

library(plyr)
ddply(df, .(A, B), function(x) x[which.max(x$value),])

In dplyr, I am using the code below. It returns the maximum value for each group, but not the row that holds it, so column C is lost.

library(dplyr)
df %>% group_by(A, B) %>%
    summarise(max = max(value))

How could I achieve this? Thanks for any suggestion.

sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.2  plyr_1.8.1

loaded via a namespace (and not attached):
[1] assertthat_0.1.0.99 parallel_3.1.0      Rcpp_0.11.1        
[4] tools_3.1.0        

This question is related to r dplyr plyr greatest-n-per-group

The answer is


This more verbose solution gives greater control over what happens when the maximum value is duplicated (in this example, it picks one of the tied rows at random):

library(dplyr)
df %>% group_by(A, B) %>%
  mutate(the_rank  = rank(-value, ties.method = "random")) %>%
  filter(the_rank == 1) %>% select(-the_rank)

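A shorter alternative uses slice() with which.max(), which keeps the first row holding the maximum in each group:

df %>% group_by(A, B) %>% slice(which.max(value))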
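
To make the handling of ties concrete, here is a minimal sketch on a small made-up data frame (the names ties, g and id are hypothetical, not part of the question). It compares slice(which.max(value)), which keeps the first maximum only, with filter(value == max(value)), which keeps every tied row.

```r
library(dplyr)

# Hypothetical toy data: group "a" has two rows tied at the maximum value 2.
ties <- data.frame(
  g     = c("a", "a", "a", "b", "b"),
  id    = 1:5,
  value = c(2, 2, 1, 3, 0)
)

# which.max() returns the index of the first maximum: one row per group.
first_max <- ties %>%
  group_by(g) %>%
  slice(which.max(value)) %>%
  ungroup()

# Comparing against max() keeps every row tied at the maximum.
all_max <- ties %>%
  group_by(g) %>%
  filter(value == max(value)) %>%
  ungroup()

print(first_max)
print(all_max)
```

Which behaviour is right depends on whether you need exactly one row per group or all rows that reach the maximum.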

For me, it helped to count the number of rows per group and store the count table in a new object. You can then filter for the largest count within each level of the first grouping variable. For example:

count_table  <- df %>%
                group_by(A, B) %>%
                count() %>%
                arrange(A, desc(n))

count_table %>% 
    group_by(A) %>%
    filter(n == max(n))

or

count_table %>% 
    group_by(A) %>%
    top_n(1, n)
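
As a sketch only, assuming dplyr 1.0.0 or later is installed: in those versions top_n() is marked as superseded and slice_max() covers the same case. The object name top_counts is made up for this example.

```r
library(dplyr)

# Same example data as in the question.
set.seed(1)
df <- expand.grid(list(A = 1:5, B = 1:5, C = 1:5))
df$value <- runif(nrow(df))

# count(A, B) is shorthand for group_by(A, B) followed by a row count.
count_table <- df %>% count(A, B)

# Keep the (A, B) rows with the largest count n within each A.
# The first n is the count column, the second is the number of rows to keep.
top_counts <- count_table %>%
  group_by(A) %>%
  slice_max(n, n = 1) %>%
  ungroup()

print(top_counts)
```

With this particular data every (A, B) pair has five rows, so all pairs are tied and slice_max() keeps them all, just as filter(n == max(n)) does.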

More generally, I think you might want to get the top rows after sorting within each group.

When you want the maximum of a single value, you are essentially sorting by one column. However, it is often useful to sort hierarchically by several columns (for example, a date column and then a time-of-day column).

# Answering the question of getting row with max "value".
df %>% 
  # Within each grouping of A and B values.
  group_by( A, B) %>% 
  # Sort rows in descending order by "value" column.
  arrange( desc(value) ) %>% 
  # Pick the top 1 value
  slice(1) %>% 
  # Remember to ungroup in case you want to do further work without grouping.
  ungroup()

# Answering an extension of the question of 
# getting row with the max value of the lowest "C".
df %>% 
  # Within each grouping of A and B values.
  group_by( A, B) %>% 
  # Sort rows in ascending order by C, and then within that by 
  # descending order by "value" column.
  arrange( C, desc(value) ) %>% 
  # Pick the one top row based on the sort
  slice(1) %>% 
  # Remember to ungroup in case you want to do further work without grouping.
  ungroup()
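
The date and time-of-day case mentioned above could look like the sketch below. The events data frame and its column names are invented for illustration. It also passes .by_group = TRUE, because arrange() otherwise ignores the grouping and sorts the whole table.

```r
library(dplyr)

# Hypothetical log data; times are zero-padded strings so they sort correctly.
events <- data.frame(
  user = c("u1", "u1", "u1", "u2", "u2"),
  date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-02",
                   "2020-01-03", "2020-01-01")),
  time = c("09:00", "08:30", "17:45", "12:00", "23:59"),
  stringsAsFactors = FALSE
)

latest <- events %>%
  group_by(user) %>%
  # Sort inside each user: newest date first, then latest time within a date.
  arrange(desc(date), desc(time), .by_group = TRUE) %>%
  # Keep the first row of each group, i.e. the most recent event.
  slice(1) %>%
  ungroup()

print(latest)
```

Without .by_group = TRUE the result is the same here, since slice(1) still sees each group's rows in sorted order, but the explicit form makes the intent clearer.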

You can use top_n

df %>% group_by(A, B) %>% top_n(n=1)

This will rank by the last column (value) and return the top n=1 rows.

At the time of writing, you can't change this default without causing an error (see https://github.com/hadley/dplyr/issues/426).
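
For comparison, here is a sketch that names the ranking column explicitly through the wt argument of top_n(). It assumes a dplyr release newer than the one in the question, where this call works. The object name by_name is made up.

```r
library(dplyr)

# Same example data as in the question.
set.seed(1)
df <- expand.grid(list(A = 1:5, B = 1:5, C = 1:5))
df$value <- runif(nrow(df))

# Rank by "value" explicitly instead of relying on it being the last column.
by_name <- df %>%
  group_by(A, B) %>%
  top_n(n = 1, wt = value) %>%
  ungroup()

print(by_name)
```

Naming the column keeps the result stable if columns are later added or reordered.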

