Aggregate a dataframe on a given column and display another column

Question

I have a dataframe in R of the following form:

    > head(data)
      Group Score Info
    1     1     1    a
    2     1     2    b
    3     1     3    c
    4     2     4    d
    5     2     3    e
    6     2     1    f

I would like to aggregate it following the Score column using the max function:

    > aggregate(data$Score, list(data$Group), max)
      Group.1         x
    1       1         3
    2       2         4

But I would also like to display the Info column associated with the maximum value of the Score column for each group. I have no idea how to do this. My desired output would be:

      Group.1         x        y
    1       1         3        c
    2       2         4        d

Any hint?

User · Answer

Here is a solution using the plyr package.

The following line of code essentially tells ddply to first group your data by Group, and then within each group return the subset where Score equals the maximum score in that group:

    library(plyr)
    ddply(data, .(Group), function(x) x[x$Score == max(x$Score), ])

      Group Score Info
    1     1     3    c
    2     2     4    d

And, as @SachaEpskamp points out, this can be further simplified to:

    ddply(data, .(Group), function(x) x[which.max(x$Score), ])

Note that which.max returns only the index of the first maximum, so this version returns exactly one row per group. The == version above returns every tied row, if there are any.
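To see how the two subsetting forms behave when a group contains tied maxima, here is a small base R sketch. The `tied` data frame is made up for illustration and is not part of the question's data.

```r
# Hypothetical data: group 1 has two rows tied for the top score
tied <- data.frame(Group = c(1, 1, 2),
                   Score = c(3, 3, 4),
                   Info  = c("a", "b", "c"))

# Look at group 1 only, as ddply would hand it to the function
g1 <- tied[tied$Group == 1, ]

# which.max() returns the position of the first maximum only
first_only <- g1[which.max(g1$Score), ]

# The == comparison keeps every row that reaches the maximum
all_ties <- g1[g1$Score == max(g1$Score), ]

nrow(first_only)
nrow(all_ties)
```

Which form is preferable depends on whether you want one representative row per group or all tied rows.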

User · Answer

I don't have a high enough reputation to comment on Gavin Simpson's answer, but I wanted to warn that there seems to be a difference in the default treatment of missing values between the standard syntax and the formula syntax for aggregate.

    # Create some data with missing values
    a <- data.frame(day = rep(1, 5), hour = c(1, 2, 3, 3, 4), val = c(1, NA, 3, NA, 5))
    a
      day hour val
    1   1    1   1
    2   1    2  NA
    3   1    3   3
    4   1    3  NA
    5   1    4   5

    # Standard syntax
    aggregate(a$val, by = list(day = a$day, hour = a$hour), mean, na.rm = TRUE)
      day hour   x
    1   1    1   1
    2   1    2 NaN
    3   1    3   3
    4   1    4   5

    # Formula syntax. Note the index for hour == 2 has been silently dropped.
    aggregate(val ~ hour + day, data = a, mean, na.rm = TRUE)
      hour day val
    1    1   1   1
    2    3   1   3
    3    4   1   5
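The difference comes from the formula method's `na.action` argument, which defaults to `na.omit` and drops incomplete rows before `FUN` is ever called. As a minimal sketch, assuming you want the formula interface to keep those rows, you can pass `na.action = na.pass` and let `na.rm = TRUE` handle the missing values inside `mean`:

```r
# Same data as above
a <- data.frame(day = rep(1, 5),
                hour = c(1, 2, 3, 3, 4),
                val = c(1, NA, 3, NA, 5))

# na.pass keeps rows with NA; na.rm = TRUE is forwarded to mean()
kept <- aggregate(val ~ hour + day, data = a, FUN = mean,
                  na.rm = TRUE, na.action = na.pass)
kept
```

With this call the group for hour 2 is retained, and its mean is NaN because the group contains no non-missing values.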

User · Answer

A base R solution is to combine the output of aggregate() with a merge() step. I find the formula interface to aggregate() a little more useful than the standard interface, partly because the names on the output are nicer, so I'll use that.

The aggregate() step is

    maxs <- aggregate(Score ~ Group, data = dat, FUN = max)

and the merge() step is simply

    merge(maxs, dat)

This gives us the desired output:

    R> maxs <- aggregate(Score ~ Group, data = dat, FUN = max)
    R> merge(maxs, dat)
      Group Score Info
    1     1     3    c
    2     2     4    d

You could, of course, stick this into a one-liner (the intermediary step was more for exposition):

    merge(aggregate(Score ~ Group, data = dat, FUN = max), dat)

The main reason I used the formula interface is that it returns a data frame with the correct names for the merge step; these are the names of the columns from the original data set dat. We need the output of aggregate() to have the correct names so that merge() knows which columns in the original and aggregated data frames match.

The standard interface gives odd names, whichever way you call it:

    R> aggregate(dat$Score, list(dat$Group), max)
      Group.1 x
    1       1 3
    2       2 4
    R> with(dat, aggregate(Score, list(Group), max))
      Group.1 x
    1       1 3
    2       2 4

We can use merge() on those outputs, but we need to do more work telling R which columns match up.
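For completeness, here is a sketch of what that extra work could look like, using the `by.x` and `by.y` arguments of merge() to pair up the differently named columns. The data frame is rebuilt from the question so the snippet runs on its own.

```r
# Example data from the question
dat <- data.frame(Group = c(1, 1, 1, 2, 2, 2),
                  Score = c(1, 2, 3, 4, 3, 1),
                  Info  = c("a", "b", "c", "d", "e", "f"))

# Standard interface: the result columns are named Group.1 and x
maxs <- aggregate(dat$Score, list(dat$Group), max)

# Tell merge() which columns correspond in each data frame
res <- merge(maxs, dat,
             by.x = c("Group.1", "x"),
             by.y = c("Group", "Score"))
res
```

The key columns in the result keep the names from the first data frame, so this also reproduces the Group.1 and x headers shown in the question.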

User · Answer

First, you split the data using split:

    split(z, z$Group)

Then, for each chunk, select the row with the max Score:

    lapply(split(z, z$Group), function(chunk) chunk[which.max(chunk$Score), ])

Finally, reduce back to a data.frame by calling rbind through do.call:

    do.call(rbind, lapply(split(z, z$Group), function(chunk) chunk[which.max(chunk$Score), ]))

Result:

      Group Score Info
    1     1     3    c
    2     2     4    d

One line, no magic spells, fast, result has good names.

User · Answer

To add to Gavin's answer: prior to the merge, it is possible to get aggregate to use proper names when not using the formula interface:

    aggregate(data[, "score", drop = FALSE], list(group = data$group), mean)
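Applied to the question's data, the same idea could look like the sketch below. It assumes the column names Group, Score and Info from the question and uses max rather than mean. Passing a one-column data frame (`drop = FALSE`) and a named list is what makes aggregate() keep the original names.

```r
# Example data from the question
dat <- data.frame(Group = c(1, 1, 1, 2, 2, 2),
                  Score = c(1, 2, 3, 4, 3, 1),
                  Info  = c("a", "b", "c", "d", "e", "f"))

# drop = FALSE keeps Score as a named column; the named list names the grouping column
maxs <- aggregate(dat[, "Score", drop = FALSE], list(Group = dat$Group), max)
names(maxs)

# The names now match dat, so merge() can find the common columns itself
res <- merge(maxs, dat)
res
```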

User · Answer

The plyr package can be used for this. With the ddply() function you can split a data frame on one or more columns, apply a function to each piece, and return a data frame. With the summarize() function you can then use the columns of the split data frame as variables to build the new data frame:

    dat <- read.table(textConnection('Group Score Info
    1     1     1    a
    2     1     2    b
    3     1     3    c
    4     2     4    d
    5     2     3    e
    6     2     1    f'))

    library("plyr")

    ddply(dat, .(Group), summarize,
          Max = max(Score),
          Info = Info[which.max(Score)])
      Group Max Info
    1     1   3    c
    2     2   4    d

User · Answer

This is how I basically think of the problem:

    my.df <- data.frame(group = rep(c(1, 2), each = 3),
                        score = runif(6), info = letters[1:6])
    my.agg <- with(my.df, aggregate(score, list(group), max))
    my.df.split <- with(my.df, split(x = my.df, f = group))
    my.agg$info <- unlist(lapply(my.df.split, FUN = function(x) {
        x[which(x$score == max(x$score)), "info"]
    }))

    > my.agg
      Group.1         x info
    1       1 0.9344336    a
    2       2 0.7699763    e

User · Answer

A late answer, but an approach using data.table:

    library(data.table)
    DT <- data.table(dat)

    DT[, .SD[which.max(Score), ], by = Group]

Or, if it is possible to have more than one equally highest score:

    DT[, .SD[which(Score == max(Score)), ], by = Group]

Noting that (from ?data.table):

    .SD is a data.table containing the Subset of x's Data for each group, excluding the group column(s)
