Mean per group in a data frame

Question

I have a data frame and I need to calculate the mean per group  i e  per Month  below    Name     Month  Rate1     Rate2 Aira       1      12        23 Aira       2      18        73 Aira       3      19        45 Ben        1      53        19 Ben        2      22        87 Ben        3      19        45 Cat        1      22        87 Cat        2      67        43 Cat        3      45        32   My desired output is like below  where the values for Rate1 and Rate2 are the group means  Please disregard the value  I have made it up for the example   Name       Rate1       Rate2 Aira        23 21       12 2 Ben         45 23       43 9 Cat         33 22       32 2

User · Accepted Answer

This type of operation is exactly what aggregate was designed for   d  lt - read table text   Name     Month  Rate1     Rate2 Aira       1      12        23 Aira       2      18        73 Aira       3      19        45 Ben        1      53        19 Ben        2      22        87 Ben        3      19        45 Cat        1      22        87 Cat        2      67        43 Cat        3      45        32   header TRUE   aggregate d   3 4   list d Name   mean     Group 1    Rate1    Rate2 1    Aira 16 33333 47 00000 2     Ben 31 33333 50 33333 3     Cat 44 66667 54 00000   Here we aggregate columns 3 and 4 of data frame d  grouping by d Name  and applying the mean function     Or  using a formula interface   aggregate     Name  d -2   mean

User · Answer

I describe two ways to do this  one based on data table and the other based on reshape2 package   The data table way already has an answer  but I have tried to make it cleaner and more detailed   The data is like this     d  lt - structure list Name   structure c 1L  1L  1L  2L  2L  2L  3L   3L  3L    Label   c  Aira    Ben    Cat    class    factor         Month   c 1L  2L  3L  1L  2L  3L  1L  2L  3L   Rate1   c 12L       18L  19L  53L  22L  19L  22L  67L  45L   Rate2   c 23L  73L       45L  19L  87L  45L  87L  43L  32L     Names   c  Name    Month     Rate1    Rate2    class    data frame   row names   c NA  -9L    head d    Name Month Rate1 Rate2 1 Aira     1    12    23 2 Aira     2    18    73 3 Aira     3    19    45 4  Ben     1    53    19 5  Ben     2    22    87 6  Ben     3    19    45   library  reshape2   mym  lt - melt d  id   c  Name    res  lt - dcast mym  Name   variable  mean  res  Name Month    Rate1    Rate2  1 Aira     2 16 33333 47 00000  2  Ben     2 31 33333 50 33333  3  Cat     2 44 66667 54 00000   Using data table     At first  I convert the data frame to data table and then I group it  setDT d  d     Rate1   mean Rate1   Rate2   mean Rate2    by     Name       Name    Rate1    Rate2  1  Aira 16 33333 47 00000  2   Ben 31 33333 50 33333  3   Cat 44 66667 54 00000   There is another way of doing it by avoiding to write many argument for j in data table using a  SD        d   lapply  SD  mean   by     Name       Name Month    Rate1    Rate2  1  Aira     2 16 33333 47 00000  2   Ben     2 31 33333 50 33333  3   Cat     2 44 66667 54 00000   if we only want to have Rate1 and Rate2 then we can use the  SDcols as follows    d   lapply  SD  mean   by     Name    SDcols   3 4     Name    Rate1    Rate2  1  Aira 16 33333 47 00000  2   Ben 31 33333 50 33333  3   Cat 44 66667 54 00000

User · Answer

Or use group by  amp  summarise at from the dplyr package   library dplyr   d   gt     group by Name    gt     summarise at vars -Month   funs mean    na rm TRUE       A tibble  3 x 3   Name  Rate1 Rate2    lt fct gt   lt dbl gt   lt dbl gt  1 Aira   16 3  47 0 2 Ben    31 3  50 3 3 Cat    44 7  54 0   See  summarise at for the many ways to specify the variables to act on  Here  vars -Month  says all variables except Month

User · Answer

You could also use the generic function cbind   and lm   without the intercept   cbind lm d Rate1 -1 d Name  coef lm d Rate2 -1 d Name  coef   gt                  1        2   gt d NameAira 16 33333 47 00000  gt d NameBen  31 33333 50 33333  gt d NameCat  44 66667 54 00000

User · Answer

A third great alternative is using the package data table  which also has the class data frame  but operations like you are looking for are computed much faster   library data table  mydt  lt - structure list Name   c  Aira    Aira    Aira    Ben    Ben    Ben    Cat    Cat    Cat    Month   c 1L  2L  3L  1L  2L  3L  1L  2L  3L   Rate1   c 15 6396600443877  2 15649279424609  6 24692918928743  2 37658797276116  34 7500663272292  3 28750138697048  29 3265553981065  17 9821839334431  10 8639802575958   Rate2   c 17 1680489538369  5 84231656330206  8 54330866437461  5 88415184986176  3 02064294862551  17 2053351400752  16 9552950199166  2 56058000170089  15 7496228048122     Names   c  Name    Month    Rate1    Rate2    row names   c NA  -9L   class   c  data table    data frame      Now to take the mean of Rate1 and Rate2 for all 3 months  for each person  Name   First  decide which columns you want to take the mean of  colstoavg  lt - names mydt  3 4    Now we use lapply to take the mean over the columns we want to avg  colstoavg   mydt mean  lt - mydt  lapply  SD mean na rm TRUE  by Name  SDcols colstoavg    mydt mean    Name     Rate1     Rate2 1  Aira  8 014361 10 517891 2   Ben 13 471385  8 703377 3   Cat 19 390907 11 755166

User · Answer

You can also accomplish this using the sqldf package as shown below    library sqldf   x  lt - read table text  Name     Month  Rate1     Rate2 Aira       1      12        23                 Aira       2      18        73                 Aira       3      19        45                 Ben        1      53        19                 Ben        2      22        87                 Ben        3      19        45                 Cat        1      22        87                 Cat        2      67        43                 Cat        3      45        32   header TRUE   sqldf   select    Name    avg Rate1  as Rate1 float    avg Rate2  as Rate2 float    avg Rate1  as Rate1    avg Rate2  as Rate2 from x group by    Name        Name Rate1 float Rate2 float Rate1 Rate2  1 Aira    16 33333    47 00000    16    47  2  Ben    31 33333    50 33333    31    50  3  Cat    44 66667    54 00000    44    54   I am a recent convert to dplyr as shown in other answers  but sqldf is nice as most data analysts data scientists developers have at least some fluency in SQL  In this way  I think it tends to make for more universally readable code than dplyr or other solutions presented above    UPDATE  In responding to the comment below  I attempted to update the code as shown above  However  the behavior was not as I expected  It seems that the column definition  i e  int vs float  is only carried through when the column alias matches the original column name  When you specify a new name  the aggregate column is returned without rounding

User · Answer

Here are a variety of ways to do this in base R including an alternative aggregate approach   The examples below return means per month  which I think is what you requested   Although  the same approach could be used to return means per person  Using ave  my data  lt - read table text          Name     Month  Rate1     Rate2      Aira       1      12        23      Aira       2      18        73      Aira       3      19        45      Ben        1      53        19      Ben        2      22        87      Ben        3      19        45      Cat        1      22        87      Cat        2      67        43      Cat        3      45        32    header   TRUE  stringsAsFactors   FALSE  na strings    NA    Rate1 mean  lt - with my data  ave Rate1  Month  FUN   function x  mean x  na rm   TRUE    Rate2 mean  lt - with my data  ave Rate2  Month  FUN   function x  mean x  na rm   TRUE     my data  lt - data frame my data  Rate1 mean  Rate2 mean  my data  Using by  my data  lt - read table text          Name     Month  Rate1     Rate2      Aira       1      12        23      Aira       2      18        73      Aira       3      19        45      Ben        1      53        19      Ben        2      22        87      Ben        3      19        45      Cat        1      22        87      Cat        2      67        43      Cat        3      45        32    header   TRUE  stringsAsFactors   FALSE  na strings    NA    by month  lt - as data frame do call  quot rbind quot   by my data  my data Month  FUN   function x  colMeans x  3 4      colnames by month   lt - c  Rate1 mean    Rate2 mean   by month  lt - cbind Month   rownames by month   by month   my data  lt - merge my data  by month  by    Month   my data  Using lapply and split  my data  lt - read table text          Name     Month  Rate1     Rate2      Aira       1      12        23      Aira       2      18        73      Aira       3      19        45      Ben        1      53        19      Ben        2      22        87      Ben        3      19        45      Cat        1      22        87      Cat        2      67        43      Cat        3      45        32    header   TRUE  stringsAsFactors   FALSE  na strings    NA    ly mean  lt - lapply split my data  my data Month   function x  c Mean   colMeans x  3 4     ly mean  lt - as data frame do call  quot rbind quot   ly mean   ly mean  lt - cbind Month   rownames ly mean   ly mean   my data  lt - merge my data  ly mean  by    Month   my data  Using sapply and split  my data  lt - read table text          Name     Month  Rate1     Rate2      Aira       1      12        23      Aira       2      18        73      Aira       3      19        45      Ben        1      53        19      Ben        2      22        87      Ben        3      19        45      Cat        1      22        87      Cat        2      67        43      Cat        3      45        32    header   TRUE  stringsAsFactors   FALSE  na strings    NA   my data  sy mean  lt - t sapply split my data  my data Month   function x  colMeans x  3 4     colnames sy mean   lt - c  Rate1 mean    Rate2 mean   sy mean  lt - data frame Month   rownames sy mean   sy mean  stringsAsFactors   FALSE  my data  lt - merge my data  sy mean  by    Month   my data  Using aggregate  my data  lt - read table text          Name     Month  Rate1     Rate2      Aira       1      12        23      Aira       2      18        73      Aira       3      19        45      Ben        1      53        19      Ben        2      22        87      Ben        3      19        45      Cat        1      22        87      Cat        2      67        43      Cat        3      45        32    header   TRUE  stringsAsFactors   FALSE  na strings    NA    my summary  lt - with my data  aggregate list Rate1  Rate2   by   list Month                       FUN   function x    mon mean   mean x  na rm   TRUE        my summary  lt - do call data frame  my summary  colnames my summary   lt - c  Month    Rate1 mean    Rate2 mean   my summary  my data  lt - merge my data  my summary  by    Month   my data  EDIT  June 28  2020 Here I use aggregate to obtain the column means of an entire matrix by group where group is defined in an external vector  my group  lt - c 1 2 1 2 2 3 1 2 3 3   my data  lt - matrix c    1     2     3     4     5                        10    20    30    40    50                         2     4     6     8    10                        20    30    40    50    60                        20    18    16    14    12                      1000  1100  1200  1300  1400                         2     3     4     3     2                        50    40    30    20    10                      1001  2001  3001  4001  5001                      1000  2000  3000  4000  5000   nrow   10  ncol   5  byrow   TRUE  my data  my summary  lt - aggregate list my data   by   list my group   FUN   function x    my mean   mean x  na rm   TRUE      my summary    Group 1          X1       X2          X3       X4          X5  1       1    1 666667    3 000    4 333333    5 000    5 666667  2       2   25 000000   27 000   29 000000   31 000   33 000000  3       3 1000 333333 1700 333 2400 333333 3100 333 3800 333333

User · Answer

You can also use package plyr  which is somehow more versatile   library plyr   ddply d    Name   summarize   Rate1 mean Rate1   Rate2 mean Rate2      Name    Rate1    Rate2 1 Aira 16 33333 47 00000 2  Ben 31 33333 50 33333 3  Cat 44 66667 54 00000

[r] Mean per group in a data.frame

Examples related to r

Examples related to aggregate