Summarizing multiple columns with dplyr

Question

I'm struggling a bit with the dplyr syntax. I have a data frame with different variables and one grouping variable. Now I want to calculate the mean for each column within each group, using dplyr in R.

    df <- data.frame(
        a = sample(1:5, n, replace = TRUE),
        b = sample(1:5, n, replace = TRUE),
        c = sample(1:5, n, replace = TRUE),
        d = sample(1:5, n, replace = TRUE),
        grp = sample(1:3, n, replace = TRUE)
    )
    df %>% group_by(grp) %>% summarise(mean(a))

This gives me the mean for column "a" for each group indicated by "grp".

My question is: is it possible to get the means for each column within each group at once? Or do I have to repeat df %>% group_by(grp) %>% summarise(mean(a)) for each column?

What I would like to have is something like

    df %>% group_by(grp) %>% summarise(mean(a:d)) # "mean(a:d)" does not work

User · Answer

You can simply pass more arguments to summarise:

    df %>% group_by(grp) %>% summarise(mean(a), mean(b), mean(c), mean(d))

    Source: local data frame [3 x 5]

      grp  mean(a)  mean(b)  mean(c) mean(d)
    1   1 2.500000 3.500000 2.000000     3.0
    2   2 3.800000 3.200000 3.200000     2.8
    3   3 3.666667 3.333333 2.333333     3.0
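Naming each argument gives tidier column headers than the default `mean(a)` style. The following is a minimal sketch, assuming a small made-up data frame; the values and the `mean_*` names are illustrative and not taken from the question.

```r
library(dplyr)

# Hypothetical example data: four numeric columns and one grouping column
df <- data.frame(
  a   = c(1, 2, 3, 4, 5, 6),
  b   = c(2, 4, 6, 8, 10, 12),
  c   = c(1, 1, 2, 2, 3, 3),
  d   = c(5, 4, 3, 2, 1, 0),
  grp = c(1, 1, 2, 2, 3, 3)
)

# One named argument per column; each name becomes an output column name
res_named <- df %>%
  group_by(grp) %>%
  summarise(
    mean_a = mean(a),
    mean_b = mean(b),
    mean_c = mean(c),
    mean_d = mean(d)
  )

print(res_named)
```

This stays readable for a handful of columns, but every column still has to be typed out by hand.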

User · Answer

The dplyr package contains summarise_all for this aim:

    library(dplyr)
    # summarise_all was replaced with the summarise(across(...)) syntax in dplyr >= 1.0.0
    df %>% group_by(grp) %>% summarise(across(everything(), list(mean)))
    #> # A tibble: 3 x 5
    #>     grp     a     b     c     d
    #>   <int> <dbl> <dbl> <dbl> <dbl>
    #> 1     1  3.08  2.98  2.98  2.91
    #> 2     2  3.03  3.04  2.97  2.87
    #> 3     3  2.85  2.95  2.95  3.06

Alternatively, the purrrlyr package provides the same functionality:

    library(purrrlyr)
    df %>% slice_rows("grp") %>% dmap(mean)
    #> # A tibble: 3 x 5
    #>     grp     a     b     c     d
    #>   <int> <dbl> <dbl> <dbl> <dbl>
    #> 1     1  3.08  2.98  2.98  2.91
    #> 2     2  3.03  3.04  2.97  2.87
    #> 3     3  2.85  2.95  2.95  3.06

Also don't forget about data.table (use keyby to sort groups):

    library(data.table)
    setDT(df)[, lapply(.SD, mean), keyby = grp]
    #>    grp        a        b        c        d
    #> 1:   1 3.079412 2.979412 2.979412 2.914706
    #> 2:   2 3.029126 3.038835 2.967638 2.873786
    #> 3:   3 2.854701 2.948718 2.951567 3.062678

Let's try to compare performance.

    library(dplyr)
    library(purrrlyr)
    library(data.table)
    library(bench)
    set.seed(123)
    n <- 10000
    df <- data.frame(
      a = sample(1:5, n, replace = TRUE),
      b = sample(1:5, n, replace = TRUE),
      c = sample(1:5, n, replace = TRUE),
      d = sample(1:5, n, replace = TRUE),
      grp = sample(1:3, n, replace = TRUE)
    )
    dt <- setDT(df)
    mark(
      dplyr = df %>% group_by(grp) %>% summarise(across(everything(), list(mean))),
      purrrlyr = df %>% slice_rows("grp") %>% dmap(mean),
      data.table = dt[, lapply(.SD, mean), keyby = grp],
      check = FALSE
    )
    #> # A tibble: 3 x 6
    #>   expression      min   median `itr/sec` mem_alloc `gc/sec`
    #>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
    #> 1 dplyr        2.81ms   2.85ms      328         NA     17.3
    #> 2 purrrlyr     7.96ms   8.04ms      123         NA     24.5
    #> 3 data.table 596.33µs 707.91µs     1409         NA     10.3
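The `across()` call can also take an explicit column range or a type predicate instead of `everything()`, and its `.names` argument controls the output names. This is a sketch under assumed data; the data frame and the `mean_` prefix are made up for illustration, and it assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical example data (not the data from the answer above)
df <- data.frame(
  a   = c(1, 2, 3, 4, 5, 6),
  b   = c(2, 4, 6, 8, 10, 12),
  c   = c(1, 1, 2, 2, 3, 3),
  d   = c(5, 4, 3, 2, 1, 0),
  grp = c(1, 1, 2, 2, 3, 3)
)

# Select the range a:d and prefix each result column with "mean_"
res_range <- df %>%
  group_by(grp) %>%
  summarise(across(a:d, mean, .names = "mean_{.col}"))

# Select every numeric column; grouping columns are excluded automatically
res_numeric <- df %>%
  group_by(grp) %>%
  summarise(across(where(is.numeric), mean))

print(res_range)
print(res_numeric)
```

Passing a single function keeps the original column names unless `.names` says otherwise, so `res_numeric` has columns `grp`, `a`, `b`, `c`, and `d`.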

User · Answer

All the examples are great, but I figure I'd add one more to show how working in a "tidy" format simplifies things. Right now the data frame is in "wide" format, meaning the variables "a" through "d" are represented in columns. To get to a "tidy" (or long) format, you can use gather() from the tidyr package, which shifts the variables in columns "a" through "d" into rows. Then you use the group_by() and summarize() functions to get the mean of each group. If you want to present the data in a wide format, just tack on an additional call to the spread() function.

    library(tidyverse)

    # Create reproducible df
    set.seed(101)
    df <- tibble(a   = sample(1:5, 10, replace = TRUE),
                 b   = sample(1:5, 10, replace = TRUE),
                 c   = sample(1:5, 10, replace = TRUE),
                 d   = sample(1:5, 10, replace = TRUE),
                 grp = sample(1:3, 10, replace = TRUE))

    # Convert to tidy format using gather
    df %>%
        gather(key = variable, value = value, a:d) %>%
        group_by(grp, variable) %>%
        summarize(mean = mean(value)) %>%
        spread(variable, mean)
    #> Source: local data frame [3 x 5]
    #> Groups: grp [3]
    #>
    #>     grp        a     b        c        d
    #>   <int>    <dbl> <dbl>    <dbl>    <dbl>
    #> 1     1 3.000000   3.5 3.250000 3.250000
    #> 2     2 1.666667   4.0 4.666667 2.666667
    #> 3     3 3.333333   3.0 2.333333 2.333333
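In tidyr 1.0.0 and later, `pivot_longer()` and `pivot_wider()` supersede `gather()` and `spread()`. The same long-then-wide approach could be sketched as follows; the data frame here is made up for illustration and is not the one generated above.

```r
library(dplyr)
library(tidyr)

# Hypothetical example data in wide format
df <- data.frame(
  a   = c(1, 2, 3, 4, 5, 6),
  b   = c(2, 4, 6, 8, 10, 12),
  c   = c(1, 1, 2, 2, 3, 3),
  d   = c(5, 4, 3, 2, 1, 0),
  grp = c(1, 1, 2, 2, 3, 3)
)

# Wide -> long: one row per (grp, variable, value), then one mean per pair
res_long <- df %>%
  pivot_longer(cols = a:d, names_to = "variable", values_to = "value") %>%
  group_by(grp, variable) %>%
  summarise(mean = mean(value), .groups = "drop")

# Long -> wide again for presentation: one column per original variable
res_wide <- res_long %>%
  pivot_wider(names_from = variable, values_from = mean)

print(res_wide)
```

Keeping `res_long` around is often useful in its own right, for example as input to plotting code that expects one observation per row.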

User · Answer

We can summarize with summarise_at, summarise_all and summarise_if in dplyr 0.7.4. Multiple columns and functions are set through the .vars and .funs arguments, as in the code below. The name on the left-hand side of each .funs entry becomes the suffix of the summarised column names. In dplyr 0.7.4, summarise_each (and mutate_each) is already deprecated, so we should not use these functions.

    options(scipen = 100, dplyr.width = Inf, dplyr.print_max = Inf)

    library(dplyr)
    packageVersion("dplyr")
    # [1] '0.7.4'

    set.seed(123)
    df <- data.frame(
      a = sample(1:5, 10, replace = TRUE),
      b = sample(1:5, 10, replace = TRUE),
      c = sample(1:5, 10, replace = TRUE),
      d = sample(1:5, 10, replace = TRUE),
      grp = as.character(sample(1:3, 10, replace = TRUE)) # For convenience, specify character type
    )

    df %>% group_by(grp) %>%
      summarise_each(.vars = letters[1:4],
                     .funs = c(mean = "mean"))

    # `summarise_each()` is deprecated.
    # Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
    # To map `funs` over a selection of variables, use `summarise_at()`
    # Error: Strings must match column names. Unknown columns: mean

You should change to the following code. The following calls all give the same result.

    # summarise_at
    df %>% group_by(grp) %>%
      summarise_at(.vars = letters[1:4],
                   .funs = c(mean = "mean"))

    df %>% group_by(grp) %>%
      summarise_at(.vars = names(.)[1:4],
                   .funs = c(mean = "mean"))

    df %>% group_by(grp) %>%
      summarise_at(.vars = vars(a, b, c, d),
                   .funs = c(mean = "mean"))

    # summarise_all
    df %>% group_by(grp) %>%
      summarise_all(.funs = c(mean = "mean"))

    # summarise_if
    df %>% group_by(grp) %>%
      summarise_if(.predicate = function(x) is.numeric(x),
                   .funs = funs(mean = "mean"))

    # A tibble: 3 x 5
    #   grp   a_mean b_mean c_mean d_mean
    #   <chr>  <dbl>  <dbl>  <dbl>  <dbl>
    # 1 1       2.80   3.00    3.6   3.00
    # 2 2       4.25   2.75    4.0   3.75
    # 3 3       3.00   5.00    1.0   2.00

You can also have multiple functions.

    df %>% group_by(grp) %>%
      summarise_at(.vars = letters[1:2],
                   .funs = c(Mean = "mean", Sd = "sd"))

    # A tibble: 3 x 5
    #   grp   a_Mean b_Mean      a_Sd     b_Sd
    #   <chr>  <dbl>  <dbl>     <dbl>    <dbl>
    # 1 1       2.80   3.00 1.4832397 1.870829
    # 2 2       4.25   2.75 0.9574271 1.258306
    # 3 3       3.00   5.00        NA       NA
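In dplyr 1.0.0 and later, the multi-function case can be written with `across()` and a named list, where each list name becomes the column suffix. This is a minimal sketch, assuming a made-up data frame rather than the seeded one above.

```r
library(dplyr)

# Hypothetical example data: two rows per group so that sd() is defined
df <- data.frame(
  a   = c(1, 2, 3, 4, 5, 6),
  b   = c(2, 4, 6, 8, 10, 12),
  grp = c("1", "1", "2", "2", "3", "3")
)

# A named list of functions yields columns named <column>_<list name>
res_multi <- df %>%
  group_by(grp) %>%
  summarise(across(a:b, list(Mean = mean, Sd = sd)))

print(res_multi)
```

Note that `across()` orders the output by column first and function second (`a_Mean`, `a_Sd`, `b_Mean`, `b_Sd`). Also, `sd()` returns `NA` for a group with a single row, which is a plausible explanation for `NA` values in a standard-deviation column.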

User · Answer

For completeness: with dplyr v0.2, ddply with colwise (both from the plyr package) will also do this:

    > ddply(df, .(grp), colwise(mean))
      grp        a    b        c        d
    1   1 4.333333 4.00 1.000000 2.000000
    2   2 2.000000 2.75 2.750000 2.750000
    3   3 3.000000 4.00 4.333333 3.666667

but it is slower, at least in this case:

    > microbenchmark(ddply(df, .(grp), colwise(mean)),
                     df %>% group_by(grp) %>% summarise_each(funs(mean)))
    Unit: milliseconds
                                                    expr      min       lq     mean
                        ddply(df, .(grp), colwise(mean)) 3.278002 3.331744 3.533835
     df %>% group_by(grp) %>% summarise_each(funs(mean)) 1.001789 1.031528 1.109337
       median       uq      max neval
     3.353633 3.378089 7.592209   100
     1.121954 1.133428 2.292216   100
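In the same spirit of completeness, base R can produce the per-group column means without any extra package. The sketch below uses `aggregate()` with a formula on a made-up data frame; it is offered as a cross-check, not as a claim about relative speed.

```r
# Hypothetical example data
df <- data.frame(
  a   = c(1, 2, 3, 4, 5, 6),
  b   = c(2, 4, 6, 8, 10, 12),
  c   = c(1, 1, 2, 2, 3, 3),
  d   = c(5, 4, 3, 2, 1, 0),
  grp = c(1, 1, 2, 2, 3, 3)
)

# ". ~ grp" means: summarise every other column, grouped by grp
res_base <- aggregate(. ~ grp, data = df, FUN = mean)

print(res_base)
```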
