Compute mean and standard deviation by group for multiple variables in a data frame

Question

Edit: This question was originally titled "Long to wide data reshaping in R".

I'm just learning R and trying to find ways to apply it to help out others in my life. As a test case, I'm working on reshaping some data, and I'm having trouble following the examples I've found online. What I'm starting with looks like this:

    ID  Obs_1   Obs_2   Obs_3
    1   43      48      37
    1   27      29      22
    1   36      32      40
    2   33      38      36
    2   29      32      27
    2   32      31      35
    2   25      28      24
    3   45      47      42
    3   38      40      36

And what I want to end up with will look like this:

    ID  Obs_1_mean  Obs_1_std_dev   Obs_2_mean  Obs_2_std_dev
    1   x           x               x           x
    2   x           x               x           x
    3   x           x               x           x

And so forth. What I'm unsure of is whether I need additional information in my long-form data, or what. I imagine that the math part (finding the mean and standard deviations) will be the easy part, but I haven't been able to find a way that seems to work to reshape the data correctly to start in on that process.

Thanks very much for any help.

User · Answer

There is a helpful function in the psych package. You should try the following implementation:

    psych::describeBy(data$dependentvariable, group = data$groupingvariable)

User · Answer

This is an aggregation problem, not a reshaping problem as the question originally suggested: we wish to aggregate each column into a mean and standard deviation by ID. There are many packages that handle such problems. In base R it can be done using aggregate like this (assuming DF is the input data frame):

    ag <- aggregate(. ~ ID, DF, function(x) c(mean = mean(x), sd = sd(x)))

Note 1: A commenter pointed out that ag is a data frame in which some columns are matrices. Although that may seem strange at first, it actually simplifies access. ag has the same number of columns as the input DF. Its first column ag[[1]] is ID, and the ith column of the remainder, ag[[i+1]] (or equivalently ag[-1][[i]]), is the matrix of statistics for the ith input observation column. The jth statistic of the ith observation column is therefore ag[[i+1]][, j], which can also be written as ag[-1][[i]][, j].

On the other hand, suppose there are k statistic columns for each observation column in the input (k = 2 in the question). If we flatten the output, then to access the jth statistic of the ith observation column we must use the more complex ag[[k*(i-1)+j+1]] or equivalently ag[-1][[k*(i-1)+j]].

For example, compare the simplicity of the first expression with the second:

    ag[-1][[2]]
    ##        mean      sd
    ## [1,] 36.333 10.2144
    ## [2,] 32.250  4.1932
    ## [3,] 43.500  4.9497

    ag_flat <- do.call("data.frame", ag) # flatten
    ag_flat[-1][, 2 * (2-1) + 1:2]
    ##   Obs_2.mean Obs_2.sd
    ## 1     36.333  10.2144
    ## 2     32.250   4.1932
    ## 3     43.500   4.9497

Note 2: The input in reproducible form is:

    Lines <- "ID  Obs_1   Obs_2   Obs_3
    1   43      48      37
    1   27      29      22
    1   36      32      40
    2   33      38      36
    2   29      32      27
    2   32      31      35
    2   25      28      24
    3   45      47      42
    3   38      40      36"
    DF <- read.table(text = Lines, header = TRUE)
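The index arithmetic in Note 1 can be checked mechanically. The following is a minimal sketch in base R, assuming the input from Note 2; the names `n_obs` and `index_ok` are illustrative and not part of the original answer.

```r
# Input data from Note 2 of the answer above.
Lines <- "ID  Obs_1   Obs_2   Obs_3
1   43      48      37
1   27      29      22
1   36      32      40
2   33      38      36
2   29      32      27
2   32      31      35
2   25      28      24
3   45      47      42
3   38      40      36"
DF <- read.table(text = Lines, header = TRUE)

# Nested result: one matrix column (mean, sd) per observation column.
ag <- aggregate(. ~ ID, DF, function(x) c(mean = mean(x), sd = sd(x)))

# Flattened result: one plain column per (observation, statistic) pair.
ag_flat <- do.call("data.frame", ag)

k <- 2                  # statistics per observation column (mean, sd)
n_obs <- ncol(DF) - 1   # number of observation columns (hypothetical helper name)

# Compare nested access ag[[i + 1]][, j] with flat access
# ag_flat[[k * (i - 1) + j + 1]] for every i and j.
index_ok <- TRUE
for (i in seq_len(n_obs)) {
  for (j in seq_len(k)) {
    nested <- unname(ag[[i + 1]][, j])
    flat <- ag_flat[[k * (i - 1) + j + 1]]
    index_ok <- index_ok && isTRUE(all.equal(nested, flat))
  }
}

index_ok        # TRUE when both indexing schemes agree
names(ag_flat)  # flattened column names, e.g. Obs_1.mean, Obs_1.sd
```

The flattened data frame is closer to the layout requested in the question, at the cost of the less direct column index.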

User · Answer

Here's another take on the data.table answers, using @Carson's data, that's a bit more readable (and also a little faster, because of using lapply instead of sapply):

    library(data.table)
    set.seed(1)
    dt = data.table(ID=c(1:3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))

    dt[, c(mean = lapply(.SD, mean), sd = lapply(.SD, sd)), by = ID]
    #   ID mean.Obs_1 mean.Obs_2 mean.Obs_3  sd.Obs_1  sd.Obs_2  sd.Obs_3
    #1:  1  0.4854187 -0.3238542  0.7410611 1.1108687 0.2885969 0.1067961
    #2:  2  0.4171586 -0.2397030  0.2041125 0.2875411 1.8732682 0.3438338
    #3:  3 -0.3601052  0.8195368 -0.4087233 0.8105370 0.3829833 1.4705692
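The column names in that output (mean.Obs_1, sd.Obs_1, and so on) come from how base R's c() combines named lists. A small sketch without data.table, assuming a hypothetical data frame `sub` holding the question's ID 1 rows for two columns:

```r
# Hypothetical subset: the ID == 1 rows of Obs_1 and Obs_2 from the question.
sub <- data.frame(Obs_1 = c(43, 27, 36), Obs_2 = c(48, 29, 32))

# lapply() returns a named list with one element per column.
means <- lapply(sub, mean)
sds <- lapply(sub, sd)

# c() with named arguments prefixes each element name with the argument name,
# producing one flat list, which is what data.table turns into columns.
combined <- c(mean = means, sd = sds)

names(combined)
```

Because the result is a flat list with one element per statistic and column, each element becomes its own column in the grouped result.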

User · Answer

I add the dplyr solution.

    set.seed(1)
    df <- data.frame(ID=rep(1:3, 3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))

    library(dplyr)
    df %>% group_by(ID) %>% summarise_each(funs(mean, sd))

    #      ID Obs_1_mean Obs_2_mean Obs_3_mean  Obs_1_sd  Obs_2_sd  Obs_3_sd
    #   (int)      (dbl)      (dbl)      (dbl)     (dbl)     (dbl)     (dbl)
    # 1     1  0.4854187 -0.3238542  0.7410611 1.1108687 0.2885969 0.1067961
    # 2     2  0.4171586 -0.2397030  0.2041125 0.2875411 1.8732682 0.3438338
    # 3     3 -0.3601052  0.8195368 -0.4087233 0.8105370 0.3829833 1.4705692

User · Answer

Here is probably the simplest way to go about it (with a reproducible example):

    library(plyr)
    df <- data.frame(ID=rep(1:3, 3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))
    ddply(df, .(ID), summarize, Obs_1_mean=mean(Obs_1), Obs_1_std_dev=sd(Obs_1),
          Obs_2_mean=mean(Obs_2), Obs_2_std_dev=sd(Obs_2))

    #   ID  Obs_1_mean Obs_1_std_dev  Obs_2_mean Obs_2_std_dev
    # 1  1 -0.13994642     0.8258445 -0.15186380     0.4251405
    # 2  2  1.49982393     0.2282299  0.50816036     0.5812907
    # 3  3 -0.09269806     0.6115075 -0.01943867     1.3348792

EDIT: The following approach saves you a lot of typing when dealing with many columns.

    ddply(df, .(ID), colwise(mean))

    #   ID      Obs_1      Obs_2      Obs_3
    # 1  1 -0.3748831  0.1787371  1.0749142
    # 2  2 -1.0363973  0.0157575 -0.8826969
    # 3  3  1.0721708 -1.1339571 -0.5983944

    ddply(df, .(ID), colwise(sd))

    #   ID     Obs_1     Obs_2     Obs_3
    # 1  1 0.8732498 0.4853133 0.5945867
    # 2  2 0.2978193 1.0451626 0.5235572
    # 3  3 0.4796820 0.7563216 1.4404602
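The colwise approach yields one table of means and one of standard deviations. Combining two such tables into the single wide layout from the question could look roughly like this. The sketch uses base R's aggregate() as a stand-in for the two ddply() calls (an assumption, so that it runs without plyr) and the question's data; `means`, `sds`, `obs`, and `col_order` are illustrative names.

```r
# The question's data.
DF <- read.table(text = "ID  Obs_1   Obs_2   Obs_3
1   43      48      37
1   27      29      22
1   36      32      40
2   33      38      36
2   29      32      27
2   32      31      35
2   25      28      24
3   45      47      42
3   38      40      36", header = TRUE)

# Stand-ins for ddply(df, .(ID), colwise(mean)) and colwise(sd):
# one row per ID, one column per observation variable.
means <- aggregate(. ~ ID, DF, mean)
sds <- aggregate(. ~ ID, DF, sd)

# Join on ID; the suffixes disambiguate the identically named columns.
wide <- merge(means, sds, by = "ID", suffixes = c("_mean", "_std_dev"))

# Interleave so each variable's mean is followed by its standard deviation.
obs <- names(DF)[-1]
col_order <- c("ID", as.vector(rbind(paste0(obs, "_mean"),
                                     paste0(obs, "_std_dev"))))
wide <- wide[, col_order]

wide
```

The same merge() call should apply to the two ddply() results directly, since they share the ID column and the observation column names.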

User · Answer

The updated dplyr solution, as of 2020:

1. summarise_each() is deprecated as of dplyr 0.7.0.
2. funs() is deprecated as of dplyr 0.8.0.

    ag.dplyr <- DF %>% group_by(ID) %>% summarise(across(.cols = everything(), list(mean = mean, sd = sd)))

User · Answer

There are a few different ways to go about it. reshape2 is a helpful package. Personally, I like using data.table.

Below is a step-by-step. If myDF is your data frame:

    library(data.table)
    DT <- data.table(myDF)

    DT

    # this will get you your mean and SD's for each column
    DT[, sapply(.SD, function(x) list(mean=mean(x), sd=sd(x)))]

    # adding a `by` argument will give you the groupings
    DT[, sapply(.SD, function(x) list(mean=mean(x), sd=sd(x))), by=ID]

    # If you would like to round the values:
    DT[, sapply(.SD, function(x) list(mean=round(mean(x), 3), sd=round(sd(x), 3))), by=ID]

    # If we want to add names to the columns
    wide <- setnames(DT[, sapply(.SD, function(x) list(mean=round(mean(x), 3), sd=round(sd(x), 3))), by=ID],
                     c("ID", sapply(names(DT)[-1], paste0, c(".mean", ".SD"))))

    wide

       ID Obs_1.mean Obs_1.SD Obs_2.mean Obs_2.SD Obs_3.mean Obs_3.SD
    1:  1     35.333    8.021     36.333   10.214       33.0    9.644
    2:  2     29.750    3.594     32.250    4.193       30.5    5.916
    3:  3     41.500    4.950     43.500    4.950       39.0    4.243

Also, this may or may not be helpful:

    > DT[, sapply(.SD, summary), .SDcols=names(DT)[-1]]
            Obs_1 Obs_2 Obs_3
    Min.    25.00 28.00 22.00
    1st Qu. 29.00 31.00 27.00
    Median  33.00 32.00 36.00
    Mean    34.22 36.11 33.22
    3rd Qu. 38.00 40.00 37.00
    Max.    45.00 48.00 42.00
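For comparison, the ungrouped summary at the end of that answer has a close base R analogue. This is a sketch under the assumption that the data is a plain data frame `DF` holding the question's values; `overall` is an illustrative name.

```r
# The question's data as a plain data frame.
DF <- read.table(text = "ID  Obs_1   Obs_2   Obs_3
1   43      48      37
1   27      29      22
1   36      32      40
2   33      38      36
2   29      32      27
2   32      31      35
2   25      28      24
3   45      47      42
3   38      40      36", header = TRUE)

# Drop the ID column, then summarise every remaining column.
# sapply() simplifies the per-column summaries into a matrix with
# one row per statistic and one column per variable.
overall <- sapply(DF[-1], summary)

round(overall, 2)
```

Like the data.table call it mirrors, this summarises each column over all rows and ignores the ID grouping.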
