Aggregate / summarize multiple variables per group (e.g. sum, mean)

Question

From a data frame, is there an easy way to aggregate (sum, mean, max, etc.) multiple variables simultaneously?

Below are some sample data:

    library(lubridate)
    days  = 365 * 2
    date  = seq(as.Date("2000-01-01"), length = days, by = "day")
    year  = year(date)
    month = month(date)
    x1 = cumsum(rnorm(days, 0.05))
    x2 = cumsum(rnorm(days, 0.05))
    df1 = data.frame(date, year, month, x1, x2)

I would like to simultaneously aggregate the x1 and x2 variables from the df1 data frame by year and month. The following code aggregates the x1 variable, but is it also possible to simultaneously aggregate the x2 variable?

    ### aggregate variables by year and month
    df2 = aggregate(x1 ~ year + month, data = df1, sum, na.rm = TRUE)
    head(df2)

Any suggestions would be greatly appreciated.

User · Answer

With the dplyr package, you can use the summarise_all, summarise_at or summarise_if functions to aggregate multiple variables simultaneously. For the example dataset you can do this as follows:

    library(dplyr)

    # summarising all non-grouping variables
    # (date is dropped first, because sum() is not defined for Date columns)
    df2 <- df1 %>% select(-date) %>% group_by(year, month) %>% summarise_all(sum)

    # summarising a specific set of non-grouping variables
    df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(x1, x2), sum)
    df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(-date), sum)

    # summarising a specific set of non-grouping variables using select_helpers
    # see ?select_helpers for more options
    df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(starts_with('x')), sum)
    df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(matches('.*[0-9]')), sum)

    # summarising a specific set of non-grouping variables based on condition (class)
    df2 <- df1 %>% group_by(year, month) %>% summarise_if(is.numeric, sum)

The result of the latter two options:

        year month        x1         x2
       <dbl> <dbl>     <dbl>      <dbl>
    1   2000     1 -73.58134  -92.78595
    2   2000     2 -57.81334 -152.36983
    3   2000     3 122.68758  153.55243
    4   2000     4 450.24980  285.56374
    5   2000     5 678.37867  384.42888
    6   2000     6 792.68696  530.28694
    7   2000     7 908.58795  452.31222
    8   2000     8 710.69928  719.35225
    9   2000     9 725.06079  914.93687
    10  2000    10 770.60304  863.39337
    # ... with 14 more rows

Note: summarise_each is deprecated in favor of summarise_all, summarise_at and summarise_if.

You can also use the recast function from the reshape2 package:

    library(reshape2)
    recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))

which will give you the same result.

User · Answer

With dplyr version >= 1.0.0, we can also use summarise to apply a function to multiple columns with across:

    library(dplyr)
    df1 %>%
        group_by(year, month) %>%
        summarise(across(starts_with('x'), sum))
    # A tibble: 24 x 4
    # Groups:   year [2]
        year month     x1     x2
       <dbl> <dbl>  <dbl>  <dbl>
     1  2000     1   11.7  52.9
     2  2000     2  -74.1 126.
     3  2000     3 -132.  149.
     4  2000     4 -130.    4.12
     5  2000     5  -91.6 -55.9
     6  2000     6  179.   73.7
     7  2000     7   95.0 409.
     8  2000     8  255.  283.
     9  2000     9  489.  331.
    10  2000    10  719.  305.
    # ... with 14 more rows
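If several statistics are needed at once, across() also accepts a named list of functions. The following is a minimal sketch, assuming dplyr >= 1.0.0; `toy` is a small made-up data frame standing in for df1 so that the result is deterministic.

```r
library(dplyr)

# Hypothetical stand-in for df1 (illustrative values only)
toy <- data.frame(
  year  = c(2000, 2000, 2000, 2001),
  month = c(1, 1, 2, 1),
  x1    = c(1, 2, 3, 4),
  x2    = c(10, 20, 30, 40)
)

# A named list of functions yields one output column per (column, function) pair
res <- toy %>%
  group_by(year, month) %>%
  summarise(across(c(x1, x2), list(sum = sum, mean = mean)), .groups = "drop")

names(res)
```

With the default naming scheme the new columns are called x1_sum, x1_mean, x2_sum and x2_mean.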

User · Answer

Where is this year() function from?

You could also use the reshape2 package for this task:

    require(reshape2)
    df_melt <- melt(df1, id = c("date", "year", "month"))
    dcast(df_melt, year + month ~ variable, sum)
    #  year month         x1           x2
    1  2000     1  -80.83405 -224.9540159
    2  2000     2 -223.76331 -288.2418017
    3  2000     3 -188.83930 -481.5601913
    4  2000     4 -197.47797 -473.7137420
    5  2000     5 -259.07928 -372.4563522

User · Answer

Interestingly, base R aggregate's data.frame method is not showcased here (above, the formula interface is used), so for completeness:

    aggregate(
      x = df1[c("x1", "x2")],
      by = df1[c("year", "month")],
      FUN = sum, na.rm = TRUE
    )

More generic use of aggregate's data.frame method:

Since we are providing a data.frame as x and a list (a data.frame is also a list) as by, this is very useful if we need to use it in a dynamic manner. For example, switching to other columns to be aggregated or to aggregate by is very simple, and it also works with custom-made aggregation functions.

For example like so:

    colsToAggregate <- c("x1")
    aggregateBy <- c("year", "month")
    dummyaggfun <- function(v, na.rm = TRUE) {
      c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))
    }

    aggregate(df1[colsToAggregate], by = df1[aggregateBy], FUN = dummyaggfun)
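One thing to watch with a function like dummyaggfun that returns more than one value: aggregate stores the results as a matrix inside a single column. A minimal sketch of flattening that into ordinary columns follows; `toy` is a made-up stand-in for df1 and the do.call step is one common idiom, not the only option.

```r
# Hypothetical stand-in for df1 (illustrative values only)
toy <- data.frame(
  year  = c(2000, 2000, 2000, 2001),
  month = c(1, 1, 2, 1),
  x1    = c(1, 2, 3, 4)
)

dummyaggfun <- function(v, na.rm = TRUE) {
  c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))
}

agg <- aggregate(toy["x1"], by = toy[c("year", "month")], FUN = dummyaggfun)

# x1 is a two-column matrix, so the data frame itself has only 3 columns
ncol(agg)
is.matrix(agg$x1)

# Passing the columns to data.frame() expands the matrix into separate columns
flat <- do.call(data.frame, agg)
names(flat)
```

After flattening, the statistics are available as the regular columns x1.sum and x1.mean.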

User · Answer

For a more flexible and faster approach to data aggregation, check out the collap function in the collapse R package, available on CRAN:

    library(collapse)

    # Simple aggregation with one function
    head(collap(df1, x1 + x2 ~ year + month, fmean))

      year month        x1        x2
    1 2000     1 -1.217984  4.008534
    2 2000     2 -1.117777 11.460301
    3 2000     3  5.552706  8.621904
    4 2000     4  4.238889 22.382953
    5 2000     5  3.124566 39.982799
    6 2000     6 -1.415203 48.252283

    # Customized: Aggregate columns with different functions
    head(collap(df1, x1 + x2 ~ year + month,
          custom = list(fmean = c("x1", "x2"), fmedian = "x2")))

      year month  fmean.x1  fmean.x2 fmedian.x2
    1 2000     1 -1.217984  4.008534   3.266968
    2 2000     2 -1.117777 11.460301  11.563387
    3 2000     3  5.552706  8.621904   8.506329
    4 2000     4  4.238889 22.382953  20.796205
    5 2000     5  3.124566 39.982799  39.919145
    6 2000     6 -1.415203 48.252283  48.653926

    # You can also apply multiple functions to all columns
    head(collap(df1, x1 + x2 ~ year + month, list(fmean, fmin, fmax)))

      year month  fmean.x1    fmin.x1  fmax.x1  fmean.x2   fmin.x2  fmax.x2
    1 2000     1 -1.217984 -4.2460775 1.245649  4.008534 -1.720181 10.47825
    2 2000     2 -1.117777 -5.0081858 3.330872 11.460301  9.111287 13.86184
    3 2000     3  5.552706  0.1193369 9.464760  8.621904  6.807443 11.54485
    4 2000     4  4.238889  0.8723805 8.627637 22.382953 11.515753 31.66365
    5 2000     5  3.124566 -1.5985090 7.341478 39.982799 31.957653 46.13732
    6 2000     6 -1.415203 -4.6072295 2.655084 48.252283 42.809211 52.31309

    # When you do that, you can also return the data in a long format
    head(collap(df1, x1 + x2 ~ year + month, list(fmean, fmin, fmax), return = "long"))

      Function year month        x1        x2
    1    fmean 2000     1 -1.217984  4.008534
    2    fmean 2000     2 -1.117777 11.460301
    3    fmean 2000     3  5.552706  8.621904
    4    fmean 2000     4  4.238889 22.382953
    5    fmean 2000     5  3.124566 39.982799
    6    fmean 2000     6 -1.415203 48.252283

Note: You can use base functions like mean, max etc. with collap, but fmean, fmax etc. are C++ based grouped functions offered in the collapse package which are significantly faster (i.e. the performance on large data aggregations is the same as data.table while providing greater flexibility, and these fast grouped functions can also be used without collap).

Note 2: collap also supports flexible multitype data aggregation, which you can of course do using the custom argument, but you can also apply functions to numeric and non-numeric columns in a semi-automated way:

    # wlddev is a data set of World Bank Indicators provided in the collapse package
    head(wlddev)

          country iso3c       date year decade     region     income  OECD PCGDP LIFEEX GINI       ODA
    1 Afghanistan   AFG 1961-01-01 1960   1960 South Asia Low income FALSE    NA 32.292   NA 114440000
    2 Afghanistan   AFG 1962-01-01 1961   1960 South Asia Low income FALSE    NA 32.742   NA 233350000
    3 Afghanistan   AFG 1963-01-01 1962   1960 South Asia Low income FALSE    NA 33.185   NA 114880000
    4 Afghanistan   AFG 1964-01-01 1963   1960 South Asia Low income FALSE    NA 33.624   NA 236450000
    5 Afghanistan   AFG 1965-01-01 1964   1960 South Asia Low income FALSE    NA 34.060   NA 302480000
    6 Afghanistan   AFG 1966-01-01 1965   1960 South Asia Low income FALSE    NA 34.495   NA 370250000

    # This aggregates the data, applying the mean to numeric and the statistical mode to categorical columns
    head(collap(wlddev, ~ iso3c + decade, FUN = fmean, catFUN = fmode))

      country iso3c       date   year decade                     region      income  OECD    PCGDP   LIFEEX GINI      ODA
    1   Aruba   ABW 1961-01-01 1962.5   1960 Latin America & Caribbean  High income FALSE       NA 66.58583   NA       NA
    2   Aruba   ABW 1967-01-01 1970.0   1970 Latin America & Caribbean  High income FALSE       NA 69.14178   NA       NA
    3   Aruba   ABW 1976-01-01 1980.0   1980 Latin America & Caribbean  High income FALSE       NA 72.17600   NA 33630000
    4   Aruba   ABW 1987-01-01 1990.0   1990 Latin America & Caribbean  High income FALSE 23677.09 73.45356   NA 41563333
    5   Aruba   ABW 1996-01-01 2000.0   2000 Latin America & Caribbean  High income FALSE 26766.93 73.85773   NA 19857000
    6   Aruba   ABW 2007-01-01 2010.0   2010 Latin America & Caribbean  High income FALSE 25238.80 75.01078   NA       NA

    # Note that by default (argument keep.col.order = TRUE) the column order is also preserved

User · Answer

Using the data.table package (which is fast, useful for larger datasets)

https://github.com/Rdatatable/data.table/wiki

    library(data.table)
    df2 <- setDT(df1)[, lapply(.SD, sum), by = .(year, month), .SDcols = c("x1", "x2")]
    setDF(df2) # convert back to dataframe

Using the plyr package:

    require(plyr)
    df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))

Using summarize() from the Hmisc package (column headings are messy in my example though):

    # need to detach plyr because plyr and Hmisc both have a summarize()
    detach(package:plyr)
    require(Hmisc)
    df2 <- with(df1, summarize(cbind(x1, x2), by = llist(year, month), FUN = colSums))
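A side effect of the data.table snippet is worth knowing: setDT() converts its argument by reference, so df1 itself becomes a data.table afterwards. If the input should stay a plain data.frame, as.data.table() works on a copy instead. A minimal sketch, with `toy` as a made-up stand-in for df1:

```r
library(data.table)

# Hypothetical stand-in for df1 (illustrative values only)
toy <- data.frame(
  year  = c(2000, 2000, 2000, 2001),
  month = c(1, 1, 2, 1),
  x1    = c(1, 2, 3, 4),
  x2    = c(10, 20, 30, 40)
)

# as.data.table() copies the input, so toy is left untouched
res <- as.data.table(toy)[, lapply(.SD, sum),
                          by = .(year, month), .SDcols = c("x1", "x2")]

is.data.table(toy)  # the input was not converted in place
is.data.table(res)
```

The copy costs memory on large inputs, which is the trade-off against the in-place setDT() call.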

User · Answer

Late to the party, but recently found another way to get the summary statistics:

    library(psych)
    describe(data)

This will output mean, min, max, standard deviation, n, standard error, kurtosis, skewness, median, and range for each variable.

User · Answer

Yes, in your formula, you can cbind the numeric variables to be aggregated:

    aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)
       year month         x1          x2
    1  2000     1   7.862002   -7.469298
    2  2001     1 276.758209  474.384252
    3  2000     2  13.122369 -128.122613
    ...
    23 2000    12  63.436507  449.794454
    24 2001    12 999.472226  922.726589

See ?aggregate, the formula argument and the examples.
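Two details of the formula interface may help here. A dot on the left-hand side stands for every column not named on the right, and the default na.action = na.omit drops whole rows before FUN is called, so na.rm = TRUE alone does not keep the non-missing values of such a row. The sketch below uses `toy`, a made-up stand-in for df1 with one missing value.

```r
# Hypothetical stand-in for df1 with one NA (illustrative values only)
toy <- data.frame(
  year  = c(2000, 2000, 2000, 2001),
  month = c(1, 1, 2, 1),
  x1    = c(1, 2, 3, 4),
  x2    = c(10, NA, 30, 40)
)

# "." expands to all columns not used on the right-hand side (x1 and x2 here).
# The default na.action = na.omit removes the whole second row,
# so its x1 value is lost as well.
by_dot <- aggregate(. ~ year + month, data = toy, FUN = sum)

# na.action = na.pass keeps the row; na.rm = TRUE is then forwarded to sum()
by_pass <- aggregate(. ~ year + month, data = toy, FUN = sum,
                     na.rm = TRUE, na.action = na.pass)

by_dot
by_pass
```

Comparing the two results for year 2000, month 1 shows the difference: the x1 total only includes the second row in by_pass.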
