How to sum a variable by group

Question

I have a data frame with two columns. The first column contains categories such as "First", "Second", "Third", and the second column has numbers that represent the number of times I saw the specific groups from "Category".

For example:

    Category     Frequency
    First        10
    First        15
    First        5
    Second       2
    Third        14
    Third        20
    Second       3

I want to sort the data by Category and sum all the Frequencies:

    Category     Frequency
    First        30
    Second       5
    Third        34

How would I do this in R?

User · Answer

Another solution that returns sums by groups in a matrix or a data frame and is short and fast:

    rowsum(x$Frequency, x$Category)
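For anyone who wants to try this on the question's data, here is a minimal sketch. It assumes the data frame is called `x` (the question does not name it) and rebuilds it by hand. `rowsum()` returns a one-column matrix whose row names are the groups, so the final conversion to a data frame is just one possible way to tidy the result.

```r
# Sample data copied from the question; the name `x` is an assumption.
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# rowsum() returns a matrix: one row per group, one column per summed variable.
sums <- rowsum(x$Frequency, x$Category)

# Optional: turn the matrix into a plain two-column data frame.
result <- data.frame(Category = rownames(sums), Frequency = as.vector(sums))
print(result)
```

The groups come back sorted, which matches the "sort by Category" part of the question.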

User · Answer

Since dplyr 1.0.0, the across() function could be used:

    df %>%
      group_by(Category) %>%
      summarise(across(Frequency, sum))

      Category Frequency
      <chr>        <int>
    1 First           30
    2 Second           5
    3 Third           34

If interested in multiple variables:

    df %>%
      group_by(Category) %>%
      summarise(across(c(Frequency, Frequency2), sum))

      Category Frequency Frequency2
      <chr>        <int>      <int>
    1 First           30         55
    2 Second           5         29
    3 Third           34        190

And the selection of variables using select helpers:

    df %>%
      group_by(Category) %>%
      summarise(across(starts_with("Freq"), sum))

      Category Frequency Frequency2 Frequency3
      <chr>        <int>      <int>      <dbl>
    1 First           30         55        110
    2 Second           5         29         58
    3 Third           34        190        380

Sample data:

    df <- read.table(text = "Category Frequency Frequency2 Frequency3
                     1    First        10         10         20
                     2    First        15         30         60
                     3    First         5         15         30
                     4   Second         2          8         16
                     5    Third        14         70        140
                     6    Third        20        120        240
                     7   Second         3         21         42",
                     header = TRUE,
                     stringsAsFactors = FALSE)

User · Answer

Several years later, just to add another simple base R solution that isn't present here for some reason: xtabs.

    xtabs(Frequency ~ Category, df)
    # Category
    #  First Second  Third
    #     30      5     34

Or if you want a data.frame back:

    as.data.frame(xtabs(Frequency ~ Category, df))
    #   Category Freq
    # 1    First   30
    # 2   Second    5
    # 3    Third   34

User · Answer

If x is a dataframe with your data, then the following will do what you want:

    require(reshape)
    recast(x, Category ~ ., fun.aggregate = sum)

User · Answer

You could use the function group.sum from package Rfast.

    # convert character to numeric; R's as.numeric produces NAs
    Category <- Rfast::as_integer(Category, result.sort = FALSE)
    result <- Rfast::group.sum(Frequency, Category)
    names(result) <- Rfast::Sort(unique(Category))
    # 30 5 34

Rfast has many group functions and group.sum is one of them.

User · Answer

Just to add a third option:

    require(doBy)
    summaryBy(Frequency ~ Category, data = yourdataframe, FUN = sum)

EDIT: this is a very old answer. Now I would recommend the use of group_by and summarise from dplyr, as in @docendo's answer.

User · Answer

    library(tidyverse)

    x <- data.frame(Category = c("First", "First", "First", "Second", "Third", "Third", "Second"),
                    Frequency = c(10, 15, 5, 2, 14, 20, 3))

    count(x, Category, wt = Frequency)

User · Answer

I find ave very helpful (and efficient) when you need to apply different aggregation functions to different columns and you must (or want to) stick to base R.

For example, given this input:

    DF <-
      data.frame(Categ1 = factor(c('A', 'A', 'B', 'B', 'A', 'B', 'A')),
                 Categ2 = factor(c('X', 'Y', 'X', 'X', 'X', 'Y', 'Y')),
                 Samples = c(1, 2, 4, 3, 5, 6, 7),
                 Freq = c(10, 30, 45, 55, 80, 65, 50))

    > DF
      Categ1 Categ2 Samples Freq
    1      A      X       1   10
    2      A      Y       2   30
    3      B      X       4   45
    4      B      X       3   55
    5      A      X       5   80
    6      B      Y       6   65
    7      A      Y       7   50

we want to group by Categ1 and Categ2 and compute the sum of Samples and the mean of Freq. Here's a possible solution using ave:

    # create a copy of DF (only the grouping columns)
    DF2 <- DF[, c('Categ1', 'Categ2')]

    # add sum of Samples by Categ1,Categ2 to DF2
    # (ave repeats the sum of the group for each row in the same group)
    DF2$GroupTotSamples <- ave(DF$Samples, DF2, FUN = sum)

    # add mean of Freq by Categ1,Categ2 to DF2
    # (ave repeats the mean of the group for each row in the same group)
    DF2$GroupAvgFreq <- ave(DF$Freq, DF2, FUN = mean)

    # remove the duplicates (keep only one row for each group)
    DF2 <- DF2[!duplicated(DF2), ]

Result:

    > DF2
      Categ1 Categ2 GroupTotSamples GroupAvgFreq
    1      A      X               6           45
    2      A      Y               9           40
    3      B      X               7           50
    6      B      Y               6           65

User · Answer

The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost, there is a faster alternative:

    library(data.table)
    data = data.table(Category = c("First", "First", "First", "Second", "Third", "Third", "Second"),
                      Frequency = c(10, 15, 5, 2, 14, 20, 3))
    data[, sum(Frequency), by = Category]
    #    Category V1
    # 1:    First 30
    # 2:   Second  5
    # 3:    Third 34
    system.time(data[, sum(Frequency), by = Category])
    # user    system   elapsed
    # 0.008     0.001     0.009

Let's compare that to the same thing using data.frame and the aggregate approach above:

    data = data.frame(Category = c("First", "First", "First", "Second", "Third", "Third", "Second"),
                      Frequency = c(10, 15, 5, 2, 14, 20, 3))
    system.time(aggregate(data$Frequency, by = list(Category = data$Category), FUN = sum))
    # user    system   elapsed
    # 0.008     0.000     0.015

And if you want to keep the column name, this is the syntax:

    data[, list(Frequency = sum(Frequency)), by = Category]
    #    Category Frequency
    # 1:    First        30
    # 2:   Second         5
    # 3:    Third        34

The difference will become more noticeable with larger datasets, as the code below demonstrates:

    data = data.table(Category = rep(c("First", "Second", "Third"), 100000),
                      Frequency = rnorm(100000))
    system.time(data[, sum(Frequency), by = Category])
    # user    system   elapsed
    # 0.055     0.004     0.059
    data = data.frame(Category = rep(c("First", "Second", "Third"), 100000),
                      Frequency = rnorm(100000))
    system.time(aggregate(data$Frequency, by = list(Category = data$Category), FUN = sum))
    # user    system   elapsed
    # 0.287     0.010     0.296

For multiple aggregations, you can combine lapply and .SD as follows:

    data[, lapply(.SD, sum), by = Category]
    #    Category Frequency
    # 1:    First        30
    # 2:   Second         5
    # 3:    Third        34

User · Answer

You can also use the by() function:

    x2 <- by(x$Frequency, x$Category, sum)
    do.call(rbind, as.list(x2))

Those other packages (plyr, reshape) have the benefit of returning a data.frame, but it's worth being familiar with by() since it's a base function.
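If a data frame is wanted from by() without extra packages, something like the following sketch might do. It assumes the question's data sits in a data frame called `x` (a name chosen here for illustration) and relies on the fact that, for a single grouping variable and a scalar summary, the by() result is a one-dimensional array whose names are the groups.

```r
# Sample data copied from the question; the name `x` is an assumption.
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# One sum per category, returned as an object of class "by".
x2 <- by(x$Frequency, x$Category, sum)

# Group labels come from names(); as.numeric() drops the "by" class and attributes.
result <- data.frame(Category = names(x2), Frequency = as.numeric(x2))
print(result)
```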

User · Answer

Using aggregate:

    aggregate(x$Frequency, by = list(Category = x$Category), FUN = sum)
      Category  x
    1    First 30
    2   Second  5
    3    Third 34

In the example above, multiple dimensions can be specified in the list. Multiple aggregated metrics of the same data type can be incorporated via cbind:

    aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...

(embedding @thelatemail's comment) aggregate has a formula interface too:

    aggregate(Frequency ~ Category, x, sum)

Or if you want to aggregate multiple columns, you could use the . notation (works for one column too):

    aggregate(. ~ Category, x, sum)

or tapply:

    tapply(x$Frequency, x$Category, FUN = sum)
     First Second  Third
        30      5     34

Using this data:

    x <- data.frame(Category = factor(c("First", "First", "First", "Second",
                                        "Third", "Third", "Second")),
                    Frequency = c(10, 15, 5, 2, 14, 20, 3))

User · Answer

While I have recently become a convert to dplyr for most of these types of operations, the sqldf package is still really nice (and IMHO more readable) for some things.

Here is an example of how this question can be answered with sqldf:

    library(sqldf)

    x <- data.frame(Category = factor(c("First", "First", "First", "Second",
                                        "Third", "Third", "Second")),
                    Frequency = c(10, 15, 5, 2, 14, 20, 3))

    sqldf("select
              Category
              ,sum(Frequency) as Frequency
           from x
           group by
              Category")

    ##   Category Frequency
    ## 1    First        30
    ## 2   Second         5
    ## 3    Third        34

User · Answer

You can also use the dplyr package for that purpose:

    library(dplyr)
    x %>%
      group_by(Category) %>%
      summarise(Frequency = sum(Frequency))

    #Source: local data frame [3 x 2]
    #
    #  Category Frequency
    #1    First        30
    #2   Second         5
    #3    Third        34

Or, for multiple summary columns (works with one column too):

    x %>%
      group_by(Category) %>%
      summarise(across(everything(), sum))

Here are some more examples of how to summarise data by group using dplyr functions and the built-in dataset mtcars:

    # several summary columns with arbitrary names
    mtcars %>%
      group_by(cyl, gear) %>%                            # multiple group columns
      summarise(max_hp = max(hp), mean_mpg = mean(mpg))  # multiple summary columns

    # summarise all columns except grouping columns using "sum"
    mtcars %>%
      group_by(cyl) %>%
      summarise(across(everything(), sum))

    # summarise all columns except grouping columns using "sum" and "mean"
    mtcars %>%
      group_by(cyl) %>%
      summarise(across(everything(), list(mean = mean, sum = sum)))

    # multiple grouping columns
    mtcars %>%
      group_by(cyl, gear) %>%
      summarise(across(everything(), list(mean = mean, sum = sum)))

    # summarise specific variables, not all
    mtcars %>%
      group_by(cyl, gear) %>%
      summarise(across(c(qsec, mpg, wt), list(mean = mean, sum = sum)))

    # summarise specific variables (numeric columns except grouping columns)
    mtcars %>%
      group_by(gear) %>%
      summarise(across(where(is.numeric), list(mean = mean, sum = sum)))

For more information, including the %>% operator, see the introduction to dplyr.

User · Answer

Using cast instead of recast (note that 'Frequency' is now 'value'):

    df <- data.frame(Category = c("First", "First", "First", "Second", "Third", "Third", "Second"),
                     value = c(10, 15, 5, 2, 14, 20, 3))

    install.packages("reshape")
    library(reshape)

    result <- cast(df, Category ~ ., fun.aggregate = sum)

to get:

    Category (all)
    First     30
    Second    5
    Third     34

User · Answer

    library(plyr)
    ddply(tbl, .(Category), summarise, sum = sum(Frequency))
