I have a data frame with two columns. First column contains categories such as "First", "Second", "Third", and the second column has numbers that represent the number of times I saw the specific groups from "Category".
For example:
Category Frequency
First 10
First 15
First 5
Second 2
Third 14
Third 20
Second 3
I want to sort the data by Category and sum all the Frequencies:
Category Frequency
First 30
Second 5
Third 34
How would I do this in R?
If x
is a dataframe with your data, then the following will do what you want:
require(reshape)
recast(x, Category ~ ., fun.aggregate=sum)
using cast
instead of recast
(note 'Frequency'
is now 'value'
)
df <- data.frame(Category = c("First","First","First","Second","Third","Third","Second")
, value = c(10,15,5,2,14,20,3))
install.packages("reshape")
result<-cast(df, Category ~ . ,fun.aggregate=sum)
to get:
Category (all)
First 30
Second 5
Third 34
Another solution that returns sums by groups in a matrix or a data frame and is short and fast:
rowsum(x$Frequency, x$Category)
While I have recently become a convert to dplyr
for most of these types of operations, the sqldf
package is still really nice (and IMHO more readable) for some things.
Here is an example of how this question can be answered with sqldf
x <- data.frame(Category=factor(c("First", "First", "First", "Second",
"Third", "Third", "Second")),
Frequency=c(10,15,5,2,14,20,3))
sqldf("select
Category
,sum(Frequency) as Frequency
from x
group by
Category")
## Category Frequency
## 1 First 30
## 2 Second 5
## 3 Third 34
library(plyr)
ddply(tbl, .(Category), summarise, sum = sum(Frequency))
Just to add a third option:
require(doBy)
summaryBy(Frequency~Category, data=yourdataframe, FUN=sum)
EDIT: this is a very old answer. Now I would recommend the use of group_by
and summarise
from dplyr
, as in @docendo answer.
You can also use the by() function:
x2 <- by(x$Frequency, x$Category, sum)
do.call(rbind,as.list(x2))
Those other packages (plyr, reshape) have the benefit of returning a data.frame, but it's worth being familiar with by() since it's a base function.
The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost there is a faster alternative:
library(data.table)
data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"),
Frequency=c(10,15,5,2,14,20,3))
data[, sum(Frequency), by = Category]
# Category V1
# 1: First 30
# 2: Second 5
# 3: Third 34
system.time(data[, sum(Frequency), by = Category] )
# user system elapsed
# 0.008 0.001 0.009
Let's compare that to the same thing using data.frame and the above above:
data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
Frequency=c(10,15,5,2,14,20,3))
system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
# user system elapsed
# 0.008 0.000 0.015
And if you want to keep the column this is the syntax:
data[,list(Frequency=sum(Frequency)),by=Category]
# Category Frequency
# 1: First 30
# 2: Second 5
# 3: Third 34
The difference will become more noticeable with larger datasets, as the code below demonstrates:
data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
Frequency=rnorm(100000))
system.time( data[,sum(Frequency),by=Category] )
# user system elapsed
# 0.055 0.004 0.059
data = data.frame(Category=rep(c("First", "Second", "Third"), 100000),
Frequency=rnorm(100000))
system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )
# user system elapsed
# 0.287 0.010 0.296
For multiple aggregations, you can combine lapply
and .SD
as follows
data[, lapply(.SD, sum), by = Category]
# Category Frequency
# 1: First 30
# 2: Second 5
# 3: Third 34
I find ave
very helpful (and efficient) when you need to apply different aggregation functions on different columns (and you must/want to stick on base R) :
e.g.
Given this input :
DF <-
data.frame(Categ1=factor(c('A','A','B','B','A','B','A')),
Categ2=factor(c('X','Y','X','X','X','Y','Y')),
Samples=c(1,2,4,3,5,6,7),
Freq=c(10,30,45,55,80,65,50))
> DF
Categ1 Categ2 Samples Freq
1 A X 1 10
2 A Y 2 30
3 B X 4 45
4 B X 3 55
5 A X 5 80
6 B Y 6 65
7 A Y 7 50
we want to group by Categ1
and Categ2
and compute the sum of Samples
and mean of Freq
.
Here's a possible solution using ave
:
# create a copy of DF (only the grouping columns)
DF2 <- DF[,c('Categ1','Categ2')]
# add sum of Samples by Categ1,Categ2 to DF2
# (ave repeats the sum of the group for each row in the same group)
DF2$GroupTotSamples <- ave(DF$Samples,DF2,FUN=sum)
# add mean of Freq by Categ1,Categ2 to DF2
# (ave repeats the mean of the group for each row in the same group)
DF2$GroupAvgFreq <- ave(DF$Freq,DF2,FUN=mean)
# remove the duplicates (keep only one row for each group)
DF2 <- DF2[!duplicated(DF2),]
Result :
> DF2
Categ1 Categ2 GroupTotSamples GroupAvgFreq
1 A X 6 45
2 A Y 9 40
3 B X 7 50
6 B Y 6 65
You could use the function group.sum
from package Rfast.
Category <- Rfast::as_integer(Category,result.sort=FALSE) # convert character to numeric. R's as.numeric produce NAs.
result <- Rfast::group.sum(Frequency,Category)
names(result) <- Rfast::Sort(unique(Category)
# 30 5 34
Rfast has many group functions and group.sum
is one of them.
Since dplyr 1.0.0
, the across()
function could be used:
df %>%
group_by(Category) %>%
summarise(across(Frequency, sum))
Category Frequency
<chr> <int>
1 First 30
2 Second 5
3 Third 34
If interested in multiple variables:
df %>%
group_by(Category) %>%
summarise(across(c(Frequency, Frequency2), sum))
Category Frequency Frequency2
<chr> <int> <int>
1 First 30 55
2 Second 5 29
3 Third 34 190
And the selection of variables using select helpers:
df %>%
group_by(Category) %>%
summarise(across(starts_with("Freq"), sum))
Category Frequency Frequency2 Frequency3
<chr> <int> <int> <dbl>
1 First 30 55 110
2 Second 5 29 58
3 Third 34 190 380
Sample data:
df <- read.table(text = "Category Frequency Frequency2 Frequency3
1 First 10 10 20
2 First 15 30 60
3 First 5 15 30
4 Second 2 8 16
5 Third 14 70 140
6 Third 20 120 240
7 Second 3 21 42",
header = TRUE,
stringsAsFactors = FALSE)
library(tidyverse)
x <- data.frame(Category= c('First', 'First', 'First', 'Second', 'Third', 'Third', 'Second'),
Frequency = c(10, 15, 5, 2, 14, 20, 3))
count(x, Category, wt = Frequency)
You can also use the dplyr package for that purpose:
library(dplyr)
x %>%
group_by(Category) %>%
summarise(Frequency = sum(Frequency))
#Source: local data frame [3 x 2]
#
# Category Frequency
#1 First 30
#2 Second 5
#3 Third 34
Or, for multiple summary columns (works with one column too):
x %>%
group_by(Category) %>%
summarise(across(everything(), sum))
Here are some more examples of how to summarise data by group using dplyr functions using the built-in dataset mtcars
:
# several summary columns with arbitrary names
mtcars %>%
group_by(cyl, gear) %>% # multiple group columns
summarise(max_hp = max(hp), mean_mpg = mean(mpg)) # multiple summary columns
# summarise all columns except grouping columns using "sum"
mtcars %>%
group_by(cyl) %>%
summarise(across(everything(), sum))
# summarise all columns except grouping columns using "sum" and "mean"
mtcars %>%
group_by(cyl) %>%
summarise(across(everything(), list(mean = mean, sum = sum)))
# multiple grouping columns
mtcars %>%
group_by(cyl, gear) %>%
summarise(across(everything(), list(mean = mean, sum = sum)))
# summarise specific variables, not all
mtcars %>%
group_by(cyl, gear) %>%
summarise(across(c(qsec, mpg, wt), list(mean = mean, sum = sum)))
# summarise specific variables (numeric columns except grouping columns)
mtcars %>%
group_by(gear) %>%
summarise(across(where(is.numeric), list(mean = mean, sum = sum)))
For more information, including the %>%
operator, see the introduction to dplyr.
Several years later, just to add another simple base R solution that isn't present here for some reason- xtabs
xtabs(Frequency ~ Category, df)
# Category
# First Second Third
# 30 5 34
Or if you want a data.frame
back
as.data.frame(xtabs(Frequency ~ Category, df))
# Category Freq
# 1 First 30
# 2 Second 5
# 3 Third 34
Source: Stackoverflow.com