[r] Calculate the mean by group

I have a large data frame that looks similar to this:

df <- data.frame(dive = factor(sample(c("dive1","dive2"), 10, replace=TRUE)),
                 speed = runif(10)
                 )
> df
    dive      speed
1  dive1 0.80668490
2  dive1 0.53349584
3  dive2 0.07571784
4  dive2 0.39518628
5  dive1 0.84557955
6  dive1 0.69121443
7  dive1 0.38124950
8  dive2 0.22536126
9  dive1 0.04704750
10 dive2 0.93561651

My goal is to obtain the average of values in one column when another column is equal to a certain value and repeat this for all values. i.e. in the example above I would like to return an average for the column speed for every unique value of the column dive. So when dive==dive1, the average for speed is this and so on for each value of dive.

This question is related to r dataframe r-faq

The answer is


Adding alternative base R approach, which remains fast under various cases.

rowsummean <- function(df) {
  rowsum(df$speed, df$dive) / tabulate(df$dive)
}

Borrowing the benchmarks from @Ari:

10 rows, 2 groups

res1

10 million rows, 10 groups

res2

10 million rows, 1000 groups

res3


We already have tons of options to get mean by group, adding one more from mosaic package.

mosaic::mean(speed~dive, data = df)
#dive1 dive2 
#0.579 0.440 

This returns a named numeric vector, if needed a dataframe we can wrap it in stack

stack(mosaic::mean(speed~dive, data = df))

#  values   ind
#1  0.579 dive1
#2  0.440 dive2

data

set.seed(123)
df <- data.frame(dive=factor(sample(c("dive1","dive2"),10,replace=TRUE)),
                 speed=runif(10))

2015 update with dplyr:

df %>% group_by(dive) %>% summarise(percentage = mean(speed))
Source: local data frame [2 x 2]

   dive percentage
1 dive1  0.4777462
2 dive2  0.6726483

aggregate(speed~dive,data=df,FUN=mean)
   dive     speed
1 dive1 0.7059729
2 dive2 0.5473777

Examples related to r

How to get AIC from Conway–Maxwell-Poisson regression via COM-poisson package in R? R : how to simply repeat a command? session not created: This version of ChromeDriver only supports Chrome version 74 error with ChromeDriver Chrome using Selenium How to show code but hide output in RMarkdown? remove kernel on jupyter notebook Function to calculate R2 (R-squared) in R Center Plot title in ggplot2 R ggplot2: stat_count() must not be used with a y aesthetic error in Bar graph R multiple conditions in if statement What does "The following object is masked from 'package:xxx'" mean?

Examples related to dataframe

Trying to merge 2 dataframes but get ValueError How to show all of columns name on pandas dataframe? Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Display all dataframe columns in a Jupyter Python Notebook How to convert column with string type to int form in pyspark data frame? Display/Print one column from a DataFrame of Series in Pandas Binning column with python pandas Selection with .loc in python Set value to an entire column of a pandas dataframe

Examples related to r-faq

What does "The following object is masked from 'package:xxx'" mean? What does "Error: object '<myvariable>' not found" mean? How do I deal with special characters like \^$.?*|+()[{ in my regex? What does %>% function mean in R? How to plot a function curve in R Use dynamic variable names in `dplyr` Error: unexpected symbol/input/string constant/numeric constant/SPECIAL in my code How should I deal with "package 'xxx' is not available (for R version x.y.z)" warning? How to select the row with the maximum value in each group R data formats: RData, Rda, Rds etc