[r] Grouping functions (tapply, by, aggregate) and the *apply family

Whenever I want to do something "map"py in R, I usually try to use a function in the apply family.

However, I've never quite understood the differences between them -- how {sapply, lapply, etc.} apply the function to the input/grouped input, what the output will look like, or even what the input can be -- so I often just go through them all until I get what I want.

Can someone explain how to use which one when?

My current (probably incorrect/incomplete) understanding is...

  1. sapply(vec, f): input is a vector. output is a vector/matrix, where element i is f(vec[i]), giving you a matrix if f has a multi-element output

  2. lapply(vec, f): same as sapply, but output is a list?

  3. apply(matrix, 1/2, f): input is a matrix. output is a vector, where element i is f(row/col i of the matrix)
  4. tapply(vector, grouping, f): output is a matrix/array, where an element in the matrix/array is the value of f at a grouping g of the vector, and g gets pushed to the row/col names
  5. by(dataframe, grouping, f): let g be a grouping. apply f to each column of the group/dataframe. pretty print the grouping and the value of f at each column.
  6. aggregate(matrix, grouping, f): similar to by, but instead of pretty printing the output, aggregate sticks everything into a dataframe.

Side question: I still haven't learned plyr or reshape -- would plyr or reshape replace all of these entirely?

This question is related to r lapply sapply tapply r-faq

The answer is


I recently discovered the rather useful sweep function and add it here for the sake of completeness:

sweep

The basic idea is to sweep through an array row- or column-wise and return a modified array. An example will make this clear (source: datacamp):

Let's say you have a matrix and want to standardize it column-wise:

dataPoints <- matrix(4:15, nrow = 4)

# Find means per column with `apply()`
dataPoints_means <- apply(dataPoints, 2, mean)

# Find standard deviation with `apply()`
dataPoints_sdev <- apply(dataPoints, 2, sd)

# Center the points 
dataPoints_Trans1 <- sweep(dataPoints, 2, dataPoints_means,"-")

# Return the result
dataPoints_Trans1
##      [,1] [,2] [,3]
## [1,] -1.5 -1.5 -1.5
## [2,] -0.5 -0.5 -0.5
## [3,]  0.5  0.5  0.5
## [4,]  1.5  1.5  1.5

# Normalize
dataPoints_Trans2 <- sweep(dataPoints_Trans1, 2, dataPoints_sdev, "/")

# Return the result
dataPoints_Trans2
##            [,1]       [,2]       [,3]
## [1,] -1.1618950 -1.1618950 -1.1618950
## [2,] -0.3872983 -0.3872983 -0.3872983
## [3,]  0.3872983  0.3872983  0.3872983
## [4,]  1.1618950  1.1618950  1.1618950

NB: for this simple example the same result can of course be achieved more easily by
apply(dataPoints, 2, scale)


First start with Joran's excellent answer -- doubtful anything can better that.

Then the following mnemonics may help to remember the distinctions between each. Whilst some are obvious, others may be less so --- for these you'll find justification in Joran's discussions.

Mnemonics

  • lapply is a list apply which acts on a list or vector and returns a list.
  • sapply is a simple lapply (function defaults to returning a vector or matrix when possible)
  • vapply is a verified apply (allows the return object type to be prespecified)
  • rapply is a recursive apply for nested lists, i.e. lists within lists
  • tapply is a tagged apply where the tags identify the subsets
  • apply is generic: applies a function to a matrix's rows or columns (or, more generally, to dimensions of an array)

Building the Right Background

If using the apply family still feels a bit alien to you, then it might be that you're missing a key point of view.

These two articles can help. They provide the necessary background to motivate the functional programming techniques that are being provided by the apply family of functions.

Users of Lisp will recognise the paradigm immediately. If you're not familiar with Lisp, once you get your head around FP, you'll have gained a powerful point of view for use in R -- and apply will make a lot more sense.


On the side note, here is how the various plyr functions correspond to the base *apply functions (from the intro to plyr document from the plyr webpage http://had.co.nz/plyr/)

Base function   Input   Output   plyr function 
---------------------------------------
aggregate        d       d       ddply + colwise 
apply            a       a/l     aaply / alply 
by               d       l       dlply 
lapply           l       l       llply  
mapply           a       a/l     maply / mlply 
replicate        r       a/l     raply / rlply 
sapply           l       a       laply 

One of the goals of plyr is to provide consistent naming conventions for each of the functions, encoding the input and output data types in the function name. It also provides consistency in output, in that output from dlply() is easily passable to ldply() to produce useful output, etc.

Conceptually, learning plyr is no more difficult than understanding the base *apply functions.

plyr and reshape functions have replaced almost all of these functions in my every day use. But, also from the Intro to Plyr document:

Related functions tapply and sweep have no corresponding function in plyr, and remain useful. merge is useful for combining summaries with the original data.


Despite all the great answers here, there are 2 more base functions that deserve to be mentioned, the useful outer function and the obscure eapply function

outer

outer is a very useful function hidden as a more mundane one. If you read the help for outer its description says:

The outer product of the arrays X and Y is the array A with dimension  
c(dim(X), dim(Y)) where element A[c(arrayindex.x, arrayindex.y)] =   
FUN(X[arrayindex.x], Y[arrayindex.y], ...).

which makes it seem like this is only useful for linear algebra type things. However, it can be used much like mapply to apply a function to two vectors of inputs. The difference is that mapply will apply the function to the first two elements and then the second two etc, whereas outer will apply the function to every combination of one element from the first vector and one from the second. For example:

 A<-c(1,3,5,7,9)
 B<-c(0,3,6,9,12)

mapply(FUN=pmax, A, B)

> mapply(FUN=pmax, A, B)
[1]  1  3  6  9 12

outer(A,B, pmax)

 > outer(A,B, pmax)
      [,1] [,2] [,3] [,4] [,5]
 [1,]    1    3    6    9   12
 [2,]    3    3    6    9   12
 [3,]    5    5    6    9   12
 [4,]    7    7    7    9   12
 [5,]    9    9    9    9   12

I have personally used this when I have a vector of values and a vector of conditions and wish to see which values meet which conditions.

eapply

eapply is like lapply except that rather than applying a function to every element in a list, it applies a function to every element in an environment. For example if you want to find a list of user defined functions in the global environment:

A<-c(1,3,5,7,9)
B<-c(0,3,6,9,12)
C<-list(x=1, y=2)
D<-function(x){x+1}

> eapply(.GlobalEnv, is.function)
$A
[1] FALSE

$B
[1] FALSE

$C
[1] FALSE

$D
[1] TRUE 

Frankly I don't use this very much but if you are building a lot of packages or create a lot of environments it may come in handy.


It is maybe worth mentioning ave. ave is tapply's friendly cousin. It returns results in a form that you can plug straight back into your data frame.

dfr <- data.frame(a=1:20, f=rep(LETTERS[1:5], each=4))
means <- tapply(dfr$a, dfr$f, mean)
##  A    B    C    D    E 
## 2.5  6.5 10.5 14.5 18.5 

## great, but putting it back in the data frame is another line:

dfr$m <- means[dfr$f]

dfr$m2 <- ave(dfr$a, dfr$f, FUN=mean) # NB argument name FUN is needed!
dfr
##   a f    m   m2
##   1 A  2.5  2.5
##   2 A  2.5  2.5
##   3 A  2.5  2.5
##   4 A  2.5  2.5
##   5 B  6.5  6.5
##   6 B  6.5  6.5
##   7 B  6.5  6.5
##   ...

There is nothing in the base package that works like ave for whole data frames (as by is like tapply for data frames). But you can fudge it:

dfr$foo <- ave(1:nrow(dfr), dfr$f, FUN=function(x) {
    x <- dfr[x,]
    sum(x$m*x$m2)
})
dfr
##     a f    m   m2    foo
## 1   1 A  2.5  2.5    25
## 2   2 A  2.5  2.5    25
## 3   3 A  2.5  2.5    25
## ...

There are lots of great answers which discuss differences in the use cases for each function. None of the answer discuss the differences in performance. That is reasonable cause various functions expects various input and produces various output, yet most of them have a general common objective to evaluate by series/groups. My answer is going to focus on performance. Due to above the input creation from the vectors is included in the timing, also the apply function is not measured.

I have tested two different functions sum and length at once. Volume tested is 50M on input and 50K on output. I have also included two currently popular packages which were not widely used at the time when question was asked, data.table and dplyr. Both are definitely worth to look if you are aiming for good performance.

library(dplyr)
library(data.table)
set.seed(123)
n = 5e7
k = 5e5
x = runif(n)
grp = sample(k, n, TRUE)

timing = list()

# sapply
timing[["sapply"]] = system.time({
    lt = split(x, grp)
    r.sapply = sapply(lt, function(x) list(sum(x), length(x)), simplify = FALSE)
})

# lapply
timing[["lapply"]] = system.time({
    lt = split(x, grp)
    r.lapply = lapply(lt, function(x) list(sum(x), length(x)))
})

# tapply
timing[["tapply"]] = system.time(
    r.tapply <- tapply(x, list(grp), function(x) list(sum(x), length(x)))
)

# by
timing[["by"]] = system.time(
    r.by <- by(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)

# aggregate
timing[["aggregate"]] = system.time(
    r.aggregate <- aggregate(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)

# dplyr
timing[["dplyr"]] = system.time({
    df = data_frame(x, grp)
    r.dplyr = summarise(group_by(df, grp), sum(x), n())
})

# data.table
timing[["data.table"]] = system.time({
    dt = setnames(setDT(list(x, grp)), c("x","grp"))
    r.data.table = dt[, .(sum(x), .N), grp]
})

# all output size match to group count
sapply(list(sapply=r.sapply, lapply=r.lapply, tapply=r.tapply, by=r.by, aggregate=r.aggregate, dplyr=r.dplyr, data.table=r.data.table), 
       function(x) (if(is.data.frame(x)) nrow else length)(x)==k)
#    sapply     lapply     tapply         by  aggregate      dplyr data.table 
#      TRUE       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE 

# print timings
as.data.table(sapply(timing, `[[`, "elapsed"), keep.rownames = TRUE
              )[,.(fun = V1, elapsed = V2)
                ][order(-elapsed)]
#          fun elapsed
#1:  aggregate 109.139
#2:         by  25.738
#3:      dplyr  18.978
#4:     tapply  17.006
#5:     lapply  11.524
#6:     sapply  11.326
#7: data.table   2.686

From slide 21 of http://www.slideshare.net/hadley/plyr-one-data-analytic-strategy:

apply, sapply, lapply, by, aggregate

(Hopefully it's clear that apply corresponds to @Hadley's aaply and aggregate corresponds to @Hadley's ddply etc. Slide 20 of the same slideshare will clarify if you don't get it from this image.)

(on the left is input, on the top is output)


Since I realized that (the very excellent) answers of this post lack of by and aggregate explanations. Here is my contribution.

BY

The by function, as stated in the documentation can be though, as a "wrapper" for tapply. The power of by arises when we want to compute a task that tapply can't handle. One example is this code:

ct <- tapply(iris$Sepal.Width , iris$Species , summary )
cb <- by(iris$Sepal.Width , iris$Species , summary )

 cb
iris$Species: setosa
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.300   3.200   3.400   3.428   3.675   4.400 
-------------------------------------------------------------- 
iris$Species: versicolor
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   2.525   2.800   2.770   3.000   3.400 
-------------------------------------------------------------- 
iris$Species: virginica
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.200   2.800   3.000   2.974   3.175   3.800 


ct
$setosa
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.300   3.200   3.400   3.428   3.675   4.400 

$versicolor
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   2.525   2.800   2.770   3.000   3.400 

$virginica
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.200   2.800   3.000   2.974   3.175   3.800 

If we print these two objects, ct and cb, we "essentially" have the same results and the only differences are in how they are shown and the different class attributes, respectively by for cb and array for ct.

As I've said, the power of by arises when we can't use tapply; the following code is one example:

 tapply(iris, iris$Species, summary )
Error in tapply(iris, iris$Species, summary) : 
  arguments must have same length

R says that arguments must have the same lengths, say "we want to calculate the summary of all variable in iris along the factor Species": but R just can't do that because it does not know how to handle.

With the by function R dispatch a specific method for data frame class and then let the summary function works even if the length of the first argument (and the type too) are different.

bywork <- by(iris, iris$Species, summary )

bywork
iris$Species: setosa
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200   versicolor: 0  
 Median :5.000   Median :3.400   Median :1.500   Median :0.200   virginica : 0  
 Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246                  
 3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300                  
 Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600                  
-------------------------------------------------------------- 
iris$Species: versicolor
  Sepal.Length    Sepal.Width     Petal.Length   Petal.Width          Species  
 Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000   setosa    : 0  
 1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200   versicolor:50  
 Median :5.900   Median :2.800   Median :4.35   Median :1.300   virginica : 0  
 Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326                  
 3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500                  
 Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800                  
-------------------------------------------------------------- 
iris$Species: virginica
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400   setosa    : 0  
 1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800   versicolor: 0  
 Median :6.500   Median :3.000   Median :5.550   Median :2.000   virginica :50  
 Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026                  
 3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300                  
 Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500     

it works indeed and the result is very surprising. It is an object of class by that along Species (say, for each of them) computes the summary of each variable.

Note that if the first argument is a data frame, the dispatched function must have a method for that class of objects. For example is we use this code with the mean function we will have this code that has no sense at all:

 by(iris, iris$Species, mean)
iris$Species: setosa
[1] NA
------------------------------------------- 
iris$Species: versicolor
[1] NA
------------------------------------------- 
iris$Species: virginica
[1] NA
Warning messages:
1: In mean.default(data[x, , drop = FALSE], ...) :
  argument is not numeric or logical: returning NA
2: In mean.default(data[x, , drop = FALSE], ...) :
  argument is not numeric or logical: returning NA
3: In mean.default(data[x, , drop = FALSE], ...) :
  argument is not numeric or logical: returning NA

AGGREGATE

aggregate can be seen as another a different way of use tapply if we use it in such a way.

at <- tapply(iris$Sepal.Length , iris$Species , mean)
ag <- aggregate(iris$Sepal.Length , list(iris$Species), mean)

 at
    setosa versicolor  virginica 
     5.006      5.936      6.588 
 ag
     Group.1     x
1     setosa 5.006
2 versicolor 5.936
3  virginica 6.588

The two immediate differences are that the second argument of aggregate must be a list while tapply can (not mandatory) be a list and that the output of aggregate is a data frame while the one of tapply is an array.

The power of aggregate is that it can handle easily subsets of the data with subset argument and that it has methods for ts objects and formula as well.

These elements make aggregate easier to work with that tapply in some situations. Here are some examples (available in documentation):

ag <- aggregate(len ~ ., data = ToothGrowth, mean)

 ag
  supp dose   len
1   OJ  0.5 13.23
2   VC  0.5  7.98
3   OJ  1.0 22.70
4   VC  1.0 16.77
5   OJ  2.0 26.06
6   VC  2.0 26.14

We can achieve the same with tapply but the syntax is slightly harder and the output (in some circumstances) less readable:

att <- tapply(ToothGrowth$len, list(ToothGrowth$dose, ToothGrowth$supp), mean)

 att
       OJ    VC
0.5 13.23  7.98
1   22.70 16.77
2   26.06 26.14

There are other times when we can't use by or tapply and we have to use aggregate.

 ag1 <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)

 ag1
  Month    Ozone     Temp
1     5 23.61538 66.73077
2     6 29.44444 78.22222
3     7 59.11538 83.88462
4     8 59.96154 83.96154
5     9 31.44828 76.89655

We cannot obtain the previous result with tapply in one call but we have to calculate the mean along Month for each elements and then combine them (also note that we have to call the na.rm = TRUE, because the formula methods of the aggregate function has by default the na.action = na.omit):

ta1 <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
ta2 <- tapply(airquality$Temp, airquality$Month, mean, na.rm = TRUE)

 cbind(ta1, ta2)
       ta1      ta2
5 23.61538 65.54839
6 29.44444 79.10000
7 59.11538 83.90323
8 59.96154 83.96774
9 31.44828 76.90000

while with by we just can't achieve that in fact the following function call returns an error (but most likely it is related to the supplied function, mean):

by(airquality[c("Ozone", "Temp")], airquality$Month, mean, na.rm = TRUE)

Other times the results are the same and the differences are just in the class (and then how it is shown/printed and not only -- example, how to subset it) object:

byagg <- by(airquality[c("Ozone", "Temp")], airquality$Month, summary)
aggagg <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, summary)

The previous code achieve the same goal and results, at some points what tool to use is just a matter of personal tastes and needs; the previous two objects have very different needs in terms of subsetting.


In the collapse package recently released on CRAN, I have attempted to compress most of the common apply functionality into just 2 functions:

  1. dapply (Data-Apply) applies functions to rows or (default) columns of matrices and data.frames and (default) returns an object of the same type and with the same attributes (unless the result of each computation is atomic and drop = TRUE). The performance is comparable to lapply for data.frame columns, and about 2x faster than apply for matrix rows or columns. Parallelism is available via mclapply (only for MAC).

Syntax:

dapply(X, FUN, ..., MARGIN = 2, parallel = FALSE, mc.cores = 1L, 
       return = c("same", "matrix", "data.frame"), drop = TRUE)

Examples:

# Apply to columns:
dapply(mtcars, log)
dapply(mtcars, sum)
dapply(mtcars, quantile)
# Apply to rows:
dapply(mtcars, sum, MARGIN = 1)
dapply(mtcars, quantile, MARGIN = 1)
# Return as matrix:
dapply(mtcars, quantile, return = "matrix")
dapply(mtcars, quantile, MARGIN = 1, return = "matrix")
# Same for matrices ...
  1. BY is a S3 generic for split-apply-combine computing with vector, matrix and data.frame method. It is significantly faster than tapply, by and aggregate (an also faster than plyr, on large data dplyr is faster though).

Syntax:

BY(X, g, FUN, ..., use.g.names = TRUE, sort = TRUE,
   expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
   return = c("same", "matrix", "data.frame", "list"))

Examples:

# Vectors:
BY(iris$Sepal.Length, iris$Species, sum)
BY(iris$Sepal.Length, iris$Species, quantile)
BY(iris$Sepal.Length, iris$Species, quantile, expand.wide = TRUE) # This returns a matrix 
# Data.frames
BY(iris[-5], iris$Species, sum)
BY(iris[-5], iris$Species, quantile)
BY(iris[-5], iris$Species, quantile, expand.wide = TRUE) # This returns a wider data.frame
BY(iris[-5], iris$Species, quantile, return = "matrix") # This returns a matrix
# Same for matrices ...

Lists of grouping variables can also be supplied to g.

Talking about performance: A main goal of collapse is to foster high-performance programming in R and to move beyond split-apply-combine alltogether. For this purpose the package has a full set of C++ based fast generic functions: fmean, fmedian, fmode, fsum, fprod, fsd, fvar, fmin, fmax, ffirst, flast, fNobs, fNdistinct, fscale, fbetween, fwithin, fHDbetween, fHDwithin, flag, fdiff and fgrowth. They perform grouped computations in a single pass through the data (i.e. no splitting and recombining).

Syntax:

fFUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,] use.g.names = TRUE, drop = TRUE)

Examples:

v <- iris$Sepal.Length
f <- iris$Species

# Vectors
fmean(v)             # mean
fmean(v, f)          # grouped mean
fsd(v, f)            # grouped standard deviation
fsd(v, f, TRA = "/") # grouped scaling
fscale(v, f)         # grouped standardizing (scaling and centering)
fwithin(v, f)        # grouped demeaning

w <- abs(rnorm(nrow(iris)))
fmean(v, w = w)      # Weighted mean
fmean(v, f, w)       # Weighted grouped mean
fsd(v, f, w)         # Weighted grouped standard-deviation
fsd(v, f, w, "/")    # Weighted grouped scaling
fscale(v, f, w)      # Weighted grouped standardizing
fwithin(v, f, w)     # Weighted grouped demeaning

# Same using data.frames...
fmean(iris[-5], f)                # grouped mean
fscale(iris[-5], f)               # grouped standardizing
fwithin(iris[-5], f)              # grouped demeaning

# Same with matrices ...

In the package vignettes I provide benchmarks. Programming with the fast functions is significantly faster than programming with dplyr or data.table, especially on smaller data, but also on large data.


Examples related to r

How to get AIC from Conway–Maxwell-Poisson regression via COM-poisson package in R? R : how to simply repeat a command? session not created: This version of ChromeDriver only supports Chrome version 74 error with ChromeDriver Chrome using Selenium How to show code but hide output in RMarkdown? remove kernel on jupyter notebook Function to calculate R2 (R-squared) in R Center Plot title in ggplot2 R ggplot2: stat_count() must not be used with a y aesthetic error in Bar graph R multiple conditions in if statement What does "The following object is masked from 'package:xxx'" mean?

Examples related to lapply

How to index an element of a list object in R passing several arguments to FUN of lapply (and others *apply) Read all files in a folder and apply a function to each data frame Grouping functions (tapply, by, aggregate) and the *apply family

Examples related to sapply

Apply function to each column in a data frame observing each columns existing data type Apply a function to every row of a matrix or a data frame Grouping functions (tapply, by, aggregate) and the *apply family

Examples related to tapply

Grouping functions (tapply, by, aggregate) and the *apply family

Examples related to r-faq

What does "The following object is masked from 'package:xxx'" mean? What does "Error: object '<myvariable>' not found" mean? How do I deal with special characters like \^$.?*|+()[{ in my regex? What does %>% function mean in R? How to plot a function curve in R Use dynamic variable names in `dplyr` Error: unexpected symbol/input/string constant/numeric constant/SPECIAL in my code How should I deal with "package 'xxx' is not available (for R version x.y.z)" warning? How to select the row with the maximum value in each group R data formats: RData, Rda, Rds etc