My question involves summing up values across multiple columns of a data frame and creating a new column corresponding to this summation using dplyr
. The data entries in the columns are binary(0,1). I am thinking of a row-wise analog of the summarise_each
or mutate_each
function of dplyr
. Below is a minimal example of the data frame:
library(dplyr)
df=data.frame(
x1=c(1,0,0,NA,0,1,1,NA,0,1),
x2=c(1,1,NA,1,1,0,NA,NA,0,1),
x3=c(0,1,0,1,1,0,NA,NA,0,1),
x4=c(1,0,NA,1,0,0,NA,0,0,1),
x5=c(1,1,NA,1,1,1,NA,1,0,1))
> df
x1 x2 x3 x4 x5
1 1 1 0 1 1
2 0 1 1 0 1
3 0 NA 0 NA NA
4 NA 1 1 1 1
5 0 1 1 0 1
6 1 0 0 0 1
7 1 NA NA NA NA
8 NA NA NA 0 1
9 0 0 0 0 0
10 1 1 1 1 1
I could use something like:
df <- df %>% mutate(sumrow= x1 + x2 + x3 + x4 + x5)
but this would involve writing out the names of each of the columns. I have like 50 columns. In addition, the column names change at different iterations of the loop in which I want to implement this operation so I would like to try avoid having to give any column names.
How can I do that most efficiently? Any assistance would be greatly appreciated.
sum up each row using rowSums
(rowwise
works for any aggreation, but is slower)
df %>%
replace(is.na(.), 0) %>%
mutate(sum = rowSums(across(where(is.numeric))))
sum down each column
df %>%
summarise(across(everything(), ~ sum(., is.na(.), 0)))
sum up each row
df %>%
replace(is.na(.), 0) %>%
mutate(sum = rowSums(.[1:5]))
sum down each column using superseeded summarise_all
:
df %>%
replace(is.na(.), 0) %>%
summarise_all(funs(sum))
If you want to sum certain columns only, I'd use something like this:
library(dplyr)
df=data.frame(
x1=c(1,0,0,NA,0,1,1,NA,0,1),
x2=c(1,1,NA,1,1,0,NA,NA,0,1),
x3=c(0,1,0,1,1,0,NA,NA,0,1),
x4=c(1,0,NA,1,0,0,NA,0,0,1),
x5=c(1,1,NA,1,1,1,NA,1,0,1))
df %>% select(x3:x5) %>% rowSums(na.rm=TRUE) -> df$x3x5.total
head(df)
This way you can use dplyr::select
's syntax.
Using reduce()
from purrr
is slightly faster than rowSums
and definately faster than apply
, since you avoid iterating over all the rows and just take advantage of the vectorized operations:
library(purrr)
library(dplyr)
iris %>% mutate(Petal = reduce(select(., starts_with("Petal")), `+`))
See this for timings
I would use regular expression matching to sum over variables with certain pattern names. For example:
df <- df %>% mutate(sum1 = rowSums(.[grep("x[3-5]", names(.))], na.rm = TRUE),
sum_all = rowSums(.[grep("x", names(.))], na.rm = TRUE))
This way you can create more than one variable as a sum of certain group of variables of your data frame.
I encounter this problem often, and the easiest way to do this is to use the apply()
function within a mutate
command.
library(tidyverse)
df=data.frame(
x1=c(1,0,0,NA,0,1,1,NA,0,1),
x2=c(1,1,NA,1,1,0,NA,NA,0,1),
x3=c(0,1,0,1,1,0,NA,NA,0,1),
x4=c(1,0,NA,1,0,0,NA,0,0,1),
x5=c(1,1,NA,1,1,1,NA,1,0,1))
df %>%
mutate(sum = select(., x1:x5) %>% apply(1, sum, na.rm=TRUE))
Here you could use whatever you want to select the columns using the standard dplyr
tricks (e.g. starts_with()
or contains()
). By doing all the work within a single mutate
command, this action can occur anywhere within a dplyr
stream of processing steps. Finally, by using the apply()
function, you have the flexibility to use whatever summary you need, including your own purpose built summarization function.
Alternatively, if the idea of using a non-tidyverse function is unappealing, then you could gather up the columns, summarize them and finally join the result back to the original data frame.
df <- df %>% mutate( id = 1:n() ) # Need some ID column for this to work
df <- df %>%
group_by(id) %>%
gather('Key', 'value', starts_with('x')) %>%
summarise( Key.Sum = sum(value) ) %>%
left_join( df, . )
Here I used the starts_with()
function to select the columns and calculated the sum and you can do whatever you want with NA
values. The downside to this approach is that while it is pretty flexible, it doesn't really fit into a dplyr
stream of data cleaning steps.
dplyr >= 1.0.0
In newer versions of dplyr
you can use rowwise()
along with c_across
to perform row-wise aggregation for functions that do not have specific row-wise variants, but if the row-wise variant exists it should be faster.
Since rowwise()
is just a special form of grouping and changes the way verbs work you'll likely want to pipe it to ungroup()
after doing your row-wise operation.
To select a range of rows:
df %>%
dplyr::rowwise() %>%
dplyr::mutate(sumrange = sum(dplyr::c_across(x1:x5), na.rm = T))
# %>% dplyr::ungroup() # you'll likely want to ungroup after using rowwise()
To select rows by type:
df %>%
dplyr::rowwise() %>%
dplyr::mutate(sumnumeric = sum(c_across(where(is.numeric)), na.rm = T))
# %>% dplyr::ungroup() # you'll likely want to ungroup after using rowwise()
In your specific case a row-wise variant exists so you can do the following (note the use of across
instead):
df %>%
dplyr::mutate(sumrow = rowSums(dplyr::across(x1:x5), na.rm = T))
For more information see the page on rowwise.
Source: Stackoverflow.com