Sum across multiple columns with dplyr

Question

My question involves summing up values across multiple columns of a data frame and creating a new column corresponding to this summation using dplyr. The data entries in the columns are binary (0, 1). I am thinking of a row-wise analog of the summarise_each or mutate_each function of dplyr. Below is a minimal example of the data frame:

    library(dplyr)

    df <- data.frame(
      x1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1),
      x2 = c(1, 1, NA, 1, 1, 0, NA, NA, 0, 1),
      x3 = c(0, 1, 0, 1, 1, 0, NA, NA, 0, 1),
      x4 = c(1, 0, NA, 1, 0, 0, NA, 0, 0, 1),
      x5 = c(1, 1, NA, 1, 1, 1, NA, 1, 0, 1))

    > df
       x1 x2 x3 x4 x5
    1   1  1  0  1  1
    2   0  1  1  0  1
    3   0 NA  0 NA NA
    4  NA  1  1  1  1
    5   0  1  1  0  1
    6   1  0  0  0  1
    7   1 NA NA NA NA
    8  NA NA NA  0  1
    9   0  0  0  0  0
    10  1  1  1  1  1

I could use something like:

    df <- df %>% mutate(sumrow = x1 + x2 + x3 + x4 + x5)

but this would involve writing out the names of each of the columns. I have about 50 columns. In addition, the column names change at different iterations of the loop in which I want to implement this operation, so I would like to avoid having to give any column names.

How can I do that most efficiently? Any assistance would be greatly appreciated.

User · Answer

Using reduce() from purrr is slightly faster than rowSums and definitely faster than apply, since you avoid iterating over all the rows and just take advantage of the vectorized operations:

    library(purrr)
    library(dplyr)

    iris %>% mutate(Petal = reduce(select(., starts_with("Petal")), `+`))
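One thing to check before using this on the question's data: `+` propagates NA, so any row containing an NA ends up with an NA sum. Here is a minimal sketch of that behaviour and one possible workaround, assuming dplyr >= 1.0.0 and purrr are installed and that treating NA as 0 is acceptable. The small `df`, the column name `sumrow`, and the use of `coalesce()` are choices made for this illustration, not part of the answer above.

```r
library(dplyr)
library(purrr)

# Small illustrative data frame (a cut-down version of the question's data)
df <- data.frame(
  x1 = c(1, 0, NA),
  x2 = c(1, NA, NA),
  x3 = c(0, 1, NA)
)

# Plain reduce with `+`: any NA in a row makes that row's sum NA
with_na <- df %>%
  mutate(sumrow = reduce(select(., starts_with("x")), `+`))

# Replace NA with 0 in each column first, then add the columns element-wise
na_as_zero <- df %>%
  mutate(sumrow = reduce(map(select(., starts_with("x")), ~ coalesce(.x, 0)), `+`))
```

The second version still works column by column, so it keeps the vectorized character of the original approach.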

User · Answer

I would use regular expression matching to sum over variables with certain pattern names. For example:

    df <- df %>% mutate(sum1 = rowSums(.[grep("x[3-5]", names(.))], na.rm = TRUE),
                        sum_all = rowSums(.[grep("x", names(.))], na.rm = TRUE))

This way you can create more than one variable as a sum of a certain group of variables of your data frame.
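A caveat with loose patterns: an unanchored "x" matches any column whose name merely contains an x. The sketch below illustrates this with a hypothetical extra column `max_val`, and uses dplyr's `matches()` selection helper in place of `grep()` (a swapped-in but equivalent way to select by regular expression). The data and the output column names are made up for the example.

```r
library(dplyr)

df <- data.frame(
  x3 = c(0, 1, NA),
  x4 = c(1, 0, NA),
  x5 = c(1, NA, 1),
  max_val = c(10, 20, 30)  # hypothetical column whose name contains "x"
)

out <- df %>%
  mutate(
    # "x" also matches "max_val", so its values leak into the sum
    sum_loose  = rowSums(select(., matches("x")), na.rm = TRUE),
    # anchoring the pattern restricts the match to x3, x4 and x5
    sum_strict = rowSums(select(., matches("^x[0-9]+$")), na.rm = TRUE)
  )
```

Anchoring with `^` and `$` is worth the extra characters when the column names change between loop iterations, as they do in the question.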

User · Answer

I encounter this problem often, and the easiest way to do this is to use the apply() function within a mutate command.

    library(tidyverse)

    df <- data.frame(
      x1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1),
      x2 = c(1, 1, NA, 1, 1, 0, NA, NA, 0, 1),
      x3 = c(0, 1, 0, 1, 1, 0, NA, NA, 0, 1),
      x4 = c(1, 0, NA, 1, 0, 0, NA, 0, 0, 1),
      x5 = c(1, 1, NA, 1, 1, 1, NA, 1, 0, 1))

    df %>%
      mutate(sum = select(., x1:x5) %>% apply(1, sum, na.rm = TRUE))

Here you could use whatever you want to select the columns using the standard dplyr tricks (e.g. starts_with() or contains()). By doing all the work within a single mutate command, this action can occur anywhere within a dplyr stream of processing steps. Finally, by using the apply() function, you have the flexibility to use whatever summary you need, including your own purpose-built summarization function.

Alternatively, if the idea of using a non-tidyverse function is unappealing, then you could gather up the columns, summarize them and finally join the result back to the original data frame.

    df <- df %>% mutate(id = 1:n())   # Need some ID column for this to work

    df <- df %>%
      group_by(id) %>%
      gather('Key', 'value', starts_with('x')) %>%
      summarise(Key.Sum = sum(value)) %>%
      left_join(df, .)

Here I used the starts_with() function to select the columns and calculated the sum, and you can do whatever you want with NA values. The downside to this approach is that while it is pretty flexible, it doesn't really fit into a dplyr stream of data cleaning steps.
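In current tidyr, `gather()` is superseded by `pivot_longer()`. Here is a rough sketch of the same reshape, summarise, and join idea with the newer function, assuming tidyr >= 1.0.0 and dplyr >= 1.0.0. The names `id`, `key`, `value`, and `key_sum` are illustrative, and `na.rm = TRUE` is one possible choice for handling NA values.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  x1 = c(1, 0, NA),
  x2 = c(1, NA, NA),
  x3 = c(0, 1, NA)
)

# Add an ID so each original row can be identified after reshaping
df_id <- df %>% mutate(id = row_number())

# Long format: one row per (id, column) pair, then one sum per id
row_sums <- df_id %>%
  pivot_longer(starts_with("x"), names_to = "key", values_to = "value") %>%
  group_by(id) %>%
  summarise(key_sum = sum(value, na.rm = TRUE), .groups = "drop")

# Join the per-row sums back onto the original wide data
result <- df_id %>% left_join(row_sums, by = "id")
```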

User · Answer

If you want to sum certain columns only, I'd use something like this:

    library(dplyr)

    df <- data.frame(
      x1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1),
      x2 = c(1, 1, NA, 1, 1, 0, NA, NA, 0, 1),
      x3 = c(0, 1, 0, 1, 1, 0, NA, NA, 0, 1),
      x4 = c(1, 0, NA, 1, 0, 0, NA, 0, 0, 1),
      x5 = c(1, 1, NA, 1, 1, 1, NA, 1, 0, 1))

    df %>% select(x3:x5) %>% rowSums(na.rm = TRUE) -> df$x3x5.total
    head(df)

This way you can use dplyr::select's syntax.
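Since the question mentions that the column names change between loop iterations, the same select-then-rowSums idea can be wrapped in a small function that takes the names as a character vector. This is only a sketch: `add_row_sum` is a hypothetical helper written for this example, and it assumes dplyr >= 1.0.0 for `all_of()`.

```r
library(dplyr)

# Hypothetical helper: adds a column `name` holding the row sums
# of the columns listed in the character vector `cols`.
add_row_sum <- function(data, cols, name = "total") {
  data[[name]] <- unname(rowSums(select(data, all_of(cols)), na.rm = TRUE))
  data
}

df <- data.frame(x3 = c(0, 1, NA), x4 = c(1, 0, NA), x5 = c(1, NA, 1))

# The column names can come from a variable that changes on each iteration
cols <- c("x3", "x4", "x5")
df <- add_row_sum(df, cols, name = "x3x5.total")
```

`all_of()` raises an error if a listed column is missing, which tends to be safer inside a loop than silently summing fewer columns.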

User · Answer

dplyr >= 1.0.0 (using across)

Sum up each row using rowSums (rowwise works for any aggregation, but is slower):

    df %>%
      replace(is.na(.), 0) %>%
      mutate(sum = rowSums(across(where(is.numeric))))

Sum down each column:

    df %>%
      summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))

dplyr < 1.0.0

Sum up each row:

    df %>%
      replace(is.na(.), 0) %>%
      mutate(sum = rowSums(.[1:5]))

Sum down each column using the superseded summarise_all:

    df %>%
      replace(is.na(.), 0) %>%
      summarise_all(sum)
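Note that `replace(is.na(.), 0)` overwrites the NAs in the source columns as well. If the NAs should stay in the data, one option is to skip them only while summing. A minimal sketch, assuming dplyr >= 1.0.0; the three-column `df` is made up for the illustration.

```r
library(dplyr)

df <- data.frame(x1 = c(1, NA), x2 = c(NA, 1), x3 = c(1, 0))

# Row sums: NAs remain in x1..x3 and are ignored only inside rowSums()
rows <- df %>%
  mutate(sum = rowSums(across(where(is.numeric)), na.rm = TRUE))

# Column sums: a one-row data frame with one total per column
cols <- df %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))
```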

User · Answer

dplyr >= 1.0.0

In newer versions of dplyr you can use rowwise() along with c_across() to perform row-wise aggregation for functions that do not have specific row-wise variants, but if a row-wise variant exists it should be faster.

Since rowwise() is just a special form of grouping and changes the way verbs work, you'll likely want to pipe it to ungroup() after doing your row-wise operation.

To select a range of columns:

    df %>%
      dplyr::rowwise() %>%
      dplyr::mutate(sumrange = sum(dplyr::c_across(x1:x5), na.rm = TRUE)) %>%
      dplyr::ungroup()  # you'll likely want to ungroup after using rowwise()

To select columns by type:

    df %>%
      dplyr::rowwise() %>%
      dplyr::mutate(sumnumeric = sum(dplyr::c_across(where(is.numeric)), na.rm = TRUE)) %>%
      dplyr::ungroup()  # you'll likely want to ungroup after using rowwise()

In your specific case a row-wise variant exists, so you can do the following (note the use of across instead):

    df %>%
      dplyr::mutate(sumrow = rowSums(dplyr::across(x1:x5), na.rm = TRUE))

For more information see the dplyr documentation page on rowwise.
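To illustrate the case where rowwise() earns its cost, here is a sketch with a summary that has no built-in row-wise variant in base R, the median. It assumes dplyr >= 1.0.0; the data and the column name `row_median` are illustrative.

```r
library(dplyr)

df <- data.frame(x1 = c(1, 0, NA), x2 = c(1, NA, 4), x3 = c(0, 1, 2))

out <- df %>%
  rowwise() %>%
  # c_across() collects the selected columns of the current row into a vector
  mutate(row_median = median(c_across(x1:x3), na.rm = TRUE)) %>%
  ungroup()  # drop the row-wise grouping so later verbs behave normally
```

After `ungroup()` the result is an ordinary tibble again, so a following `summarise()` or `mutate()` operates on whole columns rather than one row at a time.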
