Can mutate be used when the mutation is conditional (depending on the values of certain columns)?
This example helps show what I mean.
df <- structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1), b = c(1, 3, 4,
2, 6, 7, 2, 6), c = c(6, 3, 6, 5, 3, 6, 5, 3), d = c(6, 2, 4,
5, 3, 7, 2, 6), e = c(1, 2, 4, 5, 6, 7, 6, 3), f = c(2, 3, 4,
2, 2, 7, 5, 2)), .Names = c("a", "b", "c", "d", "e", "f"), row.names = c(NA,
8L), class = "data.frame")
  a b c d e f
1 1 1 6 6 1 2
2 3 3 3 2 2 3
3 4 4 6 4 4 4
4 6 2 5 5 5 2
5 3 6 3 3 6 2
6 2 7 6 7 7 7
7 5 2 5 2 6 5
8 1 6 3 6 3 2
I was hoping to find a solution to my problem using the dplyr package for creating a new column g (and yes, I know this is not code that should work, but I guess it makes the purpose clear):
library(dplyr)
df <- mutate(df,
if (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)){g = 2},
if (a == 0 | a == 1 | a == 4 | a == 3 | c == 4) {g = 3})
The code I am looking for should produce this result for this particular example:
  a b c d e f  g
1 1 1 6 6 1 2  3
2 3 3 3 2 2 3  3
3 4 4 6 4 4 4  3
4 6 2 5 5 5 2 NA
5 3 6 3 3 6 2 NA
6 2 7 6 7 7 7  2
7 5 2 5 2 6 5  2
8 1 6 3 6 3 2  3
Does anyone have an idea about how to do this in dplyr? This data frame is just an example; the data frames I am dealing with are much larger. I tried dplyr because of its speed, but perhaps there are other, better ways to handle this problem?
dplyr now has a function case_when that offers a vectorised if. The syntax is a little strange compared to mosaic::derivedFactor, as you need to declare the type of NA (and, in older versions, could not access variables in the standard dplyr way), but it is considerably faster than mosaic::derivedFactor.
df %>%
mutate(g = case_when(a %in% c(2,5,7) | (a==1 & b==4) ~ 2L,
a %in% c(0,1,3,4) | c == 4 ~ 3L,
TRUE~as.integer(NA)))
EDIT: If you're using dplyr::case_when() from before version 0.7.0 of the package, then you need to precede variable names with '.$' (e.g. write .$a == 1 inside case_when).
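To make the evaluation order concrete, here is a minimal sketch on a hypothetical three-row data frame (toy is not part of the question's data). case_when checks the conditions from top to bottom and uses the first one that is TRUE, so a row that satisfies both conditions gets 2. NA_integer_ is base R's typed integer NA, equivalent to as.integer(NA).

```r
library(dplyr)

# Hypothetical toy data, chosen so that row 2 matches both conditions
toy <- data.frame(a = c(1, 2, 6), c = c(6, 4, 5))

res <- toy %>%
  mutate(g = case_when(
    a %in% c(2, 5, 7) ~ 2L,          # checked first
    a == 1 | c == 4   ~ 3L,          # only used when the first condition is not TRUE
    TRUE              ~ NA_integer_  # fallback, typed to match the integer results
  ))

res$g  # row 1 -> 3, row 2 -> 2 (first match wins), row 3 -> NA
```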
Benchmark: reusing the functions from Arun's post, with a reduced sample size:
require(data.table)
require(mosaic)
require(dplyr)
require(microbenchmark)
set.seed(42) # To recreate the dataframe
DT <- setDT(lapply(1:6, function(x) sample(7, 10000, TRUE)))
setnames(DT, letters[1:6])
DF <- as.data.frame(DT)
DPLYR_case_when <- function(DF) {
DF %>%
mutate(g = case_when(a %in% c(2,5,7) | (a==1 & b==4) ~ 2L,
a %in% c(0,1,3,4) | c==4 ~ 3L,
TRUE~as.integer(NA)))
}
DT_fun <- function(DT) {
DT[(a %in% c(0,1,3,4) | c == 4), g := 3L]
DT[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]
}
DPLYR_fun <- function(DF) {
mutate(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L,
ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
}
mosa_fun <- function(DF) {
mutate(DF, g = derivedFactor(
"2" = (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)),
"3" = (a == 0 | a == 1 | a == 4 | a == 3 | c == 4),
.method = "first",
.default = NA
))
}
perf_results <- microbenchmark(
dt_fun = DT_fun(copy(DT)),            # name = expression, so the name labels the result row
dplyr_ifelse = DPLYR_fun(copy(DF)),
dplyr_case_when = DPLYR_case_when(copy(DF)),
mosa = mosa_fun(copy(DF)),
times = 100L
)
This gives:
print(perf_results)
Unit: milliseconds
            expr        min         lq       mean     median         uq        max neval
          dt_fun   1.391402   1.560751   1.658337   1.651201   1.716851   2.383801   100
    dplyr_ifelse   1.172601   1.230351   1.331538   1.294851   1.390351   1.995701   100
 dplyr_case_when   1.648201   1.768002   1.860968   1.844101   1.958801   2.207001   100
            mosa 255.591301 281.158350 291.391586 286.549802 292.101601 545.880702   100
case_when is now a pretty clean implementation of the SQL-style case when:
structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1), b = c(1, 3, 4,
2, 6, 7, 2, 6), c = c(6, 3, 6, 5, 3, 6, 5, 3), d = c(6, 2, 4,
5, 3, 7, 2, 6), e = c(1, 2, 4, 5, 6, 7, 6, 3), f = c(2, 3, 4,
2, 2, 7, 5, 2)), .Names = c("a", "b", "c", "d", "e", "f"), row.names = c(NA,
8L), class = "data.frame") -> df
library(dplyr)  # provides %>%, mutate() and case_when()
df %>%
mutate( g = case_when(
a == 2 | a == 5 | a == 7 | (a == 1 & b == 4 ) ~ 2,
a == 0 | a == 1 | a == 4 | a == 3 | c == 4 ~ 3
))
Using dplyr 0.7.4
The manual: http://dplyr.tidyverse.org/reference/case_when.html
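For reference, a self-contained sketch of what this call returns on the question's data. No explicit fallback is needed here: rows that match none of the conditions get NA. Note that row 5 has a == 3, which satisfies the second condition, so it gets 3.

```r
library(dplyr)

# The data frame from the question
df <- data.frame(
  a = c(1, 3, 4, 6, 3, 2, 5, 1),
  b = c(1, 3, 4, 2, 6, 7, 2, 6),
  c = c(6, 3, 6, 5, 3, 6, 5, 3),
  d = c(6, 2, 4, 5, 3, 7, 2, 6),
  e = c(1, 2, 4, 5, 6, 7, 6, 3),
  f = c(2, 3, 4, 2, 2, 7, 5, 2)
)

out <- df %>%
  mutate(g = case_when(
    a == 2 | a == 5 | a == 7 | (a == 1 & b == 4) ~ 2,
    a == 0 | a == 1 | a == 4 | a == 3 | c == 4   ~ 3
  ))

out$g  # 3 3 3 NA 3 2 2 3; only row 4 (a == 6, c == 5) matches nothing
```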
The derivedFactor function from the mosaic package seems to be designed to handle this. Using this example, it would look like:
library(dplyr)
library(mosaic)
df <- mutate(df, g = derivedFactor(
"2" = (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)),
"3" = (a == 0 | a == 1 | a == 4 | a == 3 | c == 4),
.method = "first",
.default = NA
))
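(If you want the result to be numeric instead of a factor, wrap derivedFactor in as.numeric(as.character(...)). Calling as.numeric directly on a factor returns the internal level codes, here 1 and 2, rather than the labels 2 and 3.)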
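One caveat when turning a factor into numbers: as.numeric applied directly to a factor returns the internal level codes, not the labels. A minimal base R sketch, where the factor f is hypothetical and built by hand to mimic a result with the levels "2" and "3":

```r
# Hand-built stand-in for a factor result with levels "2" and "3"
f <- factor(c("3", "2", NA, "3"), levels = c("2", "3"))

codes  <- as.numeric(f)                # level codes: 2 1 NA 2
labels <- as.numeric(as.character(f))  # labels as numbers: 3 2 NA 3
```

Going through as.character first is what recovers the values 2 and 3.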
derivedFactor can be used for an arbitrary number of conditionals, too.
Since you ask for other, better ways to handle the problem, here's another way using data.table:
require(data.table) ## 1.9.2+
setDT(df)
df[a %in% c(0,1,3,4) | c == 4, g := 3L]
df[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]
Note the order of conditional statements is reversed to get g correctly. There's no copy of g made, even during the second assignment: it's replaced in-place.
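A minimal sketch of why the order matters, using a hypothetical one-row table (toy is made up for illustration) whose row satisfies both conditions. The later := assignment overwrites the earlier one, so the rule that should take priority has to run last:

```r
library(data.table)

# Hypothetical row: a == 2 satisfies the "g = 2" rule, c == 4 satisfies the "g = 3" rule
toy <- data.table(a = 2, b = 1, c = 4)

toy[a %in% c(0, 1, 3, 4) | c == 4, g := 3L]          # lower priority, runs first
toy[a %in% c(2, 5, 7) | (a == 1 & b == 4), g := 2L]  # higher priority, overwrites in place

toy$g  # 2
```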
On larger data this would perform better than nested ifelse calls, because ifelse can evaluate both the 'yes' and the 'no' argument in full, and nesting can get harder to read/maintain IMHO.
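The point about ifelse evaluating its branches in full can be seen in a small base R sketch (x and warned are hypothetical names used only here). The 'yes' argument sqrt(x) is computed for every element, including the negative one that the test excludes, which triggers a warning even though that value is never used in the result:

```r
x <- c(4, -1, 9)
warned <- FALSE

r <- withCallingHandlers(
  ifelse(x >= 0, sqrt(x), NA_real_),  # sqrt(x) is evaluated on all of x, including -1
  warning = function(w) {
    warned <<- TRUE                   # record that sqrt(-1) was actually computed
    invokeRestart("muffleWarning")
  }
)

r       # 2 NA 3
warned  # TRUE
```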
Here's a benchmark on relatively bigger data:
# R version 3.1.0
require(data.table) ## 1.9.2
require(dplyr)
DT <- setDT(lapply(1:6, function(x) sample(7, 1e7, TRUE)))
setnames(DT, letters[1:6])
# > dim(DT)
# [1] 10000000 6
DF <- as.data.frame(DT)
DT_fun <- function(DT) {
DT[(a %in% c(0,1,3,4) | c == 4), g := 3L]
DT[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]
}
DPLYR_fun <- function(DF) {
mutate(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L,
ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
}
BASE_fun <- function(DF) { # R v3.1.0
transform(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L,
ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
}
system.time(ans1 <- DT_fun(DT))
# user system elapsed
# 2.659 0.420 3.107
system.time(ans2 <- DPLYR_fun(DF))
# user system elapsed
# 11.822 1.075 12.976
system.time(ans3 <- BASE_fun(DF))
# user system elapsed
# 11.676 1.530 13.319
identical(as.data.frame(ans1), as.data.frame(ans2))
# [1] TRUE
identical(as.data.frame(ans1), as.data.frame(ans3))
# [1] TRUE
Not sure if this is an alternative you'd asked for, but I hope it helps.
Source: Stackoverflow.com