Since you ask for other better ways to handle the problem, here's another way using data.table
:
require(data.table) ## 1.9.2+
setDT(df)
df[a %in% c(0,1,3,4) | c == 4, g := 3L]
df[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]
Note the order of conditional statements is reversed to get g
correctly. There's no copy of g
made, even during the second assignment - it's replaced in-place.
On larger data this would have better performance than using nested if-else
, as it can evaluate both 'yes' and 'no' cases, and nesting can get harder to read/maintain IMHO.
Here's a benchmark on relatively bigger data:
# R version 3.1.0
require(data.table) ## 1.9.2
require(dplyr)
DT <- setDT(lapply(1:6, function(x) sample(7, 1e7, TRUE)))
setnames(DT, letters[1:6])
# > dim(DT)
# [1] 10000000 6
DF <- as.data.frame(DT)
DT_fun <- function(DT) {
DT[(a %in% c(0,1,3,4) | c == 4), g := 3L]
DT[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]
}
DPLYR_fun <- function(DF) {
mutate(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L,
ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
}
BASE_fun <- function(DF) { # R v3.1.0
transform(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L,
ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
}
system.time(ans1 <- DT_fun(DT))
# user system elapsed
# 2.659 0.420 3.107
system.time(ans2 <- DPLYR_fun(DF))
# user system elapsed
# 11.822 1.075 12.976
system.time(ans3 <- BASE_fun(DF))
# user system elapsed
# 11.676 1.530 13.319
identical(as.data.frame(ans1), as.data.frame(ans2))
# [1] TRUE
identical(as.data.frame(ans1), as.data.frame(ans3))
# [1] TRUE
Not sure if this is an alternative you'd asked for, but I hope it helps.