In R, I have an operation which creates some Inf values when I transform a dataframe.
I would like to turn these Inf values into NA values. The code I have is slow for large data. Is there a faster way of doing this?
Say I have the following dataframe:
dat <- data.frame(a=c(1, Inf), b=c(Inf, 3), d=c("a","b"))
The following works in a single case:
dat[,1][is.infinite(dat[,1])] = NA
So I generalized it with the following loop:
cf_DFinf2NA <- function(x)
{
    for (i in 1:ncol(x)){
        x[,i][is.infinite(x[,i])] = NA
    }
    return(x)
}
But I don't think that this is really using the power of R.
Another solution:
dat <- data.frame(a = rep(c(1,Inf), 1e6), b = rep(c(Inf,2), 1e6),
c = rep(c('a','b'),1e6),d = rep(c(1,Inf), 1e6),
e = rep(c(Inf,2), 1e6))
system.time(dat[dat==Inf] <- NA)
# user system elapsed
# 0.316 0.024 0.340
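Note that dat == Inf only matches positive infinity. A minimal sketch of extending the same idea to -Inf, assuming a small made-up data frame (dat2 and inf_mask are illustrative names, not from the answer above):

```r
# Hypothetical example data with both +Inf and -Inf
dat2 <- data.frame(a = c(1, Inf, -Inf), b = c(-Inf, 3, 4), d = c("a", "b", "c"))

# Logical matrix that is TRUE wherever a cell equals +Inf or -Inf
inf_mask <- dat2 == Inf | dat2 == -Inf

# Replace every flagged cell with NA; the character column is left alone
dat2[inf_mask] <- NA
```

This keeps the one-line feel of the original, but I have not timed it against the other approaches.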
Feng Mai has a tidyverse answer above to get negative and positive infinities:
dat %>% mutate_if(is.numeric, list(~na_if(., Inf))) %>%
mutate_if(is.numeric, list(~na_if(., -Inf)))
This works well, but a word of warning: do not swap in abs(.) to do both lines at once, as is proposed in an upvoted comment. It will look like it works, but it changes all negative values in the dataset to positive! You can confirm with this:
data(iris)
#The last line here is bad - it converts all negative values to positive
iris %>%
mutate_if(is.numeric, ~scale(.)) %>%
mutate(infinities = Sepal.Length / 0) %>%
mutate_if(is.numeric, list(~na_if(abs(.), Inf)))
For one line, this works:
mutate_if(is.numeric, ~ifelse(abs(.) == Inf,NA,.))
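The difference is easy to see on a plain vector without any packages. A small base R sketch (x, bad and good are made-up names for illustration) contrasting the two patterns:

```r
x <- c(-2, 1, -Inf, Inf)

# Pattern from the comment: take abs() first, then blank out Inf.
# The infinities go away, but -2 has silently become 2.
bad <- abs(x)
bad[bad == Inf] <- NA

# Pattern from the one-liner: use abs() only in the test and
# return the original value otherwise, so signs are preserved.
good <- ifelse(abs(x) == Inf, NA, x)

print(bad)   # 2 1 NA NA
print(good)  # -2 1 NA NA
```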
[<- with mapply is a bit faster than sapply.
> dat[mapply(is.infinite, dat)] <- NA
With mnel's data, the timing is
> system.time(dat[mapply(is.infinite, dat)] <- NA)
# user system elapsed
# 15.281 0.000 13.750
There is a very simple solution to this problem in the hablar package:
library(hablar)
dat %>% rationalize()
This returns a data frame with all Inf values converted to NA.
Timings compared to some of the solutions above. Code:
library(hablar)
library(data.table)
dat <- data.frame(a = rep(c(1,Inf), 1e6), b = rep(c(Inf,2), 1e6),
c = rep(c('a','b'),1e6),d = rep(c(1,Inf), 1e6),
e = rep(c(Inf,2), 1e6))
DT <- data.table(dat)
system.time(dat[mapply(is.infinite, dat)] <- NA)
system.time(dat[dat==Inf] <- NA)
system.time(invisible(lapply(names(DT),function(.name) set(DT, which(is.infinite(DT[[.name]])), j = .name,value =NA))))
system.time(rationalize(dat))
Result:
> system.time(dat[mapply(is.infinite, dat)] <- NA)
user system elapsed
0.125 0.039 0.164
> system.time(dat[dat==Inf] <- NA)
user system elapsed
0.095 0.010 0.108
> system.time(invisible(lapply(names(DT),function(.name) set(DT, which(is.infinite(DT[[.name]])), j = .name,value =NA))))
user system elapsed
0.065 0.002 0.067
> system.time(rationalize(dat))
user system elapsed
0.058 0.014 0.072
It seems data.table is slightly faster than hablar, but it has a longer syntax.
Use sapply and is.na<-:
> dat <- data.frame(a=c(1, Inf), b=c(Inf, 3), d=c("a","b"))
> is.na(dat) <- sapply(dat, is.infinite)
> dat
a b d
1 1 NA a
2 NA 3 b
Or you can use (giving credit to @mnel, whose edit this is),
> is.na(dat) <- do.call(cbind,lapply(dat, is.infinite))
which is significantly faster.
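To make it clearer what is.na<- receives here, a short sketch that builds the logical matrix as a separate step (mask is just an illustrative name, and the example data is the question's data with one -Inf added):

```r
dat <- data.frame(a = c(1, Inf), b = c(-Inf, 3), d = c("a", "b"))

# One logical vector per column, bound into a matrix with the same
# shape as dat. is.infinite() is TRUE for both Inf and -Inf and
# FALSE for every element of a non-numeric column.
mask <- do.call(cbind, lapply(dat, is.infinite))
print(mask)

# Set every cell flagged in the mask to NA
is.na(dat) <- mask
```

Because is.infinite() covers both signs, no second pass for -Inf is needed with this approach.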
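Also, if someone needs the coordinates of the Inf values, they can do this:
library(rlist)
# lapply works column by column; apply() would first coerce a mixed-type data frame to a character matrix
list.clean(lapply(df, function(x){which(is.infinite(x))}), function(x) length(x) == 0L, TRUE)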
Result:
$colname1
[1] row1 row2 ...
$colname2
[1] row1 row2 ...
With this information, you can replace the Inf values in particular places with the mean, median, or any other statistic you want.
For example, for the first element:
repInf = list.clean(lapply(df, function(x){which(is.infinite(x))}), function(x) length(x) == 0L, TRUE)
col1 = names(repInf)[[1]]
# mean() or median() of the finite values only; mean(is.finite(x)) would give the share of finite values instead
df[repInf[[1]], col1] = mean(df[[col1]][is.finite(df[[col1]])])
In a loop:
for (nonInf in seq_along(repInf)) {
    col = names(repInf)[[nonInf]]
    # average only the finite values of this column
    df[repInf[[nonInf]], col] = mean(df[[col]][is.finite(df[[col]])])
}
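If you would rather not depend on rlist, the same idea can be sketched in base R. This is only an illustration under my own assumptions: replace_inf_with() is a hypothetical helper name, and the example data frame is made up.

```r
# Hypothetical helper: replace +/-Inf in each numeric column with a
# summary (mean by default) of that column's finite values.
replace_inf_with <- function(df, fun = mean) {
  for (col in names(df)) {
    x <- df[[col]]
    if (!is.numeric(x)) next          # skip character/factor columns
    bad <- is.infinite(x)
    if (any(bad)) {
      df[[col]][bad] <- fun(x[is.finite(x)])
    }
  }
  df
}

df <- data.frame(u = c(1, Inf, 3), v = c(-Inf, 10, 20), w = c("x", "y", "z"))
out <- replace_inf_with(df, median)
print(out$u)  # 1 2 3
print(out$v)  # 15 10 20
```

Existing NA values are left untouched, since is.infinite() and is.finite() are both FALSE for NA.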
Once the Inf values have been converted to NA, you may also use the handy replace_na function to fill them in: https://tidyr.tidyverse.org/reference/replace_na.html
Here is a dplyr/tidyverse solution using the na_if() function:
dat %>% mutate_if(is.numeric, list(~na_if(., Inf)))
Note that this only replaces positive infinity with NA. You need to repeat the step if negative infinity values also need to be replaced:
dat %>% mutate_if(is.numeric, list(~na_if(., Inf))) %>%
mutate_if(is.numeric, list(~na_if(., -Inf)))
Source: Stackoverflow.com