[r] Cleaning `Inf` values from an R dataframe

In R, I have an operation which creates some Inf values when I transform a dataframe.

I would like to turn these Inf values into NA values. The code I have is slow for large data, is there a faster way of doing this?

Say I have the following dataframe:

dat <- data.frame(a=c(1, Inf), b=c(Inf, 3), d=c("a","b"))

The following works in a single case:

 dat[,1][is.infinite(dat[,1])] = NA

So I generalized it with following loop

cf_DFinf2NA <- function(x)
{
    for (i in 1:ncol(x)){
          x[,i][is.infinite(x[,i])] = NA
    }
    return(x)
}

But I don't think that this is really using the power of R.

This question is related to r dataframe data.table

The answer is


Another solution:

    dat <- data.frame(a = rep(c(1,Inf), 1e6), b = rep(c(Inf,2), 1e6), 
                      c = rep(c('a','b'),1e6),d = rep(c(1,Inf), 1e6),  
                      e = rep(c(Inf,2), 1e6))
    system.time(dat[dat==Inf] <- NA)

#   user  system elapsed
#  0.316   0.024   0.340

Feng Mai has a tidyverse answer above to get negative and positive infinities:

dat %>% mutate_if(is.numeric, list(~na_if(., Inf))) %>% 
  mutate_if(is.numeric, list(~na_if(., -Inf)))

This works well, but a word of warning is not to swap in abs(.) here to do both lines at once as is proposed in an upvoted comment. It will look like it works, but changes all negative values in the dataset to positive! You can confirm with this:

data(iris)
#The last line here is bad - it converts all negative values to positive
iris %>% 
  mutate_if(is.numeric, ~scale(.)) %>%
  mutate(infinities = Sepal.Length / 0) %>%
  mutate_if(is.numeric, list(~na_if(abs(.), Inf)))

For one line, this works:

  mutate_if(is.numeric, ~ifelse(abs(.) == Inf,NA,.))

[<- with mapply is a bit faster than sapply.

> dat[mapply(is.infinite, dat)] <- NA

With mnel's data, the timing is

> system.time(dat[mapply(is.infinite, dat)] <- NA)
#   user  system elapsed 
# 15.281   0.000  13.750 

There is very simple solution to this problem in the hablar package:

library(hablar)

dat %>% rationalize()

Which return a data frame with all Inf are converted to NA.

Timings compared to some above solutions. Code: library(hablar) library(data.table)

dat <- data.frame(a = rep(c(1,Inf), 1e6), b = rep(c(Inf,2), 1e6), 
                  c = rep(c('a','b'),1e6),d = rep(c(1,Inf), 1e6),  
                  e = rep(c(Inf,2), 1e6))
DT <- data.table(dat)

system.time(dat[mapply(is.infinite, dat)] <- NA)
system.time(dat[dat==Inf] <- NA)
system.time(invisible(lapply(names(DT),function(.name) set(DT, which(is.infinite(DT[[.name]])), j = .name,value =NA))))
system.time(rationalize(dat))

Result:

> system.time(dat[mapply(is.infinite, dat)] <- NA)
   user  system elapsed 
  0.125   0.039   0.164 
> system.time(dat[dat==Inf] <- NA)
   user  system elapsed 
  0.095   0.010   0.108 
> system.time(invisible(lapply(names(DT),function(.name) set(DT, which(is.infinite(DT[[.name]])), j = .name,value =NA))))
   user  system elapsed 
  0.065   0.002   0.067 
> system.time(rationalize(dat))
   user  system elapsed 
  0.058   0.014   0.072 
> 

Seems like data.table is faster than hablar. But has longer syntax.


Use sapply and is.na<-

> dat <- data.frame(a=c(1, Inf), b=c(Inf, 3), d=c("a","b"))
> is.na(dat) <- sapply(dat, is.infinite)
> dat
   a  b d
1  1 NA a
2 NA  3 b

Or you can use (giving credit to @mnel, whose edit this is),

> is.na(dat) <- do.call(cbind,lapply(dat, is.infinite))

which is significantly faster.


Also, if someone need the Infs' coordinates, can do this:

library(rlist)
list.clean(apply(df, 2, function(x){which(is.infinite(x))}), function(x) length(x) == 0L, TRUE)

Result:

$colname1
[1] row1 row2 ...
$colname2
[2] row1 row2 ... 

With this information, you can replace the Inf values in particular places with the mean, median, or whatever operator that you want.

For example (for element 01):

repInf = list.clean(apply(df, 2, function(x){which(is.infinite(x))}), function(x) length(x) == 0L, TRUE)
df[repInf[[1]], names(repInf)[[1]]] = median or mean(is.finite(df[ ,names(repInf)[[1]]]), na.rm = TRUE)

In loop:

for (nonInf in 1:length(repInf)) {
df[repInf[[nonInf]], names(repInf)[[nonInf]]] = mean(is.finite(df[ , names(repInf)[[nonInf]]]))
}

You may also use the handy replace_na function: https://tidyr.tidyverse.org/reference/replace_na.html


Here is a dplyr/tidyverse solution using the na_if() function:

dat %>% mutate_if(is.numeric, list(~na_if(., Inf)))

Note that this only replaces positive infinity with NA. Need to repeat if negative infinity values also need to be replaced.

dat %>% mutate_if(is.numeric, list(~na_if(., Inf))) %>% 
  mutate_if(is.numeric, list(~na_if(., -Inf)))

Examples related to r

How to get AIC from Conway–Maxwell-Poisson regression via COM-poisson package in R? R : how to simply repeat a command? session not created: This version of ChromeDriver only supports Chrome version 74 error with ChromeDriver Chrome using Selenium How to show code but hide output in RMarkdown? remove kernel on jupyter notebook Function to calculate R2 (R-squared) in R Center Plot title in ggplot2 R ggplot2: stat_count() must not be used with a y aesthetic error in Bar graph R multiple conditions in if statement What does "The following object is masked from 'package:xxx'" mean?

Examples related to dataframe

Trying to merge 2 dataframes but get ValueError How to show all of columns name on pandas dataframe? Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Display all dataframe columns in a Jupyter Python Notebook How to convert column with string type to int form in pyspark data frame? Display/Print one column from a DataFrame of Series in Pandas Binning column with python pandas Selection with .loc in python Set value to an entire column of a pandas dataframe

Examples related to data.table

Error: package or namespace load failed for ggplot2 and for data.table Select subset of columns in data.table R How to get week numbers from dates? data.table vs dplyr: can one do something well the other can't or does poorly? How to replace NA values in a table for selected columns Select multiple columns in data.table by their numeric indices Sort rows in data.table in decreasing order on string key `order(-x,v)` gives error on data.table 1.9.4 or earlier Cleaning `Inf` values from an R dataframe Aggregate / summarize multiple variables per group (e.g. sum, mean) How do you delete a column by name in data.table?