I have a huge vector which has a couple of NA values, and I'm trying to find the max value in that vector (the vector is all numbers), but I can't do this because of the NA values.
How can I remove the NA values so that I can compute the max?
I ran a quick benchmark comparing the two base approaches, and it turns out that x[!is.na(x)] is faster than na.omit. User qwr suggested I also try purrr::discard; this turned out to be massively slower (though I'll happily take comments on my implementation and test!)
microbenchmark::microbenchmark(
purrr::map(airquality,function(x) {x[!is.na(x)]}),
purrr::map(airquality,na.omit),
purrr::map(airquality, ~purrr::discard(.x, .p = is.na)),
times = 1e6)
Unit: microseconds
expr min lq mean median uq max neval cld
purrr::map(airquality, function(x) { x[!is.na(x)] }) 66.8 75.9 130.5643 86.2 131.80 541125.5 1e+06 a
purrr::map(airquality, na.omit) 95.7 107.4 185.5108 129.3 190.50 534795.5 1e+06 b
purrr::map(airquality, ~purrr::discard(.x, .p = is.na)) 3391.7 3648.6 5615.8965 4079.7 6486.45 1121975.4 1e+06 c
For reference, here's the original test of x[!is.na(x)] vs na.omit:
microbenchmark::microbenchmark(
purrr::map(airquality,function(x) {x[!is.na(x)]}),
purrr::map(airquality,na.omit),
times = 1000000)
Unit: microseconds
expr min lq mean median uq max neval cld
map(airquality, function(x) { x[!is.na(x)] }) 53.0 56.6 86.48231 58.1 64.8 414195.2 1e+06 a
map(airquality, na.omit) 85.3 90.4 134.49964 92.5 104.9 348352.8 1e+06 b
?max shows you that there is an extra parameter na.rm that you can set to TRUE.
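As a minimal sketch of what na.rm does, here is a made-up vector x (the values are illustrative and not from the question):

```r
# Hypothetical example vector with two missing values
x <- c(3, NA, 7, 1, NA, 5)

max(x)                # NA, because missing values propagate
max(x, na.rm = TRUE)  # 7
min(x, na.rm = TRUE)  # 1
```

The vector itself is left untouched; na.rm only affects that one computation.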
Apart from that, if you really want to remove the NAs, just use something like:
myvec[!is.na(myvec)]
You can call max(vector, na.rm = TRUE). More generally, you can use the na.omit() function.
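A short sketch of na.omit() on a plain vector may help here. The vector x is a made-up example; note that na.omit() keeps a record of what it dropped in an attribute, which as.vector() strips if you want a bare vector:

```r
# Hypothetical example vector
x <- c(3, NA, 7, 1, NA, 5)

cleaned <- na.omit(x)
max(cleaned)                 # 7

# na.omit() records the positions it dropped in an attribute
attr(cleaned, "na.action")   # positions 2 and 5, with class "omit"

# as.vector() drops that attribute if a plain vector is wanted
plain <- as.vector(cleaned)
```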
Just in case someone new to R wants a simplified answer to the original question:
How can I remove NA values from a vector?
Here it is:
Assume you have a vector foo as follows:
foo = c(1:10, NA, 20:30)
Running length(foo) gives 22.
nona_foo = foo[!is.na(foo)]
length(nona_foo) is 21, because the NA value has been removed.
Remember that is.na(foo) returns a logical vector (not a matrix), so indexing foo with the negation of that vector gives you all the elements which are not NA.
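The indexing step can be spelled out with an intermediate variable, as in the sketch below. The names keep, all_missing and remaining are only illustrative. It also shows one edge case worth checking: if every element is NA, nothing is left to take the max of.

```r
foo <- c(1:10, NA, 20:30)

keep <- !is.na(foo)   # logical vector, TRUE where a value is present
sum(!keep)            # 1 missing value
nona_foo <- foo[keep]
max(nona_foo)         # 30

# Edge case: if every element is NA, nothing is left after filtering
all_missing <- c(NA_real_, NA_real_)
remaining <- all_missing[!is.na(all_missing)]
length(remaining)     # 0
# max(remaining) would return -Inf with a warning, so check the length first
```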
Use discard from purrr (works with lists and vectors).
discard(v, is.na)
The benefit is that it is easy to use with pipes; alternatively, use the built-in subsetting function [:
v %>% discard(is.na)
v %>% `[`(!is.na(.))
Note that na.omit does not work on lists:
> x <- list(a=1, b=2, c=NA)
> na.omit(x)
$a
[1] 1
$b
[1] 2
$c
[1] NA
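For a list like the one above, base R subsetting still works, because is.na() on a list flags elements that are a single NA. The sketch below assumes every element has length one; lists with longer or nested elements would need a different test.

```r
x <- list(a = 1, b = 2, c = NA)

# is.na() on a list returns one TRUE/FALSE per element
cleaned <- x[!is.na(x)]
names(cleaned)   # "a" "b"

# Equivalent using base R's Filter()
cleaned2 <- Filter(Negate(is.na), x)
names(cleaned2)  # "a" "b"
```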
The na.omit function is what a lot of the regression routines use internally:
vec <- 1:1000
vec[runif(200, 1, 1000)] <- NA
max(vec)
#[1] NA
max( na.omit(vec) )
#[1] 1000
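As one illustration of the regression point, lm() takes its default NA handling from the na.action option, which is na.omit in a default R session. The sketch below uses the built-in airquality data; the choice of Ozone and Wind is arbitrary.

```r
# airquality ships with R; Ozone contains missing values
fit <- lm(Ozone ~ Wind, data = airquality)

# Rows with an NA in either variable were dropped before fitting,
# so these two counts agree
nobs(fit)
sum(complete.cases(airquality[, c("Ozone", "Wind")]))

# The default handler comes from this option
getOption("na.action")   # "na.omit" in a default R session
```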
Source: Stackoverflow.com