Fast check for NaN in NumPy

Question

I'm looking for the fastest way to check for the occurrence of NaN (np.nan) in a NumPy array X. np.isnan(X) is out of the question, since it builds a boolean array of shape X.shape, which is potentially gigantic.

I tried np.nan in X, but that seems not to work because np.nan != np.nan. Is there a fast and memory-efficient way to do this at all?

(To those who would ask "how gigantic": I can't tell. This is input validation for library code.)

User · Accepted Answer

Ray's solution is good. However, on my machine it is about 2.5x faster to use numpy.sum in place of numpy.min:

    In [13]: %timeit np.isnan(np.min(x))
    1000 loops, best of 3: 244 us per loop

    In [14]: %timeit np.isnan(np.sum(x))
    10000 loops, best of 3: 97.3 us per loop

Unlike min, sum doesn't require branching, which on modern hardware tends to be pretty expensive. This is probably the reason why sum is faster.

edit: The above test was performed with a single NaN right in the middle of the array.

It is interesting to note that min is slower in the presence of NaNs than in their absence. It also seems to get slower as NaNs get closer to the start of the array. On the other hand, sum's throughput seems constant regardless of whether there are NaNs and where they're located:

    In [40]: x = np.random.rand(100000)

    In [41]: %timeit np.isnan(np.min(x))
    10000 loops, best of 3: 153 us per loop

    In [42]: %timeit np.isnan(np.sum(x))
    10000 loops, best of 3: 95.9 us per loop

    In [43]: x[50000] = np.nan

    In [44]: %timeit np.isnan(np.min(x))
    1000 loops, best of 3: 239 us per loop

    In [45]: %timeit np.isnan(np.sum(x))
    10000 loops, best of 3: 95.8 us per loop

    In [46]: x[0] = np.nan

    In [47]: %timeit np.isnan(np.min(x))
    1000 loops, best of 3: 326 us per loop

    In [48]: %timeit np.isnan(np.sum(x))
    10000 loops, best of 3: 95.9 us per loop
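One caveat that the timings above do not cover: a sum-based check also reports NaN for an array that contains both +inf and -inf, because inf + (-inf) is NaN under IEEE 754 arithmetic. The sketch below illustrates this on tiny arrays. The helper names `has_nan_sum` and `has_nan_min` are made up for this example and are not part of NumPy.

```python
import numpy as np


def has_nan_sum(x):
    """Sum-based check: also True when +inf and -inf both occur in x."""
    # Silence the "invalid value" warning that inf + (-inf) may trigger.
    with np.errstate(invalid="ignore"):
        return bool(np.isnan(np.sum(x)))


def has_nan_min(x):
    """Min-based check: np.min propagates NaN, and infinities stay infinite."""
    with np.errstate(invalid="ignore"):
        return bool(np.isnan(np.min(x)))


clean = np.array([1.0, 2.0, 3.0])
with_nan = np.array([1.0, np.nan, 3.0])
mixed_inf = np.array([np.inf, 1.0, -np.inf])  # contains no NaN

print(has_nan_sum(clean), has_nan_min(clean))          # False False
print(has_nan_sum(with_nan), has_nan_min(with_nan))    # True True
print(has_nan_sum(mixed_inf), has_nan_min(mixed_inf))  # True False
```

If infinities are legal input, the sum-based result is therefore only a hint that a slower exact check is needed. If infinities should be rejected as well, the false positive may be acceptable.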

User · Answer

There are two general approaches here:

- Check each array item for nan and take any.
- Apply some cumulative operation that preserves nans (like sum) and check its result.

While the first approach is certainly the cleanest, the heavy optimization of some of the cumulative operations (particularly the ones that are executed in BLAS, like dot) can make those quite fast. Note that dot, like some other BLAS operations, is multithreaded under certain conditions. This explains the difference in speed between different machines.

    import numpy
    import perfplot


    # Note: these kernels deliberately reuse the names min, sum and any,
    # which shadows the Python builtins inside this script.
    def min(a):
        return numpy.isnan(numpy.min(a))


    def sum(a):
        return numpy.isnan(numpy.sum(a))


    def dot(a):
        return numpy.isnan(numpy.dot(a, a))


    def any(a):
        return numpy.any(numpy.isnan(a))


    def einsum(a):
        return numpy.isnan(numpy.einsum("i->", a))


    perfplot.show(
        setup=lambda n: numpy.random.rand(n),
        kernels=[min, sum, dot, any, einsum],
        n_range=[2 ** k for k in range(20)],
        logx=True,
        logy=True,
        xlabel="len(a)",
    )
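For readers without perfplot installed, a rough comparison of the same five kernels can be made with the standard-library `timeit` module. This is only a sketch: the `bench` helper, its parameters, and the two array sizes are assumptions chosen for illustration, and the numbers it prints will vary by machine and BLAS build.

```python
import timeit

import numpy as np

# The kernels from the answer, keyed by name so no Python builtin is shadowed.
kernels = {
    "min": lambda a: np.isnan(np.min(a)),
    "sum": lambda a: np.isnan(np.sum(a)),
    "dot": lambda a: np.isnan(np.dot(a, a)),
    "any": lambda a: np.any(np.isnan(a)),
    "einsum": lambda a: np.isnan(np.einsum("i->", a)),
}


def bench(n, repeat=5, number=20):
    """Return the best per-call time in seconds for each kernel on n random floats."""
    a = np.random.rand(n)
    results = {}
    for name, kernel in kernels.items():
        times = timeit.repeat(lambda: kernel(a), repeat=repeat, number=number)
        results[name] = min(times) / number
    return results


for n in (10**3, 10**5):
    timings = bench(n)
    print(n, {name: f"{t * 1e6:.1f} us" for name, t in timings.items()})
```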

User · Answer

Use .any():

    if numpy.isnan(myarray).any():

numpy.isfinite may be better than isnan for checking, since it also rejects infinities:

    if not np.isfinite(prop).all():
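Both of these expressions allocate a boolean array as large as the input, which is the memory cost the question wants to avoid. One possible workaround, sketched here under the assumption that scanning in fixed-size blocks is acceptable, is to apply the same check chunk by chunk. The function name `any_nan_chunked` and the default `chunk_size` are hypothetical choices, not NumPy API.

```python
import numpy as np


def any_nan_chunked(x, chunk_size=1_000_000):
    """Return True if x contains a NaN, scanning chunk_size elements at a time."""
    # reshape(-1) gives a view when possible; non-contiguous input is copied.
    flat = np.asarray(x).reshape(-1)
    for start in range(0, flat.size, chunk_size):
        # The temporary boolean mask never exceeds chunk_size elements.
        if np.isnan(flat[start:start + chunk_size]).any():
            return True  # stop at the first chunk that contains a NaN
    return False


x = np.random.rand(10_000)
print(any_nan_chunked(x, chunk_size=1024))  # False
x[1234] = np.nan
print(any_nan_chunked(x, chunk_size=1024))  # True
```

This also short-circuits at chunk granularity, so a NaN near the start of the array is found without scanning the rest.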

User · Answer

Related to this is the question of how to find the first occurrence of NaN. This is the fastest way to handle that that I know of:

    index = next((i for (i, n) in enumerate(iterable) if n != n), None)
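The one-liner relies on NaN being the only float value that is not equal to itself. A small runnable sketch, with the expression wrapped in a hypothetical helper named `first_nan_index`:

```python
import numpy as np


def first_nan_index(iterable):
    """Return the index of the first NaN in iterable, or None if there is none."""
    # n != n is True only for NaN, so the generator stops at the first NaN.
    return next((i for i, n in enumerate(iterable) if n != n), None)


print(first_nan_index([1.0, 2.0, float("nan"), 4.0]))  # 2
print(first_nan_index([1.0, 2.0]))                     # None
print(first_nan_index(np.array([np.nan, 1.0])))        # 0
```

This iterates element by element in Python, so it works on one-dimensional input. For a multi-dimensional array, passing `x.flat` yields scalars and returns an index into the flattened array.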

User · Answer

Even though there is an accepted answer, I'd like to demonstrate the following (with Python 2.7.2 and NumPy 1.6.0 on Vista):

    In []: x= rand(1e5)
    In []: %timeit isnan(x.min())
    10000 loops, best of 3: 200 us per loop
    In []: %timeit isnan(x.sum())
    10000 loops, best of 3: 169 us per loop
    In []: %timeit isnan(dot(x, x))
    10000 loops, best of 3: 134 us per loop

    In []: x[5e4]= NaN
    In []: %timeit isnan(x.min())
    100 loops, best of 3: 4.47 ms per loop
    In []: %timeit isnan(x.sum())
    100 loops, best of 3: 6.44 ms per loop
    In []: %timeit isnan(dot(x, x))
    10000 loops, best of 3: 138 us per loop

Thus, the really efficient way might be heavily dependent on the operating system. Anyway, the dot(.)-based check seems to be the most stable one.

User · Answer

If you're comfortable with numba, it lets you create a fast short-circuiting function (one that stops as soon as a NaN is found):

    import numba as nb
    import math

    @nb.njit
    def anynan(array):
        array = array.ravel()
        for i in range(array.size):
            if math.isnan(array[i]):
                return True
        return False

If there is no NaN, the function might actually be slower than np.min. I think that's because np.min uses multiprocessing for large arrays:

    import numpy as np
    array = np.random.random(2000000)

    %timeit anynan(array)          # 100 loops, best of 3: 2.21 ms per loop
    %timeit np.isnan(array.sum())  # 100 loops, best of 3: 4.45 ms per loop
    %timeit np.isnan(array.min())  # 1000 loops, best of 3: 1.64 ms per loop

But if there is a NaN in the array, especially if its position is at a low index, then it's much faster:

    array = np.random.random(2000000)
    array[100] = np.nan

    %timeit anynan(array)          # 1000000 loops, best of 3: 1.93 µs per loop
    %timeit np.isnan(array.sum())  # 100 loops, best of 3: 4.57 ms per loop
    %timeit np.isnan(array.min())  # 1000 loops, best of 3: 1.65 ms per loop

Similar results may be achieved with Cython or a C extension. These are a bit more complicated (or easily available as bottleneck.anynan), but ultimately do the same as my anynan function.

User · Answer

I think np.isnan(np.min(X)) should do what you want.
