[python] What is the most efficient way to check if a value exists in a NumPy array?

I have a very large NumPy array

1 40 3
4 50 4
5 60 7
5 49 6
6 70 8
8 80 9
8 72 1
9 90 7
.... 

I want to check to see if a value exists in the 1st column of the array. I've got a bunch of homegrown ways (e.g. iterating through each row and checking), but given the size of the array I'd like to find the most efficient method.

Thanks!

The answer is


The most obvious to me would be:

np.any(my_array[:, 0] == value)
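A minimal runnable sketch of that one-liner, using the rows from the question as sample data. The names my_array and value follow the snippet above; value_in_first_column is a hypothetical helper added only for illustration.

```python
import numpy as np

# Sample data copied from the question; a real array would be much larger.
my_array = np.array([
    [1, 40, 3],
    [4, 50, 4],
    [5, 60, 7],
    [5, 49, 6],
    [6, 70, 8],
    [8, 80, 9],
    [8, 72, 1],
    [9, 90, 7],
])


def value_in_first_column(arr, value):
    """Return True if `value` occurs anywhere in column 0 of `arr`."""
    # arr[:, 0] selects the first column; the comparison builds a boolean
    # array, and np.any reduces it to a single truth value.
    return bool(np.any(arr[:, 0] == value))


print(value_in_first_column(my_array, 8))  # True
print(value_in_first_column(my_array, 7))  # False
```

The comparison touches every element of the column, so the cost grows linearly with the number of rows.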

Fascinating. I needed to speed up a series of loops that look up matching indices in this same way, so I decided to time all the solutions here, along with some riffs on them.

Here are my speed tests for Python 2.7.10:

import timeit
timeit.timeit('N.any(N.in1d(sids, val))', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

18.86137104034424

timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = [20010401010101+x for x in range(1000)]')

15.061666011810303

timeit.timeit('N.in1d(sids, val)', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

11.613027095794678

timeit.timeit('N.any(val == sids)', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

7.670552015304565

timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

5.610057830810547

timeit.timeit('val == sids', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

1.6632978916168213

timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = set([20010401010101+x for x in range(1000)])')

0.0548710823059082

timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = dict(zip([20010401010101+x for x in range(1000)],[True,]*1000))')

0.054754018783569336

Very surprising! Orders of magnitude difference!

To summarize (times are totals for timeit's default of 1,000,000 runs), if you just want to know whether something is in a 1D sequence or not:

  • 19s N.any(N.in1d(numpy array, x))
  • 15s x in (list)
  • 8s N.any(x == numpy array)
  • 6s x in (numpy array)
  • 0.05s x in (set or dictionary)

If you want to know where something is in the list as well (order is important):

  • 12s N.in1d(numpy array, x)
  • 2s x == (numpy array)
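The two cases in this summary can be sketched as follows. This is only an illustration under assumptions: sids and the missing value match the timing setup above (that value lies outside the generated range, so the timings measure a miss), while present, sid_set and positions are hypothetical names introduced here.

```python
import numpy as np

sids = np.array([20010401010101 + x for x in range(1000)])
present = 20010401010101 + 42   # hypothetical value that is in sids
missing = 20010401020091        # the value timed above; not in sids

# Membership only: build the set once, then each lookup is a hash probe.
sid_set = set(sids.tolist())
print(present in sid_set)  # True
print(missing in sid_set)  # False

# Membership plus position: the boolean mask marks matching elements,
# and np.flatnonzero converts the mask into their indices.
positions = np.flatnonzero(sids == present)
print(positions.tolist())  # [42]
```

A set or dictionary discards order, which is why the position lookup stays on the array.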

Adding to @HYRY's answer: in1d seems to be the fastest option within NumPy. This is using NumPy 1.8 and Python 2.7.6.

In this test in1d was faster, although 10 in a looks cleaner:

from numpy import arange, in1d
a = arange(0, 99999, 3)
%timeit 10 in a
%timeit in1d(a, 10)

10000 loops, best of 3: 150 µs per loop
10000 loops, best of 3: 61.9 µs per loop

Constructing a set is slower than calling in1d, but once the set exists, checking whether a value is in it is far faster (about 47 ns here, versus roughly 62 µs for in1d):

s = set(range(0, 99999, 3))
%timeit 10 in s

10000000 loops, best of 3: 47 ns per loop
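Whether the set pays off depends on how many lookups share one construction. The sketch below measures both costs with the standard timeit module instead of IPython's %timeit. The variable names are illustrative, and the numbers will differ by machine and NumPy version, so no expected output is shown.

```python
import timeit

import numpy as np

a = np.arange(0, 99999, 3)

# One-off cost: converting the array into a set (averaged over 10 builds).
build_time = timeit.timeit(lambda: set(a.tolist()), number=10) / 10

s = set(a.tolist())

# Per-lookup cost, averaged over n calls each.
n = 1000
set_lookup = timeit.timeit(lambda: 10 in s, number=n) / n
array_lookup = timeit.timeit(lambda: 10 in a, number=n) / n

print("build set:    %.3g s" % build_time)
print("set lookup:   %.3g s" % set_lookup)
print("array lookup: %.3g s" % array_lookup)
```

If only a handful of lookups are needed, the build cost may outweigh the savings. With many lookups against the same data, the set should win.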

In my opinion, the most convenient way is:

(Val in X[:, col_num])

where Val is the value that you want to check for and X is the array. In your example, suppose you want to check whether the value 8 exists in the third column. Simply write

(8 in X[:, 2])

This will return True if 8 is there in the third column, else False.
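A short sketch with the question's data shows why the column slice matters. X is assumed to hold the sample rows from the question. Without the slice, in searches every element of the 2D array.

```python
import numpy as np

# Sample rows from the question.
X = np.array([[1, 40, 3], [4, 50, 4], [5, 60, 7], [5, 49, 6],
              [6, 70, 8], [8, 80, 9], [8, 72, 1], [9, 90, 7]])

print(8 in X[:, 2])   # True: 8 appears in the third column
print(40 in X[:, 0])  # False: 40 is not in the first column
print(40 in X)        # True: without the slice, all elements are searched
```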


To check multiple values, you can use numpy.in1d(), which is an element-wise function version of the Python keyword in. If your data is sorted, you can use numpy.searchsorted():

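import numpy as np
data = np.array([1, 4, 5, 5, 6, 8, 8, 9])  # sorted, as searchsorted requires
values = [2, 3, 4, 6, 7]
# Element-wise membership test: one boolean per entry of values
print(np.in1d(values, data))

# Binary search for each value's insertion point, then compare in place.
# Assumes every value is <= data.max(); otherwise data[index] raises IndexError.
index = np.searchsorted(data, values)
print(data[index] == values)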
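The searchsorted approach above breaks when a queried value is larger than everything in data, because the returned insertion point equals len(data). One possible guard, offered as a sketch and not as part of the original answer, is to clip the indices before comparing. The extra value 10 and the names safe_index and found are added here for illustration. np.isin is the newer spelling of in1d; to my knowledge in1d is deprecated in recent NumPy releases in its favor.

```python
import numpy as np

data = np.array([1, 4, 5, 5, 6, 8, 8, 9])   # must be sorted for searchsorted
values = np.array([2, 3, 4, 6, 7, 10])      # 10 exceeds every element of data

# searchsorted returns len(data) for values past the end, which is not a
# valid index, so clip into range before indexing.
index = np.searchsorted(data, values)
safe_index = np.clip(index, 0, len(data) - 1)
found = data[safe_index] == values
print(found.tolist())  # [False, False, True, True, False, False]

# np.isin gives the same answer and does not require sorted data.
print(np.isin(values, data).tolist())
```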
