I have a very large NumPy array
1 40 3
4 50 4
5 60 7
5 49 6
6 70 8
8 80 9
8 72 1
9 90 7
....
I want to check to see if a value exists in the 1st column of the array. I've got a bunch of homegrown ways (e.g. iterating through each row and checking), but given the size of the array I'd like to find the most efficient method.
Thanks!
This question is related to
python
performance
numpy
The most obvious to me would be:
np.any(my_array[:, 0] == value)
Fascinating. I needed to improve the speed of a series of loops that must perform matching index determination in this same way. So I decided to time all the solutions here, along with some riff's.
Here are my speed tests for Python 2.7.10:
import timeit
timeit.timeit('N.any(N.in1d(sids, val))', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')
18.86137104034424
timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = [20010401010101+x for x in range(1000)]')
15.061666011810303
timeit.timeit('N.in1d(sids, val)', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')
11.613027095794678
timeit.timeit('N.any(val == sids)', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')
7.670552015304565
timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')
5.610057830810547
timeit.timeit('val == sids', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')
1.6632978916168213
timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = set([20010401010101+x for x in range(1000)])')
0.0548710823059082
timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = dict(zip([20010401010101+x for x in range(1000)],[True,]*1000))')
0.054754018783569336
Very surprising! Orders of magnitude difference!
To summarize, if you just want to know whether something's in a 1D list or not:
If you want to know where something is in the list as well (order is important):
Adding to @HYRY's answer in1d seems to be fastest for numpy. This is using numpy 1.8 and python 2.7.6.
In this test in1d was fastest, however 10 in a
look cleaner:
a = arange(0,99999,3)
%timeit 10 in a
%timeit in1d(a, 10)
10000 loops, best of 3: 150 µs per loop
10000 loops, best of 3: 61.9 µs per loop
Constructing a set is slower than calling in1d, but checking if the value exists is a bit faster:
s = set(range(0, 99999, 3))
%timeit 10 in s
10000000 loops, best of 3: 47 ns per loop
The most convenient way according to me is:
(Val in X[:, col_num])
where Val is the value that you want to check for and X is the array. In your example, suppose you want to check if the value 8 exists in your the third column. Simply write
(8 in X[:, 2])
This will return True if 8 is there in the third column, else False.
To check multiple values, you can use numpy.in1d(), which is an element-wise function version of the python keyword in. If your data is sorted, you can use numpy.searchsorted():
import numpy as np
data = np.array([1,4,5,5,6,8,8,9])
values = [2,3,4,6,7]
print np.in1d(values, data)
index = np.searchsorted(data, values)
print data[index] == values
Source: Stackoverflow.com