Old question, but I'd like to offer my own solution, which turned out to be the fastest in my benchmarks: use a plain list
instead of np.array
as input (or convert to a list first).
Check it out if you run into the same problem.
def count(a):
    results = {}
    for x in a:
        if x not in results:
            results[x] = 1
        else:
            results[x] += 1
    return results
For example:

>>> timeit count([1,1,1,2,2,2,5,25,1,1])
100000 loops, best of 3: 2.26 µs per loop

>>> timeit count(np.array([1,1,1,2,2,2,5,25,1,1]))
100000 loops, best of 3: 8.8 µs per loop

>>> timeit count(np.array([1,1,1,2,2,2,5,25,1,1]).tolist())
100000 loops, best of 3: 5.85 µs per loop
The accepted answer is slower, and the scipy.stats.itemfreq
solution is even worse (itemfreq has since been deprecated in SciPy in favour of np.unique with return_counts=True).
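As an aside, the standard library's collections.Counter implements the same dict-based tally as count(), so it makes a handy baseline; a minimal sketch:

from collections import Counter

# Counter performs the same per-element tally as count() above,
# with the inner loop implemented inside the standard library
print( Counter( [1,1,1,2,2,2,5,25,1,1] ) )
# -> Counter({1: 5, 2: 3, 5: 1, 25: 1})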
More in-depth testing did not confirm this expectation.
import numpy as np
from zmq import Stopwatch

aZmqSTOPWATCH = Stopwatch()

aDataSETasARRAY = ( 100 * abs( np.random.randn( 150000 ) ) ).astype( np.int64 )
aDataSETasLIST  = aDataSETasARRAY.tolist()

import numba

@numba.jit
def numba_bincount( anObject ):
    # result intentionally discarded; only the call itself is timed
    np.bincount( anObject )
    return
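One caveat: @numba.jit compiles a function on its first call, so a warm-up call keeps the one-off compilation cost out of the measurement; a minimal precaution along these lines:

# warm-up: the first call triggers JIT compilation; subsequent timed
# calls then measure only the compiled code path
numba_bincount( aDataSETasARRAY )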
Timed with a zmq Stopwatch (results in microseconds):

>>> aZmqSTOPWATCH.start(); np.bincount( aDataSETasARRAY ); aZmqSTOPWATCH.stop()
14328L

>>> aZmqSTOPWATCH.start(); numba_bincount( aDataSETasARRAY ); aZmqSTOPWATCH.stop()
592L

>>> aZmqSTOPWATCH.start(); count( aDataSETasLIST ); aZmqSTOPWATCH.stop()
148609L
See the comments below on cache and other in-RAM side effects, which massively influence the results of repetitive testing on small datasets.
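To reproduce the comparison without pyzmq, here is a minimal, self-contained sketch using the standard timeit module (the names and loop counts are my own choices, and absolute numbers will vary by machine). Taking the best of several repeats, as the timeit docs recommend, damps exactly that kind of cache and scheduler noise:

import timeit

setup = """
import numpy as np

def count(a):
    results = {}
    for x in a:
        if x not in results:
            results[x] = 1
        else:
            results[x] += 1
    return results

data_array = ( 100 * abs( np.random.randn( 150000 ) ) ).astype( np.int64 )
data_list  = data_array.tolist()
"""

# best-of-5 repeats of 10 calls each; min() filters out runs that
# were disturbed by other processes or cold caches
for stmt in ( "np.bincount( data_array )",
              "count( data_array )",
              "count( data_list )" ):
    best = min( timeit.repeat( stmt, setup = setup, number = 10, repeat = 5 ) )
    print( "%-25s %10.1f us per call" % ( stmt, best / 10 * 1e6 ) )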