numpy most efficient frequency counts for unique values in an array

Question

In numpy   scipy  is there an efficient way to get frequency counts for unique values in an array   Something along these lines   x   array   1 1 1 2 2 2 5 25 1 1    y   freq count  x   print y   gt  gt    1  5    2 3    5 1    25 1       For you  R users out there  I m basically looking for the table   function

User · Answer

Using pandas module    gt  gt  gt  import pandas as pd  gt  gt  gt  import numpy as np  gt  gt  gt  x   np array  1 1 1 2 2 2 5 25 1 1    gt  gt  gt  pd value counts x  1     5 2     3 25    1 5     1 dtype  int64

User · Answer

Update  The method mentioned in the original answer is deprecated  we should use the new way instead    gt  gt  gt  import numpy as np  gt  gt  gt  x    1 1 1 2 2 2 5 25 1 1   gt  gt  gt  np array np unique x  return counts True   T     array    1   5                2   3                5   1               25   1        Original answer   you can use scipy stats itemfreq   gt  gt  gt  from scipy stats import itemfreq  gt  gt  gt  x    1 1 1 2 2 2 5 25 1 1   gt  gt  gt  itemfreq x   usr local bin python 1  DeprecationWarning   itemfreq  is deprecated   itemfreq  is deprecated and will be removed in a future version  Use instead  np unique      return counts True   array     1     5              2     3              5     1             25     1

User · Answer

Old question  but I d like to provide my own solution which turn out to be the fastest  use normal list instead of np array as input  or transfer to list firstly   based on my bench test   Check it out if you encounter it as well   def count a       results          for x in a          if x not in results              results x    1         else              results x     1     return results   For example     gt  gt  gt timeit count  1 1 1 2 2 2 5 25 1 1   would return    100000 loops  best of 3  2 26   s per loop   gt  gt  gt timeit count np array  1 1 1 2 2 2 5 25 1 1      100000 loops  best of 3  8 8   s per loop   gt  gt  gt timeit count np array  1 1 1 2 2 2 5 25 1 1   tolist      100000 loops  best of 3  5 85   s per loop  While the accepted answer would be slower  and the scipy stats itemfreq solution is even worse     A more indepth testing did not confirm the formulated expectation   from zmq import Stopwatch aZmqSTOPWATCH   Stopwatch    aDataSETasARRAY     100   abs  np random randn  150000       astype  np int   aDataSETasLIST    aDataSETasARRAY tolist    import numba  numba jit def numba bincount  anObject        np bincount     anObject       return  aZmqSTOPWATCH start   np bincount     aDataSETasARRAY   aZmqSTOPWATCH stop   14328L  aZmqSTOPWATCH start   numba bincount  aDataSETasARRAY   aZmqSTOPWATCH stop   592L  aZmqSTOPWATCH start   count           aDataSETasLIST    aZmqSTOPWATCH stop   148609L   Ref  comments below on cache and other in-RAM side-effects that influence a small dataset massively repetitive testing results

User · Answer

Even though it has already been answered  I suggest a different approach that makes use of numpy histogram  Such function given a sequence it returns the frequency of its elements grouped in bins   Beware though  it works in this example because numbers are integers  If they where real numbers  then this solution would not apply as nicely    gt  gt  gt  from numpy import histogram  gt  gt  gt  y   histogram  x  bins x max  -1   gt  gt  gt  y  array  5  3  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0         1     array    1     2     3     4     5     6     7     8     9    10    11           12    13    14    15    16    17    18    19    20    21    22           23    24    25

User · Answer

some thing like this should do it    create 100 random numbers arr   numpy random random integers 0 50 100    create a dictionary of the unique values d   dict   i 0  for i in numpy unique arr    for number in arr      d j   1    increment when that value is found   Also  this previous post on  Efficiently counting unique elements  seems pretty similar to your question  unless I m missing something

User · Answer

This is by far the most general and performant solution  surprised it hasn t been posted yet   import numpy as np  def unique count a       unique  inverse   np unique a  return inverse True      count   np zeros len unique   np int      np add at count  inverse  1      return np vstack   unique  count   T  print unique count np random randint -10 10 100     Unlike the currently accepted answer  it works on any datatype that is sortable  not just positive ints   and it has optimal performance  the only significant expense is in the sorting done by np unique

User · Answer

I was also interested in this  so I did a little performance comparison  using perfplot  a pet project of mine   Result   y   np bincount a  ii   np nonzero y  0  out   np vstack  ii  y ii    T   is by far the fastest   Note the log-scaling        Code to generate the plot   import numpy as np import pandas as pd import perfplot from scipy stats import itemfreq   def bincount a       y   np bincount a      ii   np nonzero y  0      return np vstack  ii  y ii    T   def unique a       unique  counts   np unique a  return counts True      return np asarray  unique  counts   T   def unique count a       unique  inverse   np unique a  return inverse True      count   np zeros len unique   np int      np add at count  inverse  1      return np vstack  unique  count   T   def pandas value counts a       out   pd value counts pd Series a       out sort index inplace True      out   np stack  out keys   values  out values   T     return out   perfplot show      setup lambda n  np random randint 0  1000  n       kernels  bincount  unique  itemfreq  unique count  pandas value counts       n range  2    k for k in range 26        logx True      logy True      xlabel  len a

User · Answer

numpy bincount is the probably the best choice  If your array contains anything besides small dense integers it might be useful to wrap it something like this   def count unique keys       uniq keys   np unique keys      bins   uniq keys searchsorted keys      return uniq keys  np bincount bins    For example    gt  gt  gt  x   array  1 1 1 2 2 2 5 25 1 1    gt  gt  gt  count unique x   array   1   2   5  25    array  5  3  1  1

User · Answer

import pandas as pd import numpy as np x   np array   1 1 1 2 2 2 5 25 1 1    print dict pd Series x  value counts       This gives you   1  5  2  3  5  1  25  1

User · Answer

import pandas as pd import numpy as np  print pd Series name of array  value counts

User · Answer

Take a look at np bincount   http   docs scipy org doc numpy reference generated numpy bincount html  import numpy as np x   np array  1 1 1 2 2 2 5 25 1 1   y   np bincount x  ii   np nonzero y  0    And then   zip ii y ii        1  5    2  3    5  1    25  1     or   np vstack  ii y ii    T   array    1   5              2   3              5   1             25   1      or however you want to combine the counts and the unique values

User · Answer

To count unique non-integers - similar to Eelco Hoogendoorn s answer but considerably faster  factor of 5 on my machine   I used weave inline to combine numpy unique with a bit of c-code    import numpy as np from scipy import weave  def count unique datain           Similar to numpy unique function for returning unique members of   data  but also returns their counts         data   np sort datain    uniq   np unique data    nums   np zeros uniq shape  dtype  int      code       int i count j    j 0    count 0    for i 1  i lt Ndata 0   i           count          if data i   gt  data i-1              nums j    count            count   0            j                     Handle last value   nums j    count 1          weave inline code          data    nums          extra compile args   -O2          type converters weave converters blitz    return uniq  nums   Profile info   gt   timeit count unique data   gt  10000 loops  best of 3  55 1   s per loop   Eelco s pure numpy version     gt   timeit unique count data   gt  1000 loops  best of 3  284   s per loop   Note  There s redundancy here  unique performs a sort also   meaning that the code could probably be further optimized by putting the unique functionality inside the c-code loop

User · Answer

Most of simple problems get complicated because simple functionality like order   in R that gives a statistical result in both and descending order is missing in various python libraries  But if we devise our thinking that all such statistical ordering and parameters in python are easily found in pandas  we can can result sooner than looking in 100 different places  Also  development of R and pandas go hand-in-hand because they were created for same purpose  To solve this problem I use following code that gets me by anywhere  unique  counts   np unique x  return counts True  d     unique  unique   counts  count     pass the list to a dictionary df   pd DataFrame d   dictionary object can be easily passed to make a dataframe df sort values by    count   ascending False  inplace   True  df   df reset index drop True   optional only if you want to use it further

User · Answer

multi-dimentional frequency count  i e  counting arrays    gt  gt  gt  print color array        array   255  128  128       255  128  128       255  128  128               255  128  128       255  128  128       255  128  128    dtype uint8     gt  gt  gt  np unique color array return counts True axis 0     array    60  151  161         60  155  162         60  159  163         61  143  162         61  147  162         61  162  163         62  166  164         63  137  162         63  169  164      array       1       2       2       1       4       1       1       2           3       1       1       1       2       5       2       2         898       1       1

User · Answer

from collections import Counter x   array   1 1 1 2 2 2 5 25 1 1    mode   counter most common 1  0  0

User · Answer

As of Numpy 1 9  the easiest and fastest method is to simply use numpy unique  which now has a return counts keyword argument   import numpy as np  x   np array  1 1 1 2 2 2 5 25 1 1   unique  counts   np unique x  return counts True   print np asarray  unique  counts   T   Which gives       1  5      2  3      5  1     25  1     A quick comparison with scipy stats itemfreq   In  4   x   np random random integers 0 100 1e6   In  5    timeit unique  counts   np unique x  return counts True  10 loops  best of 3  31 5 ms per loop  In  6    timeit scipy stats itemfreq x  10 loops  best of 3  170 ms per loop

[python] numpy: most efficient frequency counts for unique values in an array

Examples related to python

Examples related to arrays

Examples related to performance

Examples related to numpy