Find unique rows in numpy array

Question

I need to find unique rows in a numpy array   For example    gt  gt  gt  a   I have array   1  1  1  0  0  0           0  1  1  1  0  0           0  1  1  1  0  0           1  1  1  0  0  0           1  1  1  1  1  0     gt  gt  gt  new a   I want to get to array   1  1  1  0  0  0           0  1  1  1  0  0           1  1  1  1  1  0      I know that i can create a set and loop over the array  but I am looking for an efficient pure numpy solution  I believe that there is a way to set data type to void and then I could just use numpy unique  but I couldn t figure out how to make it work

User · Answer

np unique works by sorting a flattened array  then looking at whether each item is equal to the previous  This can be done manually without flattening   ind   np lexsort a T  a ind np concatenate   True  np any a ind 1     a ind  -1   axis 1        This method does not use tuples  and should be much faster and simpler than other methods given here   NOTE  A previous version of this did not have the ind right after a   which mean that the wrong indices were used  Also  Joe Kington makes a good point that this does make a variety of intermediate copies  The following method makes fewer  by making a sorted copy and then using views of it   b   a np lexsort a T   b np concatenate   True   np any b 1      b  -1  axis 1       This is faster and uses less memory   Also  if you want to find unique rows in an ndarray regardless of how many dimensions are in the array  the following will work   b   a lexsort a reshape  a shape 0  -1   T    b np concatenate   True   np any b 1    b  -1  axis tuple range 1 a ndim         An interesting remaining issue would be if you wanted to sort unique along an arbitrary axis of an arbitrary-dimension array  something that would be more difficult   Edit   To demonstrate the speed differences  I ran a few tests in ipython of the three different methods described in the answers  With your exact a  there isn t too much of a difference  though this version is a bit faster   In  87    timeit unique a view dtype   view   lt i8   10000 loops  best of 3  48 4 us per loop  In  88    timeit ind   np lexsort a T   a np concatenate   True   np any a ind 1      a ind  -1    axis 1     10000 loops  best of 3  37 6 us per loop  In  89    timeit b    tuple row  for row in a   np unique b  10000 loops  best of 3  41 6 us per loop   With a larger a  however  this version ends up being much  much faster   In  96   a   np random randint 0 2 size  10000 6    In  97    timeit unique a view dtype   view   lt i8   10 loops  best of 3  24 4 ms per loop  In  98    timeit b    tuple row  for row in a   np unique b  10 loops  best of 3  28 2 ms per loop  In  99    timeit ind   np lexsort a T   a np concatenate   True  np any a ind 1      a ind  -1   axis 1     100 loops  best of 3  3 25 ms per loop

User · Answer

For general purpose like 3D or higher multidimensional nested arrays  try this   import numpy as np  def unique nested arrays ar       origin shape   ar shape     origin dtype   ar dtype     ar   ar reshape origin shape 0   np prod origin shape 1         ar   np ascontiguousarray ar      unique ar   np unique ar view       origin dtype   np prod origin shape 1          return unique ar view origin dtype  reshape  unique ar shape 0       origin shape 1      which satisfies your 2D dataset   a   np array   1  1  1  0  0  0           0  1  1  1  0  0           0  1  1  1  0  0           1  1  1  0  0  0           1  1  1  1  1  0    unique nested arrays a    gives   array   0  1  1  1  0  0       1  1  1  0  0  0       1  1  1  1  1  0      But also 3D arrays like   b   np array    1  1  1    0  1  1                    0  1  1    1  1  1                    1  1  1    0  1  1                    1  1  1    1  1  1     unique nested arrays b    gives   array    0  1  1    1  1  1         1  1  1    0  1  1         1  1  1    1  1  1

User · Answer

If you want to avoid the memory expense of converting to a series of tuples or another similar data structure  you can exploit numpy s structured arrays   The trick is to view your original array as a structured array where each item corresponds to a row of the original array    This doesn t make a copy  and is quite efficient   As a quick example   import numpy as np  data   np array   1  1  1  0  0  0                     0  1  1  1  0  0                     0  1  1  1  0  0                     1  1  1  0  0  0                     1  1  1  1  1  0     ncols   data shape 1  dtype   data dtype descr   ncols struct   data view dtype   uniq   np unique struct  uniq   uniq view data dtype  reshape -1  ncols  print uniq     To understand what s going on  have a look at the intermediary results   Once we view things as a structured array  each element in the array is a row in your original array    Basically  it s a similar data structure to a list of tuples    In  71   struct Out 71   array    1  1  1  0  0  0             0  1  1  1  0  0             0  1  1  1  0  0             1  1  1  0  0  0             1  1  1  1  1  0           dtype    f0     lt i8      f1     lt i8      f2     lt i8      f3     lt i8      f4     lt i8      f5     lt i8      In  72   struct 0  Out 72   array   1  1  1  0  0  0          dtype    f0     lt i8      f1     lt i8      f2     lt i8      f3     lt i8      f4     lt i8      f5     lt i8       Once we run numpy unique  we ll get a structured array back   In  73   np unique struct  Out 73   array   0  1  1  1  0  0    1  1  1  0  0  0    1  1  1  1  1  0          dtype    f0     lt i8      f1     lt i8      f2     lt i8      f3     lt i8      f4     lt i8      f5     lt i8       That we then need to view as a  normal  array    stores the result of the last calculation in ipython  which is why you re seeing   view       In  74     view data dtype  Out 74   array  0  1  1  1  0  0  1  1  1  0  0  0  1  1  1  1  1  0     And then reshape back into a 2D array  -1 is a placeholder that tells numpy to calculate the correct number of rows  give the number of columns    In  75     reshape -1  ncols  Out 75   array   0  1  1  1  0  0           1  1  1  0  0  0           1  1  1  1  1  0        Obviously  if you wanted to be more concise  you could write it as   import numpy as np  def unique rows data       uniq   np unique data view data dtype descr   data shape 1        return uniq view data dtype  reshape -1  data shape 1    data   np array   1  1  1  0  0  0                     0  1  1  1  0  0                     0  1  1  1  0  0                     1  1  1  0  0  0                     1  1  1  1  1  0    print unique rows data    Which results in     0 1 1 1 0 0    1 1 1 0 0 0    1 1 1 1 1 0

User · Answer

np unique works given a list of tuples    gt  gt  gt  np unique   1  1    2  2    3  3    4  4    2  2    Out 9    array   1  1           2  2           3  3           4  4      With a list of lists it raises a TypeError  unhashable type   list

User · Answer

np unique when I run it on np random random 100  reshape 10 10  returns all the unique individual elements  but you want the unique rows  so first you need to put them into tuples   array    your numpy array of lists new array    tuple row  for row in array  uniques   np unique new array    That is the only way I see you changing the types to do what you want  and I am not sure if the list iteration to change to tuples is okay with your  not looping through

User · Answer

We can actually turn m x n numeric numpy array into m x 1 numpy string array  please try using the following function  it provides count  inverse idx and etc  just like numpy unique   import numpy as np  def uniqueRow a        This function turn m x n numpy array into m x 1 numpy array storing       string  and so the np unique can be used       Input  an m x n numpy array  a       Output unique m  x n numpy array  unique   inverse indx  and counts       s   np chararray  a shape 0  1       s       -       b    a  astype np str       s2   np expand dims b   0  axis 1    s   np expand dims b   1  axis 1       n   a shape 1  - 2          for i in range 0 n            s2   s2   s   np expand dims b   i 2  axis 1       s3  idx  inv   c   np unique s2 return index   True   return inverse   True  return counts   True       return a idx   inv   c   Example   A   np array    3 17   9 502  3 291       9 984  2 773  6 852       1 172  8 885  4 258       9 73   7 518  3 227       8 113  9 563  9 117       9 984  2 773  6 852       9 73   7 518  3 227     B  inv   c   uniqueRow A   Results   B     1 172  8 885  4 258    3 17   9 502  3 291    8 113  9 563  9 117    9 73   7 518  3 227    9 984  2 773  6 852    inv    3 4 1 0 2 4 0   c   2 1 1 1 2

User · Answer

The numpy indexed package  disclaimer  I am its author  wraps the solution posted by Jaime in a nice and tested interface  plus many more features   import numpy indexed as npi new a   npi unique a     unique elements over axis 0  rows  by default

User · Answer

import numpy as np original   np array   1  1  1  0  0  0                         0  1  1  1  0  0                         0  1  1  1  0  0                         1  1  1  0  0  0                         1  1  1  1  1  0      create a view that the subarray as tuple and return unique indeies     unique index   np unique original view original dtype descr   original shape 1                                return index True    get unique set print original unique index

User · Answer

Based on the answer in this page I have written a function that replicates the capability of MATLAB s unique input  rows   function  with the additional feature to accept tolerance for checking the uniqueness  It also returns the indices such that c   data ia    and data   c ic     Please report if you see any discrepancies or errors   def unique rows data  prec 5       import numpy as np     d r   np fix data   10    prec    10    prec   0 0     b   np ascontiguousarray d r  view np dtype  np void  d r dtype itemsize   d r shape 1            ia   np unique b  return index True         ic   np unique b  return inverse True      return np unique b  view d r dtype  reshape -1  d r shape 1    ia  ic

User · Answer

Another option to the use of structured arrays is using a view of a void type that joins the whole row into a single item   a   np array   1  1  1  0  0  0                  0  1  1  1  0  0                  0  1  1  1  0  0                  1  1  1  0  0  0                  1  1  1  1  1  0     b   np ascontiguousarray a  view np dtype  np void  a dtype itemsize   a shape 1        idx   np unique b  return index True   unique a   a idx    gt  gt  gt  unique a array   0  1  1  1  0  0           1  1  1  0  0  0           1  1  1  1  1  0      EDIT Added np ascontiguousarray following  seberg s recommendation  This will slow the method down if the array is not already contiguous   EDIT The above can be slightly sped up  perhaps at the cost of clarity  by doing   unique a   np unique b  view a dtype  reshape -1  a shape 1     Also  at least on my system  performance wise it is on par  or even better  than the lexsort method   a   np random randint 2  size  10000  6     timeit np unique a view np dtype  np void  a dtype itemsize a shape 1      view a dtype  reshape -1  a shape 1   100 loops  best of 3  3 17 ms per loop   timeit ind   np lexsort a T   a np concatenate   True  np any a ind 1     a ind  -1   axis 1     100 loops  best of 3  5 93 ms per loop  a   np random randint 2  size  10000  100     timeit np unique a view np dtype  np void  a dtype itemsize a shape 1      view a dtype  reshape -1  a shape 1   10 loops  best of 3  29 9 ms per loop   timeit ind   np lexsort a T   a np concatenate   True  np any a ind 1     a ind  -1   axis 1     10 loops  best of 3  116 ms per loop

User · Answer

The most straightforward solution is to make the rows a single item by making them strings  Each row then can be  compared as a whole for its uniqueness using numpy  This solution is generalize-able you just need to reshape and transpose your array for other combinations  Here is the solution for the problem provided   import numpy as np  original   np array   1  1  1  0  0  0           0  1  1  1  0  0           0  1  1  1  0  0           1  1  1  0  0  0           1  1  1  1  1  0     uniques  index   np unique  str i  for i in original   return index True  cleaned   original index  print cleaned        Will Give    array   0  1  1  1  0  0            1  1  1  0  0  0            1  1  1  1  1  0      Send my nobel prize in the mail

User · Answer

None of these answers worked for me  I m assuming as my unique rows contained strings and not numbers  However this answer from another thread did work   Source  https   stackoverflow com a 38461043 5402386  You can use  count   and  index   list s methods  coor   np array   10  10    12  9    10  5    12  9    coor tuple    tuple x  for x in coor  unique coor   sorted set coor tuple   key lambda x  coor tuple index x   unique count    coor tuple count x  for x in unique coor  unique index    coor tuple index x  for x in unique coor

User · Answer

Here is another variation for  Greg pythonic answer  np vstack set map tuple  a

User · Answer

Why not use drop duplicates from pandas    gt  gt  gt  timeit pd DataFrame image reshape -1 3   drop duplicates   values 1 loops  best of 3  3 08 s per loop   gt  gt  gt  timeit np vstack  tuple r  for r in image reshape -1 3    1 loops  best of 3  51 s per loop

User · Answer

Lets get the entire numpy matrix as a list  then drop duplicates from this list  and finally return our unique list back into a numpy matrix   matrix as list data tolist    matrix as list    1  1  1  0  0  0    0  1  1  1  0  0    0  1  1  1  0  0    1  1  1  0  0  0    1  1  1  1  1  0    uniq list list   uniq list append matrix as list 0     uniq list append item  for item in matrix as list if item not in uniq list   unique matrix np array uniq list  unique matrix  array   1  1  1  0  0  0           0  1  1  1  0  0           1  1  1  1  1  0

User · Answer

I didn   t like any of these answers because none handle floating-point arrays in a linear algebra or vector space sense  where two rows being    equal    means    within some      The one answer that has a tolerance threshold  https   stackoverflow com a 26867764 500207  took the threshold to be both element-wise and decimal precision  which works for some cases but isn   t as mathematically general as a true vector distance   Here   s my version   from scipy spatial distance import squareform  pdist  def uniqueRows arr  thresh 0 0  metric  euclidean         Returns subset of rows that are unique  in terms of Euclidean distance      distances   squareform pdist arr  metric metric       idxset    tuple np nonzero v  0   for v in distances  lt   thresh      return arr  x 0  for x in idxset      With this  unique columns are super-easy  def uniqueColumns arr   args    kwargs       return uniqueRows arr T   args    kwargs    The public-domain function above uses scipy spatial distance pdist to find the Euclidean  customizable  distance between each pair of rows  Then it compares each each distance to a threshold to find the rows that are within thresh of each other  and returns just one row from each thresh-cluster   As hinted  the distance metric needn   t be Euclidean   pdist can compute sundry distances including cityblock  Manhattan-norm  and cosine  the angle between vectors    If thresh 0  the default   then rows have to be bit-exact to be considered    unique     Other good values for thresh use scaled machine-precision  i e   thresh np spacing 1  1e3

User · Answer

As of NumPy 1 13  one can simply choose the axis for selection of unique values in any N-dim array  To get unique rows  one can do   unique rows   np unique original array  axis 0

User · Answer

I ve compared the suggested alternative for speed and found that  surprisingly  the void view unique solution is even a bit faster than numpy s native unique with the axis argument  If you re looking for speed  you ll want numpy unique      a view numpy dtype  numpy void  a dtype itemsize a shape 1       view a dtype  reshape -1  a shape 1    There is a bug report on GitHub for this  too    Code to reproduce the plot  import numpy import perfplot   def unique void view a       return           numpy unique a view numpy dtype  numpy void  a dtype itemsize   a shape 1               view a dtype           reshape -1  a shape 1           def lexsort a       ind   numpy lexsort a T      return a          ind numpy concatenate   True   numpy any a ind 1       a ind  -1    axis 1             def vstack a       return numpy vstack  tuple row  for row in a     def unique axis a       return numpy unique a  axis 0    perfplot show      setup lambda n  numpy random randint 2  size  n  20        kernels  unique void view  lexsort  vstack  unique axis       n range  2    k for k in range 15        xlabel  quot len a  quot       equality check None

User · Answer

Beyond  Jaime excellent answer  another way to collapse a row is to uses a strides 0   assuming a is C-contiguous  which is equal to a dtype itemsize a shape 0   Furthermore void n  is a shortcut for dtype  void n    we arrive finally to this shortest version       a unique a view void a strides 0    1  1     For    0 1 1 1 0 0    1 1 1 0 0 0    1 1 1 1 1 0

User · Answer

Yet another possible solution   np vstack  tuple row  for row in a

[python] Find unique rows in numpy.array

Examples related to python

Examples related to arrays

Examples related to numpy

Examples related to unique