pandas unique values multiple columns

Question

    df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                       'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve'],
                       'Col3': np.random.random(5)})

What is the best way to return the unique values of 'Col1' and 'Col2'?

The desired output is

    'Bob', 'Joe', 'Bill', 'Mary', 'Steve'

User · Accepted Answer

pd.unique returns the unique values from an input array, or DataFrame column or index.

The input to this function needs to be one-dimensional, so multiple columns will need to be combined. The simplest way is to select the columns you want and then view the values in a flattened NumPy array. The whole operation looks like this:

    >>> pd.unique(df[['Col1', 'Col2']].values.ravel('K'))
    array(['Bob', 'Joe', 'Bill', 'Mary', 'Steve'], dtype=object)

Note that ravel() is an array method that returns a view (if possible) of a multidimensional array. The argument 'K' tells the method to flatten the array in the order the elements are stored in memory (pandas typically stores underlying arrays in Fortran-contiguous order; columns before rows). This can be significantly faster than using the method's default 'C' order.

An alternative way is to select the columns and pass them to np.unique:

    >>> np.unique(df[['Col1', 'Col2']].values)
    array(['Bill', 'Bob', 'Joe', 'Mary', 'Steve'], dtype=object)

There is no need to use ravel() here as the method handles multidimensional arrays. Even so, this is likely to be slower than pd.unique as it uses a sort-based algorithm rather than a hashtable to identify unique values.

The difference in speed is significant for larger DataFrames (especially if there are only a handful of unique values):

    >>> df1 = pd.concat([df]*100000, ignore_index=True)  # DataFrame with 500000 rows
    >>> %timeit np.unique(df1[['Col1', 'Col2']].values)
    1 loop, best of 3: 1.12 s per loop

    >>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel('K'))
    10 loops, best of 3: 38.9 ms per loop

    >>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel())  # ravel using C order
    10 loops, best of 3: 49.9 ms per loop
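For anyone who wants to paste this into a script rather than a REPL, here is a minimal, self-contained sketch of the flatten-then-pd.unique approach. The helper name unique_across_columns is made up for illustration, and .to_numpy() is used in place of .values on the assumption that a reasonably recent pandas is installed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Col1": ["Bob", "Joe", "Bill", "Mary", "Joe"],
    "Col2": ["Joe", "Steve", "Bob", "Bob", "Steve"],
    "Col3": np.random.random(5),
})


def unique_across_columns(frame, columns):
    """Return the distinct values found in `columns` (hypothetical helper)."""
    # Select the columns, take the underlying 2-D array, and flatten it
    # in memory order ('K') so pd.unique receives a 1-D input.
    flat = frame[columns].to_numpy().ravel("K")
    return pd.unique(flat)


names = unique_across_columns(df, ["Col1", "Col2"])
print(names)
```

The order of the result follows the order in which values are first seen in the flattened array, so it can depend on the memory layout; treat it as unordered if that matters.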

User · Answer

Non-pandas solution: using set().

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                       'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve'],
                       'Col3': np.random.random(5)})

    print(df)

    # Stack the two columns into one Series, then deduplicate with set()
    print(set(pd.concat([df.Col1, df.Col2]).values))

Output:

       Col1   Col2      Col3
    0   Bob    Joe  0.201079
    1   Joe  Steve  0.703279
    2  Bill    Bob  0.722724
    3  Mary    Bob  0.093912
    4   Joe  Steve  0.766027
    {'Steve', 'Bob', 'Bill', 'Joe', 'Mary'}

User · Answer

Here's another way:

    import numpy as np
    set(np.concatenate(df.values))
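One thing to watch with the snippet above: df.values covers every column, so on the question's DataFrame the floats from Col3 end up in the set as well. A small sketch of restricting it to the two name columns first (the variable names are only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Col1": ["Bob", "Joe", "Bill", "Mary", "Joe"],
    "Col2": ["Joe", "Steve", "Bob", "Bob", "Steve"],
    "Col3": np.random.random(5),
})

# Whole frame: the rows are concatenated, so Col3's floats are included too.
all_values = set(np.concatenate(df.values))

# Only the columns of interest: select them before taking .values.
name_values = set(np.concatenate(df[["Col1", "Col2"]].values))

print(name_values)
```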

User · Answer

I have set up a DataFrame with a few simple strings in its columns:

    >>> df
       a  b
    0  a  g
    1  b  h
    2  d  a
    3  e  e

You can concatenate the columns you are interested in and call the unique function:

    >>> pandas.concat([df['a'], df['b']]).unique()
    array(['a', 'b', 'd', 'e', 'g', 'h'], dtype=object)
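The same idea extends to any number of columns. A rough sketch, assuming the small a/b DataFrame from this answer; unique_via_concat is a hypothetical helper name, not a pandas function:

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["a", "b", "d", "e"],
    "b": ["g", "h", "a", "e"],
})


def unique_via_concat(frame, columns):
    """Concatenate the chosen columns end to end, then deduplicate."""
    # ignore_index avoids carrying duplicate index labels into the result.
    combined = pd.concat([frame[c] for c in columns], ignore_index=True)
    # Series.unique keeps values in order of first appearance.
    return combined.unique()


result = unique_via_concat(df, ["a", "b"])
print(list(result))
```

Because the columns are laid end to end, values from the first listed column appear before new values from later columns.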

User · Answer

For those of us that love all things pandas, apply, and of course lambda functions:

    df['Col3'] = df[['Col1', 'Col2']].apply(lambda x: ''.join(x), axis=1)

User · Answer

    list(set(df[['Col1', 'Col2']].to_numpy().reshape((1, -1)).tolist()[0]))

The output will be

    ['Mary', 'Joe', 'Steve', 'Bob', 'Bill']

User · Answer

An updated solution using numpy v1.13+ requires specifying the axis in np.unique if using multiple columns, otherwise the array is implicitly flattened.

    import numpy as np

    np.unique(df[['col1', 'col2']], axis=0)

This change was introduced Nov 2016: https://github.com/numpy/numpy/commit/1f764dbff7c496d6636dc0430f083ada9ff4e4be
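To make the difference concrete, here is a hedged sketch contrasting the two calls on the question's data. Without axis the array is flattened and you get distinct values; with axis=0 you get distinct rows, i.e. distinct (Col1, Col2) pairs. The cast to a fixed-width string dtype is an assumption of this sketch: as far as I know, np.unique with axis does not accept object arrays.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Col1": ["Bob", "Joe", "Bill", "Mary", "Joe"],
    "Col2": ["Joe", "Steve", "Bob", "Bob", "Steve"],
})

# Assumption: convert the object array to a NumPy string dtype so that
# the axis argument can be used.
arr = df[["Col1", "Col2"]].to_numpy().astype(str)

# No axis: the 2-D array is flattened, giving the distinct names (sorted).
unique_values = np.unique(arr)

# axis=0: whole rows are compared, giving the distinct (Col1, Col2) pairs.
unique_rows = np.unique(arr, axis=0)

print(unique_values)
print(unique_rows)
```

So for the original question (distinct names across both columns), the flattened form is the one that matches the desired output.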

User · Answer

    In [5]: set(df.Col1).union(set(df.Col2))
    Out[5]: {'Bill', 'Bob', 'Joe', 'Mary', 'Steve'}

Or:

    set(df.Col1) | set(df.Col2)
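If there are more than two columns, the union can be written once for a whole list of them. A minimal sketch; unique_via_sets is a made-up helper name:

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["Bob", "Joe", "Bill", "Mary", "Joe"],
    "Col2": ["Joe", "Steve", "Bob", "Bob", "Steve"],
})


def unique_via_sets(frame, columns):
    """Union the values of several columns into one set."""
    # set.union accepts any number of iterables; iterating a Series
    # yields its values.
    return set().union(*(frame[c] for c in columns))


names = unique_via_sets(df, ["Col1", "Col2"])
print(names)
```

As with any set, the result has no guaranteed order.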
