How to "select distinct" across multiple data frame columns in pandas?

Question

I'm looking for a way to do the equivalent of the SQL

    SELECT DISTINCT col1, col2 FROM dataframe_table

The pandas SQL comparison doesn't have anything about distinct. .unique() only works for a single column, so I suppose I could concat the columns, or put them in a list/tuple and compare that way, but this seems like something pandas should do in a more native way.

Am I missing something obvious, or is there no way to do this?

User · Answer

I've tried different solutions. The first was:

    a_df = np.unique(df[['col1', 'col2']], axis=0)

and it works well for non-object data. Another way to do this, which avoids the error for object-type columns, is to apply drop_duplicates():

    a_df = df.drop_duplicates(['col1', 'col2'])[['col1', 'col2']]

You can also use SQL to do this, but it was very slow in my case:

    from pandasql import sqldf
    q = """SELECT DISTINCT col1, col2 FROM df;"""
    pysqldf = lambda q: sqldf(q, globals())
    a_df = pysqldf(q)
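As a minimal sketch of the drop_duplicates() approach described above: the sample data and the extra column `other` are made up for illustration; only `col1` and `col2` come from the question. Selecting the two columns before dropping duplicates gives the same distinct pairs and works for object (string) columns too.

```python
import pandas as pd

# Hypothetical sample data; col1/col2 follow the names used in the question.
df = pd.DataFrame({
    "col1": ["x", "x", "y", "y", "x"],
    "col2": [1, 1, 2, 3, 1],
    "other": [10, 20, 30, 40, 50],
})

# Keep only the columns of interest, then drop repeated (col1, col2) pairs.
distinct_pairs = df[["col1", "col2"]].drop_duplicates().reset_index(drop=True)

# Plain tuples are convenient for inspection or comparison.
pairs = list(distinct_pairs.itertuples(index=False, name=None))
print(pairs)  # [('x', 1), ('y', 2), ('y', 3)]
```

The reset_index(drop=True) call is optional; without it the result keeps the original row labels of the first occurrence of each pair.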

User · Answer

There is no unique method for a df. If the number of unique values for each column were the same, then the following would work: df.apply(pd.Series.unique), but if not then you will get an error. Another approach would be to store the values in a dict which is keyed on the column name:

    In [111]: df = pd.DataFrame({'a': [0, 1, 2, 2, 4], 'b': [1, 1, 1, 2, 2]})
              d = {}
              for col in df:
                  d[col] = df[col].unique()
              d
    Out[111]: {'a': array([0, 1, 2, 4], dtype=int64), 'b': array([1, 2], dtype=int64)}
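The dict-per-column idea above can also be written as a dict comprehension. This is only a sketch using the same sample frame; note that it collects the unique values of each column separately, which is not the same thing as the distinct (a, b) row combinations asked about in the question.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2, 2, 4], "b": [1, 1, 1, 2, 2]})

# One entry per column; .tolist() turns each NumPy array into a plain list.
unique_per_column = {col: df[col].unique().tolist() for col in df.columns}
print(unique_per_column)  # {'a': [0, 1, 2, 4], 'b': [1, 2]}
```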

User · Answer

You can take the sets of the columns and just subtract the smaller set from the larger set:

    distinct_values = set(df['a']) - set(df['b'])
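To make clear what the set subtraction above computes, here is a small sketch on a made-up frame (the data is an assumption for illustration). The difference yields the values of one column that never occur in the other, whereas a union yields every distinct value seen in either column; neither is the same as the distinct (a, b) pairs of a SELECT DISTINCT.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"a": [0, 1, 2, 2, 4], "b": [1, 1, 1, 2, 2]})

# Values that appear in column a but never in column b.
only_in_a = set(df["a"]) - set(df["b"])

# Every distinct value that appears in either column.
in_either = set(df["a"]) | set(df["b"])

print(only_in_a)  # {0, 4}
print(in_either)  # {0, 1, 2, 4}
```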

User · Answer

To solve a similar problem, I'm using groupby:

    print(f"Distinct entries: {len(df.groupby(['col1', 'col2']))}")

Whether that's appropriate will depend on what you want to do with the result, though (in my case, I just wanted the equivalent of COUNT DISTINCT as shown).
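A rough sketch of the COUNT DISTINCT idea above, on made-up data (only the column names come from the question). It shows three ways of counting distinct (col1, col2) pairs that agree when the key columns contain no missing values; with NaN keys the groupby-based counts may differ, since groupby excludes NaN keys by default.

```python
import pandas as pd

# Hypothetical sample data without missing values.
df = pd.DataFrame({
    "col1": ["x", "x", "y", "y", "x"],
    "col2": [1, 1, 2, 3, 1],
})

# Number of groups, as in the answer above.
count_len = len(df.groupby(["col1", "col2"]))

# The same count via the ngroups attribute of the GroupBy object.
count_ngroups = df.groupby(["col1", "col2"]).ngroups

# The same count via drop_duplicates on the selected columns.
count_dedup = len(df[["col1", "col2"]].drop_duplicates())

print(f"Distinct entries: {count_len}")  # Distinct entries: 3
```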

User · Answer

I think drop_duplicates is sometimes not so useful, depending on the dataframe.

I found this:

    In:  df['col_1'].unique()
    Out: array(['A', 'B', 'C'], dtype=object)

And it works for me.

https://riptutorial.com/pandas/example/26077/select-distinct-rows-across-dataframe

User · Answer

You can use the drop_duplicates method to get the unique rows in a DataFrame:

    In [29]: df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': [3, 4, 3, 5]})

    In [30]: df
    Out[30]:
       a  b
    0  1  3
    1  2  4
    2  1  3
    3  2  5

    In [32]: df.drop_duplicates()
    Out[32]:
       a  b
    0  1  3
    1  2  4
    3  2  5

You can also provide the subset keyword argument if you only want to use certain columns to determine uniqueness. See the docstring.
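The subset keyword mentioned above can be sketched with the same frame. Here uniqueness is judged on column a alone, and the keep argument (another documented parameter of drop_duplicates) selects which of the duplicate rows survives.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 2], "b": [3, 4, 3, 5]})

# Uniqueness decided by column a only; the first row for each value is kept.
first_per_a = df.drop_duplicates(subset=["a"])

# Same rule, but keep the last row for each distinct value of a.
last_per_a = df.drop_duplicates(subset=["a"], keep="last")

print(first_per_a.index.tolist())  # [0, 1]
print(last_per_a.index.tolist())   # [2, 3]
```

All columns are still returned; subset only controls which columns are compared when deciding whether two rows are duplicates.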
