show distinct column values in pyspark dataframe python

Question

Please suggest a PySpark dataframe alternative for Pandas df['col'].unique().

I want to list out all the unique values in a PySpark dataframe column. Not the SQL type way (registertemplate then SQL query for distinct values).

Also, I don't need groupby->countDistinct; instead I want to check the distinct VALUES in that column.

User · Answer

In addition to the dropDuplicates option, there is the method named as we know it in pandas: drop_duplicates. drop_duplicates() is an alias for dropDuplicates().

Example:

    s_df = sqlContext.createDataFrame([("foo", 1),
                                       ("foo", 1),
                                       ("bar", 2),
                                       ("foo", 3)], ('k', 'v'))
    s_df.show()

    +---+---+
    |  k|  v|
    +---+---+
    |foo|  1|
    |foo|  1|
    |bar|  2|
    |foo|  3|
    +---+---+

Drop by subset:

    s_df.drop_duplicates(subset=['k']).show()

    +---+---+
    |  k|  v|
    +---+---+
    |bar|  2|
    |foo|  1|
    +---+---+

Drop fully duplicated rows:

    s_df.drop_duplicates().show()

    +---+---+
    |  k|  v|
    +---+---+
    |bar|  2|
    |foo|  3|
    |foo|  1|
    +---+---+

User · Answer

If you want to select the data of ALL columns as distinct rows from a DataFrame (df), then:

    df.select('*').distinct().show(10, truncate=False)

User · Answer

You can use df.dropDuplicates(['col1', 'col2']) to get only distinct rows based on the columns (colX) listed in the array.

User · Answer

collect_set can help to get the unique values from a given column of a pyspark.sql.DataFrame:

    from pyspark.sql import functions as F

    df.select(F.collect_set("column").alias("column")).first()["column"]

User · Answer

You could do:

    distinct_column = 'somecol'

    distinct_column_vals = df.select(distinct_column).distinct().collect()
    distinct_column_vals = [v[distinct_column] for v in distinct_column_vals]

User · Answer

If you want to see the distinct values of a specific column in your dataframe, you would just need to write:

    df.select('colname').distinct().show(100, False)

This would show up to 100 distinct values (if that many are available) for the colname column in the df dataframe, without truncating them.

If you want to do something fancy with the distinct values, you can save them in a variable:

    a = df.select('colname').distinct()

Here, a is a DataFrame holding all the distinct values of the column colname.

User · Answer

This should help to get the distinct values of a column:

    df.select('column1').distinct().collect()

Note that .collect() doesn't have any built-in limit on how many values it can return, so this might be slow. Use .show() instead, or add .limit(20) before .collect() to manage this.
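As a minimal sketch of the limit-before-collect advice above: the helper name distinct_values and its limit parameter are made up for illustration and are not part of the PySpark API. It assumes df is a pyspark.sql.DataFrame and uses only select, distinct, limit and collect.

```python
def distinct_values(df, column, limit=None):
    """Return the distinct values of `column` as a plain Python list.

    Hypothetical helper: `df` is assumed to be a pyspark.sql.DataFrame and
    `limit` is an optional cap on how many rows are brought to the driver.
    """
    query = df.select(column).distinct()
    if limit is not None:
        # Cap the result on the Spark side, before anything is collected.
        query = query.limit(limit)
    # collect() returns Row objects; row[0] is the single selected column.
    return [row[0] for row in query.collect()]
```

Calling distinct_values(df, 'column1', limit=20) would then return at most 20 values. Which 20 come back is not guaranteed, since distinct() does not define an order.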

User · Answer

Run this first:

    df.createOrReplaceTempView('df')

Then run:

    spark.sql("""
        SELECT distinct
            column_name
        FROM
            df
        """).show()

User · Answer

Let's assume we're working with the following representation of data (two columns, k and v, where k contains three entries, two unique):

    +---+---+
    |  k|  v|
    +---+---+
    |foo|  1|
    |bar|  2|
    |foo|  3|
    +---+---+

With a Pandas dataframe:

    import pandas as pd
    p_df = pd.DataFrame([("foo", 1), ("bar", 2), ("foo", 3)], columns=("k", "v"))
    p_df['k'].unique()

This returns an ndarray, i.e. array(['foo', 'bar'], dtype=object).

You asked for a "pyspark dataframe alternative for pandas df['col'].unique()". Now, given the following Spark dataframe:

    s_df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("foo", 3)], ('k', 'v'))

If you want the same result from Spark, i.e. an ndarray, use toPandas():

    s_df.toPandas()['k'].unique()

Alternatively, if you don't need an ndarray specifically and just want a list of the unique values of column k:

    s_df.select('k').distinct().rdd.map(lambda r: r[0]).collect()

Finally, you can also use a list comprehension as follows:

    [i.k for i in s_df.select('k').distinct().collect()]
