How to select the first row of each group

Question

I have a DataFrame generated as follows:

    df.groupBy($"Hour", $"Category")
      .agg(sum($"value") as "TotalValue")
      .sort($"Hour".asc, $"TotalValue".desc)

The results look like:

    +----+--------+----------+
    |Hour|Category|TotalValue|
    +----+--------+----------+
    |   0|   cat26|      30.9|
    |   0|   cat13|      22.1|
    |   0|   cat95|      19.6|
    |   0|  cat105|       1.3|
    |   1|   cat67|      28.5|
    |   1|    cat4|      26.8|
    |   1|   cat13|      12.6|
    |   1|   cat23|       5.3|
    |   2|   cat56|      39.6|
    |   2|   cat40|      29.7|
    |   2|  cat187|      27.9|
    |   2|   cat68|       9.8|
    |   3|    cat8|      35.6|
    | ...|    ....|      ....|
    +----+--------+----------+

As you can see, the DataFrame is ordered by Hour in increasing order, then by TotalValue in descending order.

I would like to select the top row of each group, i.e.

- from the group of Hour == 0 select (0, cat26, 30.9)
- from the group of Hour == 1 select (1, cat67, 28.5)
- from the group of Hour == 2 select (2, cat56, 39.6)
- and so on

So the desired output would be:

    +----+--------+----------+
    |Hour|Category|TotalValue|
    +----+--------+----------+
    |   0|   cat26|      30.9|
    |   1|   cat67|      28.5|
    |   2|   cat56|      39.6|
    |   3|    cat8|      35.6|
    | ...|     ...|       ...|
    +----+--------+----------+

It might be handy to be able to select the top N rows of each group as well.

Any help is highly appreciated.

User · Answer

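For Spark 2.0.2 with grouping by multiple columns:

    import org.apache.spark.sql.functions.row_number
    import org.apache.spark.sql.expressions.Window

    // Number the rows of each (col1, col2, col3) group, newest timestamp first
    val w = Window.partitionBy($"col1", $"col2", $"col3").orderBy($"timestamp".desc)

    // Keep only the first row of each group, then drop the helper column
    val refined_df = df.withColumn("rn", row_number().over(w)).where($"rn" === 1).drop("rn")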
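The question also asks about the top N rows per group. A minimal sketch of that logic follows, written against plain Scala collections rather than Spark so that it runs without a cluster. The `Rec` case class and the `topN` helper are hypothetical names introduced here for illustration. With the window approach above, the corresponding change would presumably be to keep rows where `rn` is at most N instead of exactly 1.

```scala
// Plain Scala collections only: a model of "top N per group", not Spark code.
object TopNPerGroup {
  // Hypothetical row type mirroring the question's columns.
  final case class Rec(hour: Int, category: String, totalValue: Double)

  val rows: Seq[Rec] = Seq(
    Rec(0, "cat26", 30.9), Rec(0, "cat13", 22.1), Rec(0, "cat95", 19.6), Rec(0, "cat105", 1.3),
    Rec(1, "cat67", 28.5), Rec(1, "cat4", 26.8), Rec(1, "cat13", 12.6), Rec(1, "cat23", 5.3),
    Rec(2, "cat56", 39.6), Rec(2, "cat40", 29.7), Rec(2, "cat187", 27.9), Rec(2, "cat68", 9.8),
    Rec(3, "cat8", 35.6)
  )

  // Partition by hour, order each partition by totalValue descending,
  // and keep the first n rows of each partition.
  def topN(data: Seq[Rec], n: Int): Seq[Rec] =
    data
      .groupBy(_.hour)
      .toSeq
      .sortBy { case (hour, _) => hour }
      .flatMap { case (_, group) =>
        group.sortWith(_.totalValue > _.totalValue).take(n)
      }

  def main(args: Array[String]): Unit =
    topN(rows, 2).foreach(println)
}
```

Calling `topN(rows, 1)` reproduces the "first row of each group" case from the question.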

User · Answer

This is exactly the same as zero323's answer, but expressed as SQL queries.

Assuming that the DataFrame is created and registered as:

    df.createOrReplaceTempView("table")
    //+----+--------+----------+
    //|Hour|Category|TotalValue|
    //+----+--------+----------+
    //|0   |cat26   |30.9      |
    //|0   |cat13   |22.1      |
    //|0   |cat95   |19.6      |
    //|0   |cat105  |1.3       |
    //|1   |cat67   |28.5      |
    //|1   |cat4    |26.8      |
    //|1   |cat13   |12.6      |
    //|1   |cat23   |5.3       |
    //|2   |cat56   |39.6      |
    //|2   |cat40   |29.7      |
    //|2   |cat187  |27.9      |
    //|2   |cat68   |9.8       |
    //|3   |cat8    |35.6      |
    //+----+--------+----------+

Window function:

    sqlContext.sql("""
      select Hour, Category, TotalValue
      from (select *, row_number() OVER (PARTITION BY Hour ORDER BY TotalValue DESC) as rn FROM table) tmp
      where rn = 1
    """).show(false)
    //+----+--------+----------+
    //|Hour|Category|TotalValue|
    //+----+--------+----------+
    //|1   |cat67   |28.5      |
    //|3   |cat8    |35.6      |
    //|2   |cat56   |39.6      |
    //|0   |cat26   |30.9      |
    //+----+--------+----------+

Plain SQL aggregation followed by join:

    sqlContext.sql("""
      select Hour, first(Category) as Category, first(TotalValue) as TotalValue from
        (select Hour, Category, TotalValue from table tmp1
         join
         (select Hour as max_hour, max(TotalValue) as max_value from table group by Hour) tmp2
         on
         tmp1.Hour = tmp2.max_hour and tmp1.TotalValue = tmp2.max_value) tmp3
      group by tmp3.Hour
    """).show(false)
    //+----+--------+----------+
    //|Hour|Category|TotalValue|
    //+----+--------+----------+
    //|1   |cat67   |28.5      |
    //|3   |cat8    |35.6      |
    //|2   |cat56   |39.6      |
    //|0   |cat26   |30.9      |
    //+----+--------+----------+

Using ordering over structs:

    sqlContext.sql("""
      select Hour, vs.Category, vs.TotalValue
      from (select Hour, max(struct(TotalValue, Category)) as vs from table group by Hour)
    """).show(false)
    //+----+--------+----------+
    //|Hour|Category|TotalValue|
    //+----+--------+----------+
    //|1   |cat67   |28.5      |
    //|3   |cat8    |35.6      |
    //|2   |cat56   |39.6      |
    //|0   |cat26   |30.9      |
    //+----+--------+----------+

The Dataset approach and the "don't do this" warnings are the same as in the original answer.

User · Answer

We can use the rank() window function and keep the rows where rank = 1. rank assigns a number to every row within a group (in this case the group would be the hour). Note that tied rows receive the same rank, so filtering on rank = 1 can return more than one row per group; row_number always numbers rows uniquely.

Here's an example (from https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-sql-functions.adoc#rank):

    import org.apache.spark.sql.functions.rank
    import org.apache.spark.sql.expressions.Window

    val dataset = spark.range(9).withColumn("bucket", 'id % 3)
    val byBucket = Window.partitionBy('bucket).orderBy('id)

    scala> dataset.withColumn("rank", rank over byBucket).show
    +---+------+----+
    | id|bucket|rank|
    +---+------+----+
    |  0|     0|   1|
    |  3|     0|   2|
    |  6|     0|   3|
    |  1|     1|   1|
    |  4|     1|   2|
    |  7|     1|   3|
    |  2|     2|   1|
    |  5|     2|   2|
    |  8|     2|   3|
    +---+------+----+
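How rank and row_number treat ties matters when picking "the first row" of a group. The sketch below models both numbering schemes on plain Scala collections; it is not Spark code. `rowNumbers` and `ranks` are hypothetical helper names, and the input is assumed to be one group already sorted in descending order.

```scala
// Plain Scala model of row_number-style vs rank-style numbering within one group.
object RankVsRowNumber {
  // Values of one group, already sorted descending; 28.5 appears twice (a tie).
  val sortedDesc: Seq[Double] = Seq(28.5, 28.5, 12.6)

  // row_number-style: consecutive numbers 1, 2, 3, ... regardless of ties.
  def rowNumbers(values: Seq[Double]): Seq[Int] =
    values.indices.map(_ + 1)

  // rank-style: tied values share the position of their first occurrence,
  // so the number after a tie skips ahead.
  def ranks(values: Seq[Double]): Seq[Int] =
    values.map(v => values.indexOf(v) + 1)

  def main(args: Array[String]): Unit = {
    println(rowNumbers(sortedDesc))
    println(ranks(sortedDesc))
  }
}
```

Filtering on the first scheme equal to 1 keeps exactly one element, while filtering on the second keeps both tied elements.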

User · Answer

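The solution below does only one groupBy and extracts the rows of your DataFrame that contain the maximum value in one shot. No need for further joins or windows.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.catalyst.encoders.RowEncoder
    import org.apache.spark.sql.DataFrame
    import spark.implicits._ // encoder for the Int key (already in scope in spark-shell)

    // df is the DataFrame with Day, Category, TotalValue

    // Encoder so that mapGroups can return plain Rows with df's schema
    implicit val dfEnc = RowEncoder(df.schema)

    // Group by the first column, then keep the row with the largest TotalValue per group
    val res: DataFrame = df
      .groupByKey { r => r.getInt(0) }
      .mapGroups[Row] { (day: Int, rows: Iterator[Row]) => rows.maxBy { r => r.getDouble(2) } }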

User · Answer

Window functions

Something like this should do the trick:

    import org.apache.spark.sql.functions.{row_number, max, first, struct, broadcast}
    import org.apache.spark.sql.expressions.Window

    val df = sc.parallelize(Seq(
      (0, "cat26", 30.9), (0, "cat13", 22.1), (0, "cat95", 19.6), (0, "cat105", 1.3),
      (1, "cat67", 28.5), (1, "cat4", 26.8), (1, "cat13", 12.6), (1, "cat23", 5.3),
      (2, "cat56", 39.6), (2, "cat40", 29.7), (2, "cat187", 27.9), (2, "cat68", 9.8),
      (3, "cat8", 35.6))).toDF("Hour", "Category", "TotalValue")

    val w = Window.partitionBy($"hour").orderBy($"TotalValue".desc)

    val dfTop = df.withColumn("rn", row_number().over(w)).where($"rn" === 1).drop("rn")

    dfTop.show
    // +----+--------+----------+
    // |Hour|Category|TotalValue|
    // +----+--------+----------+
    // |   0|   cat26|      30.9|
    // |   1|   cat67|      28.5|
    // |   2|   cat56|      39.6|
    // |   3|    cat8|      35.6|
    // +----+--------+----------+

This method will be inefficient in case of significant data skew.

Plain SQL aggregation followed by join

Alternatively you can join with an aggregated data frame:

    val dfMax = df.groupBy($"hour".as("max_hour")).agg(max($"TotalValue").as("max_value"))

    val dfTopByJoin = df.join(broadcast(dfMax),
        ($"hour" === $"max_hour") && ($"TotalValue" === $"max_value"))
      .drop("max_hour")
      .drop("max_value")

    dfTopByJoin.show
    // +----+--------+----------+
    // |Hour|Category|TotalValue|
    // +----+--------+----------+
    // |   0|   cat26|      30.9|
    // |   1|   cat67|      28.5|
    // |   2|   cat56|      39.6|
    // |   3|    cat8|      35.6|
    // +----+--------+----------+

It will keep duplicate values (if there is more than one category per hour with the same total value). You can remove these as follows:

    dfTopByJoin
      .groupBy($"hour")
      .agg(
        first("category").alias("category"),
        first("TotalValue").alias("TotalValue"))

Using ordering over structs

A neat, although not very well tested, trick which doesn't require joins or window functions:

    val dfTop = df.select($"Hour", struct($"TotalValue", $"Category").alias("vs"))
      .groupBy($"hour")
      .agg(max("vs").alias("vs"))
      .select($"Hour", $"vs.Category", $"vs.TotalValue")

    dfTop.show
    // +----+--------+----------+
    // |Hour|Category|TotalValue|
    // +----+--------+----------+
    // |   0|   cat26|      30.9|
    // |   1|   cat67|      28.5|
    // |   2|   cat56|      39.6|
    // |   3|    cat8|      35.6|
    // +----+--------+----------+

With DataSet API (Spark 1.6+, 2.0+)

Spark 1.6:

    case class Record(Hour: Integer, Category: String, TotalValue: Double)

    df.as[Record]
      .groupBy($"hour")
      .reduce((x, y) => if (x.TotalValue > y.TotalValue) x else y)
      .show
    // +---+--------------+
    // | _1|            _2|
    // +---+--------------+
    // |[0]|[0,cat26,30.9]|
    // |[1]|[1,cat67,28.5]|
    // |[2]|[2,cat56,39.6]|
    // |[3]| [3,cat8,35.6]|
    // +---+--------------+

Spark 2.0 or later:

    df.as[Record]
      .groupByKey(_.Hour)
      .reduceGroups((x, y) => if (x.TotalValue > y.TotalValue) x else y)

The last two methods can leverage map-side combine and don't require a full shuffle, so most of the time they should exhibit better performance compared to window functions and joins. They can also be used with Structured Streaming in complete output mode.

Don't use:

    df.orderBy(...).groupBy(...).agg(first(...), ...)

It may seem to work (especially in local mode), but it is unreliable (see SPARK-16207, credits to Tzach Zohar for linking the relevant JIRA issue, and SPARK-30335).

The same note applies to

    df.orderBy(...).dropDuplicates(...)

which internally uses an equivalent execution plan.

User · Answer

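You can do it like this. Note that first returns whichever row Spark encounters first in each group, so this relies on the DataFrame's sort order surviving the groupBy, which is not guaranteed.

    import org.apache.spark.sql.functions.first

    val data = df.groupBy("Hour")
      .agg(
        first("Hour").as("_1"),
        first("Category").as("Category"),
        first("TotalValue").as("TotalValue"))
      .drop("Hour")

    data.withColumnRenamed("_1", "Hour").show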

User · Answer

The pattern is: group by keys => do something to each group (e.g. reduce) => return to a DataFrame.

I thought the DataFrame abstraction was a bit cumbersome in this case, so I used RDD functionality:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row

    // reduceFunction is a user-supplied (Row, Row) => Row that picks the row to keep
    val rdd: RDD[Row] = originalDf
      .rdd
      .groupBy(row => row.getAs[String]("grouping_row"))
      .map(iterableTuple => {
        iterableTuple._2.reduce(reduceFunction)
      })

    val productDf = sqlContext.createDataFrame(rdd, originalDf.schema)

User · Answer

A nice way of doing this with the DataFrame API is to use argmax logic, like so:

    import org.apache.spark.sql.functions.{max, struct}

    val df = Seq(
      (0, "cat26", 30.9), (0, "cat13", 22.1), (0, "cat95", 19.6), (0, "cat105", 1.3),
      (1, "cat67", 28.5), (1, "cat4", 26.8), (1, "cat13", 12.6), (1, "cat23", 5.3),
      (2, "cat56", 39.6), (2, "cat40", 29.7), (2, "cat187", 27.9), (2, "cat68", 9.8),
      (3, "cat8", 35.6)).toDF("Hour", "Category", "TotalValue")

    df.groupBy($"Hour")
      .agg(max(struct($"TotalValue", $"Category")).as("argmax"))
      .select($"Hour", $"argmax.*").show

    +----+----------+--------+
    |Hour|TotalValue|Category|
    +----+----------+--------+
    |   1|      28.5|   cat67|
    |   3|      35.6|    cat8|
    |   2|      39.6|   cat56|
    |   0|      30.9|   cat26|
    +----+----------+--------+
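The argmax trick works because structs are compared field by field, so putting TotalValue first makes it the primary sort key and Category the tie-breaker. The sketch below models that comparison with plain Scala tuples; it is not Spark code. The `greater` and `argmaxByHour` helpers are hypothetical names, and the sample data deliberately contains a tie that is not in the original question.

```scala
// Plain Scala model of "max over a (value, label) struct" per group.
object StructArgmax {
  // (hour, category, totalValue); hour 1 has a tie on totalValue (illustrative data).
  val rows: Seq[(Int, String, Double)] = Seq(
    (0, "cat26", 30.9), (0, "cat13", 22.1),
    (1, "cat67", 28.5), (1, "cat4", 28.5)
  )

  // Field-by-field comparison: first the value, then the category as a tie-breaker.
  def greater(a: (Double, String), b: (Double, String)): (Double, String) =
    if (a._1 > b._1 || (a._1 == b._1 && a._2.compareTo(b._2) > 0)) a else b

  // For each hour, reduce the (totalValue, category) pairs to their maximum.
  def argmaxByHour(data: Seq[(Int, String, Double)]): Map[Int, (Double, String)] =
    data
      .groupBy(_._1)
      .map { case (hour, group) =>
        hour -> group.map { case (_, cat, v) => (v, cat) }.reduce(greater)
      }

  def main(args: Array[String]): Unit =
    argmaxByHour(rows).toSeq.sortBy(_._1).foreach(println)
}
```

Because the category takes part in the comparison, a tie on the value still yields exactly one row per group, chosen by string order rather than arbitrarily.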
