DataFrame: how to groupBy/count then filter on count in Scala

Spark 1.4.1

I ran into a situation where grouping a DataFrame, counting, and then filtering on the resulting 'count' column raises the exception below.

import sqlContext.implicits._
import org.apache.spark.sql._

case class Paf(x: Int)

// Sample data: the value 2 appears twice, 1 once.
val myData = Seq(Paf(2), Paf(1), Paf(2))
val df = sc.parallelize(myData, 2).toDF()

Then grouping and filtering:

df.groupBy("x").count()
  .filter("count >= 2")
  .show()

Throws an exception:

java.lang.RuntimeException: [1.7] failure: ``('' expected but `>=' found count >= 2

Solution:

Renaming the column makes the problem vanish (I suspect because there is no longer a conflict with the SQL 'count' function):

df.groupBy("x").count()
  .withColumnRenamed("count", "n")
  .filter("n >= 2")
  .show()

So, is this expected behavior, a bug, or is there a canonical way to work around it?

thanks, alex

Tags: scala, apache-spark, apache-spark-sql

Answer:


When you pass a string to the filter function, the string is interpreted as SQL. count is a SQL keyword, and using count as a column name confuses the parser. This is a small bug (you can file a JIRA ticket if you want to).

You can easily avoid this by using a column expression instead of a String:

df.groupBy("x").count()
  .filter($"count" >= 2)
  .show()
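If you prefer to keep the string form, escaping the column name with backticks should also get it past the SQL parser. This is a sketch; I haven't verified it against the 1.4.1 parser specifically:

// Backticks make the parser treat `count` as a column name
// rather than as the SQL count() function.
df.groupBy("x").count()
  .filter("`count` >= 2")
  .show()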

Is this expected behavior, or a bug?

Truth be told, I am not sure. It looks like the parser is interpreting count not as a column name but as a function, and expects parentheses to follow. It looks like a bug, or at least a serious limitation of the parser.

Is there a canonical way to work around it?

Some options have already been mentioned by Herman and mattinbits, so here is a more SQL-ish approach from me:

import org.apache.spark.sql.functions.count

// Alias the aggregate at creation time, so the filter never
// has to reference a column literally named "count".
df.groupBy("x")
  .agg(count("*").alias("cnt"))
  .where($"cnt" >= 2)
  .show()
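For completeness, a fully SQL route is to register the DataFrame as a temporary table and filter the groups with HAVING, where count(*) is unambiguous. A sketch using the Spark 1.x registerTempTable API; HiveContext certainly accepts HAVING, and I believe the plain SQLContext parser does too:

// Expose the DataFrame to SQL, then aggregate and filter in one query.
df.registerTempTable("paf")
sqlContext.sql(
  "SELECT x, count(*) AS cnt FROM paf GROUP BY x HAVING count(*) >= 2"
).show()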

