Spark: Add column to dataframe conditionally

Question

I am trying to take my input data:

    A    B       C
    --------------
    4    blah    2
    2            3
    56   foo     3

And add a column to the end based on whether B is empty or not:

    A    B       C     D
    --------------------
    4    blah    2     1
    2            3     0
    56   foo     3     1

I can do this easily by registering the input dataframe as a temp table, then typing up a SQL query.

But I'd really like to know how to do this with just Scala methods and not having to type out a SQL query within Scala.

I've tried .withColumn, but I can't get that to do what I want.

User · Accepted Answer

Try withColumn with the function when, as follows:

    import org.apache.spark.sql.SQLContext
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._            // for toDF and $""
    import org.apache.spark.sql.functions._  // for when

    val df = sc.parallelize(Seq((4, "blah", 2), (2, "", 3), (56, "foo", 3), (100, null, 5)))
      .toDF("A", "B", "C")

    // D is 0 when B is null or empty, and 1 otherwise
    val newDf = df.withColumn("D", when($"B".isNull or $"B" === "", 0).otherwise(1))

newDf.show() shows:

    +---+----+---+---+
    |  A|   B|  C|  D|
    +---+----+---+---+
    |  4|blah|  2|  1|
    |  2|    |  3|  0|
    | 56| foo|  3|  1|
    |100|null|  5|  0|
    +---+----+---+---+

I added the (100, null, 5) row to test the isNull case.

I tried this code with Spark 1.6.0 but, as noted in the source comments for when, it works on versions since 1.4.0.
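For readers on Spark 2.0 or later, where SparkSession is the usual entry point instead of SQLContext, the same when/otherwise idea can be sketched as follows. This is only an illustration under assumptions: the object name AddFlagColumn, the helper withFlag, and the local[*] master are hypothetical choices, not part of the answer above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

// Hypothetical wrapper object, used only for this sketch.
object AddFlagColumn {

  // Adds column D: 0 when B is null or an empty string, 1 otherwise.
  def withFlag(df: DataFrame): DataFrame =
    df.withColumn("D", when(col("B").isNull || (col("B") === ""), 0).otherwise(1))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("add-flag-column")
      .getOrCreate()
    import spark.implicits._ // for toDF on a local Seq

    // Same rows as in the answer, including the null test row.
    val df = Seq[(Int, String, Int)](
      (4, "blah", 2), (2, "", 3), (56, "foo", 3), (100, null, 5)
    ).toDF("A", "B", "C")

    withFlag(df).show()
    spark.stop()
  }
}
```

Keeping the column logic in a small function that takes and returns a DataFrame makes it easy to reuse, and it can be chained with df.transform(AddFlagColumn.withFlag).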

User · Answer

My bad, I had missed one part of the question.

The best, cleanest way is to use a UDF. Explanation within the code.

    // create some example data as a DataFrame
    // note: the third record has an empty string
    case class Stuff(a: String, b: Int)
    val d = sc.parallelize(Seq(("a", 1), ("b", 2), ("", 3), ("d", 4))
      .map { x => Stuff(x._1, x._2) }).toDF

    // now the good stuff
    import org.apache.spark.sql.functions.udf
    // function that returns 0 if the string is empty, 1 otherwise
    val func = udf((s: String) => if (s.isEmpty) 0 else 1)
    // create a new dataframe with an added column named "notempty"
    val r = d.select($"a", $"b", func($"a").as("notempty"))

    scala> r.show
    +---+---+--------+
    |  a|  b|notempty|
    +---+---+--------+
    |  a|  1|       1|
    |  b|  2|       1|
    |   |  3|       0|
    |  d|  4|       1|
    +---+---+--------+
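One caveat with the UDF above: calling s.isEmpty on a null string throws a NullPointerException, so a row with a null value would fail. A null-safe variant might look like the sketch below. The names NullSafeUdf, notEmptyFlag, and notEmptyUdf are hypothetical, and the example assumes Spark 2.0 or later with a local SparkSession.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical wrapper object, used only for this sketch.
object NullSafeUdf {

  // Plain Scala function: returns 0 for null or empty strings, 1 otherwise.
  def notEmptyFlag(s: String): Int = if (s == null || s.isEmpty) 0 else 1

  // The same function wrapped as a Spark UDF.
  val notEmptyUdf = udf(notEmptyFlag _)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("null-safe-udf")
      .getOrCreate()
    import spark.implicits._ // for toDF on a local Seq

    // Example rows: one empty string and one null in column a.
    val d = Seq[(String, Int)](("a", 1), ("b", 2), ("", 3), (null, 4)).toDF("a", "b")

    d.select(col("a"), col("b"), notEmptyUdf(col("a")).as("notempty")).show()
    spark.stop()
  }
}
```

Defining the logic as an ordinary function first means it can be unit-tested without starting Spark, and the UDF is then just a thin wrapper around it.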

User · Answer

How about something like this?

    val newDF = df.filter($"B" === "").take(1) match {
      case Array() => df
      case _ => df.withColumn("D", $"B" === "")
    }

Using take(1) should have a minimal hit.
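Note that the snippet above adds D only when at least one empty B exists, and the new column is a boolean. If an unconditional 0/1 column like the one in the question is wanted, one possible variation is to cast a boolean expression to an integer. This is a sketch under assumptions: the names BooleanFlag and withFlag are hypothetical, and the =!= operator requires Spark 2.0 or later.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical wrapper object, used only for this sketch.
object BooleanFlag {

  // Adds column D by casting a boolean to int: 1 when B is non-null and
  // non-empty, 0 otherwise. For a null B, isNotNull is false, so the
  // conjunction evaluates to false rather than null.
  def withFlag(df: DataFrame): DataFrame =
    df.withColumn("D", (col("B").isNotNull && (col("B") =!= "")).cast("int"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("boolean-flag")
      .getOrCreate()
    import spark.implicits._ // for toDF on a local Seq

    val df = Seq[(Int, String, Int)](
      (4, "blah", 2), (2, "", 3), (56, "foo", 3), (100, null, 5)
    ).toDF("A", "B", "C")

    withFlag(df).show()
    spark.stop()
  }
}
```

This avoids the extra filter and take(1) job, since the column is computed in the same pass as any later action on the DataFrame.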
