How to sum the values of one column of a dataframe in spark/scala

Question

I have a DataFrame that I read from a CSV file with many columns, like timestamp, steps, heartrate, etc.

I want to sum the values of each column, for instance the total number of steps in the "steps" column.

As far as I can see, I want to use these kinds of functions: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

But I can't understand how to use the function sum.

When I write the following:

    val df = CSV.load(args(0))
    val sumSteps = df.sum("steps")

the function sum cannot be resolved.

Am I using the function sum wrongly? Do I need to use the function map first, and if so, how?

A simple example would be very helpful; I started writing Scala recently.

User · Answer

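Simply apply the aggregation function sum to your column. Group with an empty groupBy() so that the whole DataFrame is treated as one group; grouping by "steps" itself would produce one row per distinct step value rather than a total:

    df.groupBy().sum("steps").show()

See the documentation (PySpark API): http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html

Also check out this link: https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations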

User · Answer

Not sure this was around when this question was asked, but:

    df.describe("columnName").show()

gives count, mean, and stddev stats (plus min and max) on a column. I think it returns stats on all columns if you just do df.describe().show().

User · Answer

Using a Spark SQL query, just in case it helps anyone:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.SparkConf

    val conf = new SparkConf().setMaster("local[2]").setAppName("test")
    val spark = SparkSession.builder.config(conf).getOrCreate()
    // Needed for toDF and for the Long encoder used by map below
    import spark.implicits._

    // Name the column explicitly; a bare toDF() would call it "value"
    val df = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 6, 7)).toDF("steps")

    // Register the DataFrame as a temporary view named "steps"
    df.createOrReplaceTempView("steps")
    val sum = spark.sql("select sum(steps) as stepsSum from steps")
      .map(row => row.getAs[Long]("stepsSum"))
      .collect()(0)
    println("steps sum = " + sum) // prints steps sum = 28

User · Answer

You must first import the functions:

    import org.apache.spark.sql.functions._

Then you can use them like this:

    val df = CSV.load(args(0))
    val sumSteps = df.agg(sum("steps")).first.get(0)

You can also cast the result if needed:

    val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)

Edit:

For multiple columns (e.g. "col1", "col2", ...), you could get all aggregations at once:

    val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first

Edit2:

For dynamically applying the aggregations, the following options are available:

Applying to all numeric columns at once:

    df.groupBy().sum()

Applying to a list of numeric column names:

    val columnNames = List("col1", "col2")
    df.groupBy().sum(columnNames: _*)

Applying to a list of numeric column names with aliases and/or casts:

    val cols = List("col1", "col2")
    val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))
    df.groupBy().agg(sums.head, sums.tail: _*).show()
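For readers who want something they can paste and run, here is a minimal self-contained sketch of the agg(sum(...)) approach from the answer above. It assumes Spark 2.x or later and a local SparkSession. The sample data, the object name SumColumnSketch, and the helper sumColumn are made up for illustration and are not part of the original answer. The helper returns an Option because sum yields null on an empty DataFrame or an all-null column, and getLong would throw on that null.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

object SumColumnSketch {
  // Hypothetical helper: sums one column and returns None when the
  // aggregate is null (empty DataFrame or a column containing only nulls).
  def sumColumn(df: DataFrame, name: String): Option[Long] = {
    // agg without groupBy yields exactly one row; first() brings it to the driver.
    val row = df.agg(sum(name).cast("long")).first()
    if (row.isNullAt(0)) None else Some(row.getLong(0))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session stands in for the real cluster setup.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("sum-column-sketch")
      .getOrCreate()
    import spark.implicits._ // enables toDF on local collections

    // Made-up rows standing in for the CSV file from the question.
    val df = Seq((1L, 100, 60), (2L, 250, 72), (3L, 50, 65))
      .toDF("timestamp", "steps", "heartrate")

    println(sumColumn(df, "steps"))          // Some(400)
    println(sumColumn(df.limit(0), "steps")) // None

    spark.stop()
  }
}
```

If the column is known to be non-empty and non-null, the shorter df.agg(sum("steps")).first.getLong(0) form from the answer is enough; the Option wrapper only matters when a null result is possible.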

User · Answer

If you want to sum all values of one column, it's more efficient to use DataFrame's internal RDD and reduce.

    import sqlContext.implicits._
    import org.apache.spark.sql.functions._

    val df = sc.parallelize(Array(10, 2, 3, 4)).toDF("steps")
    df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_ + _)

    // res1: Int = 19
