[apache-spark] Spark difference between reduceByKey vs groupByKey vs aggregateByKey vs combineByKey

ReduceByKey - reduceByKey(func, [numTasks])

Data is combined within each partition first, so that each partition emits only one combined value per key. Only then does the shuffle happen, and these partial results are sent over the network to the executor responsible for each key, where the final reduce is applied.
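A minimal sketch of this, assuming a spark-shell session (so `sc` is already available) and a hypothetical `pairs` RDD of word counts; the per-partition combine happens before anything is shuffled:

```scala
// Assumes the SparkContext `sc` provided by spark-shell; `pairs` is illustrative data.
val pairs = sc.parallelize(Seq(("spark", 1), ("scala", 1), ("spark", 1), ("rdd", 1)))

// Values are combined per partition first (map-side combine),
// then only the partial sums are shuffled and merged per key.
val counts = pairs.reduceByKey(_ + _)

counts.collect().foreach(println)   // e.g. (spark,2), (scala,1), (rdd,1)
```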

GroupByKey - groupByKey([numTasks])

It does not merge the values for each key before the shuffle; the shuffle happens directly, so almost as much data as the original data set is sent over the network to each partition.

The merging of values for each key is done only after the shuffle. A lot of data ends up being held on the receiving worker nodes, which can result in out-of-memory errors.
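For contrast, a sketch of the same count done with groupByKey, reusing the hypothetical `pairs` RDD above; every raw record crosses the network before any merging happens:

```scala
// No map-side combine: all (key, value) records are shuffled as-is.
val grouped = pairs.groupByKey()              // RDD[(String, Iterable[Int])]
val countsViaGroup = grouped.mapValues(_.sum) // merging happens only after the shuffle

countsViaGroup.collect().foreach(println)     // same result, but far more data moved
```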

AggregateByKey - aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])

It is similar to reduceByKey, but you provide an initial (zero) value for the aggregation, and the result type can differ from the input value type.
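A sketch of aggregateByKey on the same hypothetical `pairs` RDD, computing a (sum, count) accumulator per key; note the zero value, the two functions, and that the result type differs from the input value type:

```scala
// zeroValue (0, 0) is the per-key starting accumulator: (runningSum, count).
val sumAndCount = pairs.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),    // seqOp: fold a value in, within a partition
  (a, b)   => (a._1 + b._1, a._2 + b._2)   // combOp: merge accumulators across partitions
)

sumAndCount.collect().foreach(println)     // e.g. (spark,(2,2))
```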

Use of reduceByKey

  • reduceByKey can be used when running on a large data set.

  • prefer reduceByKey over aggregateByKey when the input and output value types are the same (see the sketch after this list).
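To illustrate that second point with the same hypothetical data: reduceByKey's function must return the value type itself, so producing a different type (here an average) takes an extra mapValues step, whereas aggregateByKey can change the type in one pass as shown above:

```scala
// reduceByKey keeps the value type, so build the (sum, count) pair first.
val avgPerKey = pairs
  .mapValues(v => (v, 1))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
  .mapValues { case (sum, cnt) => sum.toDouble / cnt }

avgPerKey.collect().foreach(println)       // e.g. (spark,1.0)
```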

Moreover, it is recommended to avoid groupByKey and prefer reduceByKey. For details you can refer here.

You can also refer to this question to understand in more detail how reduceByKey and aggregateByKey differ.