ReduceByKey reduceByKey(func, [numTasks])
-
Data is combined so that at each partition there should be at least one value for each key. And then shuffle happens and it is sent over the network to some particular executor for some action such as reduce.
GroupByKey - groupByKey([numTasks])
It doesn't merge the values for the key but directly the shuffle process happens and here lot of data gets sent to each partition, almost same as the initial data.
And the merging of values for each key is done after the shuffle. Here lot of data stored on final worker node so resulting in out of memory issue.
AggregateByKey - aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
It is similar to reduceByKey but you can provide initial values when performing aggregation.
Use of reduceByKey
reduceByKey
can be used when we run on large data set.
reduceByKey
when the input and output value types are of same type
over aggregateByKey
Moreover it recommended not to use groupByKey
and prefer reduceByKey
. For details you can refer here.
You can also refer this question to understand in more detail how reduceByKey
and aggregateByKey
.