[sorting] What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?

In Map Reduce programming the reduce phase has shuffling, sorting and reduce as its sub-parts. Sorting is a costly affair.

What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?

This question is related to sorting hadoop mapreduce hdfs shuffle

The answer is


I've always assumed this was necessary as the output from the mapper is the input for the reducer, so it was sorted based on the keyspace and then split into buckets for each reducer input. You want to ensure all the same values of a Key end up in the same bucket going to the reducer so they are reduced together. There is no point sending K1,V2 and K1,V4 to different reducers as they need to be together in order to be reduced.

Tried explaining it as simply as possible


There only two things that MapReduce does NATIVELY: Sort and (implemented by sort) scalable GroupBy.

Most of applications and Design Patterns over MapReduce are built over these two operations, which are provided by shuffle and sort.


Shuffling is the process by which intermediate data from mappers are transferred to 0,1 or more reducers. Each reducer receives 1 or more keys and its associated values depending on the number of reducers (for a balanced load). Further the values associated with each key are locally sorted.


Let's revisit key phases of Mapreduce program.

The map phase is done by mappers. Mappers run on unsorted input key/values pairs. Each mapper emits zero, one, or multiple output key/value pairs for each input key/value pairs.

The combine phase is done by combiners. The combiner should combine key/value pairs with the same key. Each combiner may run zero, once, or multiple times.

The shuffle and sort phase is done by the framework. Data from all mappers are grouped by the key, split among reducers and sorted by the key. Each reducer obtains all values associated with the same key. The programmer may supply custom compare functions for sorting and a partitioner for data split.

The partitioner decides which reducer will get a particular key value pair.

The reducer obtains sorted key/[values list] pairs, sorted by the key. The value list contains all values with the same key produced by mappers. Each reducer emits zero, one or multiple output key/value pairs for each input key/value pair.

Have a look at this javacodegeeks article by Maria Jurcovicova and mssqltips article by Datta for a better understanding

Below is the image from safaribooksonline article

enter image description here


This is a good reading. Hope it helps. In terms of sorting you are concerning, I think it is for the merge operation in last step of Map. When map operation is done, and need to write the result to local disk, a multi-merge will be operated on the splits generated from buffer. And for a merge operation, sorting each partition in advanced is helpful.


I thought of just adding some points missing in above answers. This diagram taken from here clearly states the what's really going on.

enter image description here

If I state again the real purpose of

  • Split: Improves the parallel processing by distributing the processing load across different nodes (Mappers), which would save the overall processing time.

  • Combine: Shrinks the output of each Mapper. It would save the time spending for moving the data from one node to another.

  • Sort (Shuffle & Sort): Makes it easy for the run-time to schedule (spawn/start) new reducers, where while going through the sorted item list, whenever the current key is different from the previous, it can spawn a new reducer.


Well, In Mapreduce there are two important phrases called Mapper and reducer both are too important, but Reducer is mandatory. In some programs reducers are optional. Now come to your question. Shuffling and sorting are two important operations in Mapreduce. First Hadoop framework takes structured/unstructured data and separate the data into Key, Value.

Now Mapper program separate and arrange the data into keys and values to be processed. Generate Key 2 and value 2 values. This values should process and re arrange in proper order to get desired solution. Now this shuffle and sorting done in your local system (Framework take care it) and process in local system after process framework cleanup the data in local system. Ok

Here we use combiner and partition also to optimize this shuffle and sort process. After proper arrangement, those key values passes to Reducer to get desired Client's output. Finally Reducer get desired output.

K1, V1 -> K2, V2 (we will write program Mapper), -> K2, V' (here shuffle and soft the data) -> K3, V3 Generate the output. K4,V4.

Please note all these steps are logical operation only, not change the original data.

Your question: What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?

Short answer: To process the data to get desired output. Shuffling is aggregate the data, reduce is get expected output.


Some of the data processing requirements doesn't need sort at all. Syncsort had made the sorting in Hadoop pluggable. Here is a nice blog from them on sorting. The process of moving the data from the mappers to the reducers is called shuffling, check this article for more information on the same.


Examples related to sorting

Sort Array of object by object field in Angular 6 Sorting a list with stream.sorted() in Java How to sort dates from Oldest to Newest in Excel? how to sort pandas dataframe from one column Reverse a comparator in Java 8 Find the unique values in a column and then sort them pandas groupby sort within groups pandas groupby sort descending order Efficiently sorting a numpy array in descending order? Swift: Sort array of objects alphabetically

Examples related to hadoop

Hadoop MapReduce: Strange Result when Storing Previous Value in Memory in a Reduce Class (Java) What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism? How to check Spark Version What are the pros and cons of parquet format compared to other formats? java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient How to export data from Spark SQL to CSV How to copy data from one HDFS to another HDFS? How to calculate Date difference in Hive Select top 2 rows in Hive Spark - load CSV file as DataFrame?

Examples related to mapreduce

Hadoop MapReduce: Strange Result when Storing Previous Value in Memory in a Reduce Class (Java) Java8: HashMap<X, Y> to HashMap<X, Z> using Stream / Map-Reduce / Collector What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming? Container is running beyond memory limits Hive ParseException - cannot recognize input near 'end' 'string' Count lines in large files Good MapReduce examples What is Hive: Return Code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask Setting the number of map tasks and reduce tasks Map and Reduce in .NET

Examples related to hdfs

What are the pros and cons of parquet format compared to other formats? How to copy data from one HDFS to another HDFS? Spark - load CSV file as DataFrame? hadoop copy a local file system folder to HDFS What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming? How to fix corrupt HDFS FIles How to copy file from HDFS to the local file system Name node is in safe mode. Not able to leave Hive load CSV with commas in quoted fields Permission denied at hdfs

Examples related to shuffle

Shuffle DataFrame rows What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming? Better way to shuffle two numpy arrays in unison How to randomize (shuffle) a JavaScript array? How can I shuffle the lines of a text file on the Unix command line or in a shell script? Random shuffling of an array Shuffling a list of objects Shuffle an array with python, randomize array item order with python