Spark - repartition() vs coalesce()

Question

According to Learning Spark:

"Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions."

One difference I get is that with repartition() the number of partitions can be increased or decreased, but with coalesce() the number of partitions can only be decreased.

If the partitions are spread across multiple machines and coalesce() is run, how can it avoid data movement?

User · Answer

Justin's answer is awesome and this response goes into more depth.

The repartition algorithm does a full shuffle and creates new partitions with data that's distributed evenly. Let's create a DataFrame with the numbers from 1 to 12.

    val x = (1 to 12).toList
    val numbersDf = x.toDF("number")

numbersDf contains 4 partitions on my machine.

    numbersDf.rdd.partitions.size // => 4

Here is how the data is divided on the partitions:

    Partition 00000: 1, 2, 3
    Partition 00001: 4, 5, 6
    Partition 00002: 7, 8, 9
    Partition 00003: 10, 11, 12

Let's do a full shuffle with the repartition method and get this data on two nodes.

    val numbersDfR = numbersDf.repartition(2)

Here is how the numbersDfR data is partitioned on my machine:

    Partition A: 1, 3, 4, 6, 7, 9, 10, 12
    Partition B: 2, 5, 8, 11

The repartition method makes new partitions and evenly distributes the data in the new partitions (the data distribution is more even for larger data sets).

Difference between coalesce and repartition

coalesce uses existing partitions to minimize the amount of data that's shuffled. repartition creates new partitions and does a full shuffle. coalesce results in partitions with different amounts of data (sometimes partitions that have much different sizes) and repartition results in roughly equal sized partitions.

Is coalesce or repartition faster?

coalesce may run faster than repartition, but unequal sized partitions are generally slower to work with than equal sized partitions. You'll usually need to repartition datasets after filtering a large data set. I've found repartition to be faster overall because Spark is built to work with equal sized partitions.

N.B. I've curiously observed that repartition can increase the size of data on disk. Make sure to run tests when you're using repartition / coalesce on large datasets.

Read this blog post if you'd like even more details.

When you'll use coalesce & repartition in practice

See this question on how to use coalesce & repartition to write out a DataFrame to a single file.

It's critical to repartition after running filtering queries. The number of partitions does not change after filtering, so if you don't repartition, you'll have way too many memory partitions (the more the filter reduces the dataset size, the bigger the problem). Watch out for the empty partition problem.

partitionBy is used to write out data in partitions on disk. You'll need to use repartition / coalesce to partition your data in memory properly before using partitionBy.
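To make the point about filtering concrete, here is a toy model in plain Python. It is not Spark code: each inner list stands in for one partition, and the `filter_partitions` and `repartition` helpers are made up for illustration (the round-robin dealing is an assumption that only approximates what a full shuffle achieves). It shows that a selective filter leaves the partition count unchanged, with most partitions empty, and that redistributing the rows evens things out.

```python
# Toy model in plain Python, not Spark: each inner list is one partition.

def filter_partitions(partitions, predicate):
    """Filter partition by partition; the number of partitions never changes."""
    return [[row for row in part if predicate(row)] for part in partitions]

def repartition(partitions, n):
    """Model a full shuffle by dealing every row round-robin into n new partitions."""
    new_parts = [[] for _ in range(n)]
    rows = [row for part in partitions for row in part]
    for i, row in enumerate(rows):
        new_parts[i % n].append(row)
    return new_parts

# 8 partitions holding the numbers 0..79, ten rows each.
parts = [list(range(start, start + 10)) for start in range(0, 80, 10)]

# A selective filter: keep only rows below 15.
filtered = filter_partitions(parts, lambda row: row < 15)
sizes_after_filter = [len(p) for p in filtered]

# Redistribute the 15 surviving rows over 2 partitions.
sizes_after_repartition = [len(p) for p in repartition(filtered, 2)]

print(sizes_after_filter)       # [10, 5, 0, 0, 0, 0, 0, 0]
print(sizes_after_repartition)  # [8, 7]
```

After the filter there are still 8 partitions, six of them empty, which is the "way too many memory partitions" situation described above.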

User · Answer

For anyone who had issues generating a single CSV file from PySpark (AWS EMR) as an output and saving it on S3: using repartition helped. The reason is that coalesce cannot do a full shuffle, but repartition can. Essentially, you can increase or decrease the number of partitions using repartition, but can only decrease the number of partitions using coalesce. Here is the code for anyone who is trying to write a CSV from AWS EMR to S3:

    df.repartition(1).write.format('csv')\
        .option("path", "s3a://my.bucket.name/location")\
        .save(header = 'true')

User · Answer

Put simply:

COALESCE is only for decreasing the number of partitions. There is no shuffling of data; it just merges the partitions.

REPARTITION is for both increasing and decreasing the number of partitions, but shuffling takes place.

Example:

    val rdd = sc.textFile("path", 7)
    rdd.repartition(10)
    rdd.repartition(2)

Both work fine. We generally reach for these two when we need to see the output in one cluster.

User · Answer

To all the great answers I would like to add that repartition is one of the best options for taking advantage of data parallelization, while coalesce gives a cheap option to reduce the number of partitions and is very useful when writing data to HDFS or some other sink to take advantage of big writes.

I have found this useful when writing data in Parquet format.

User · Answer

Another difference shows up when there is a skewed join and you have to coalesce on top of it. A repartition will solve the skewed join in most cases; then you can do the coalesce.

Another situation: suppose you have saved a medium/large volume of data in a DataFrame and you have to produce it to Kafka in batches. A repartition helps to collectAsList before producing to Kafka in certain cases. But when the volume is really high, the repartition will likely cause a serious performance impact. In that case, producing to Kafka directly from the DataFrame would help.

Side note: coalesce does not avoid data movement, as in full data movement between workers. It does reduce the number of shuffles happening, though. I think that's what the book means.

User · Answer

What follows from the code and code docs is that coalesce(n) is the same as coalesce(n, shuffle = false) and repartition(n) is the same as coalesce(n, shuffle = true).

Thus, both coalesce and repartition can be used to increase the number of partitions:

"With shuffle = true, you can actually coalesce to a larger number of partitions. This is useful if you have a small number of partitions, say 100, potentially with a few partitions being abnormally large."

Another important note to accentuate is that if you drastically decrease the number of partitions you should consider using the shuffled version of coalesce (same as repartition in that case). This will allow your computations to be performed in parallel on parent partitions (multiple tasks):

"However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can pass shuffle = true. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is)."

Please also refer to the related answer here.
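The relationship between the two calls can be sketched with a toy model in plain Python. This is not Spark's implementation: the `coalesce` and `repartition` functions below are hypothetical stand-ins operating on lists of lists, the contiguous grouping of parents and the round-robin dealing are simplifying assumptions, and only the `shuffle` flag mirrors the idea from the docs quoted above.

```python
# Toy model in plain Python, not Spark's implementation.

def coalesce(partitions, n, shuffle=False):
    if shuffle:
        # Full shuffle: every row is redistributed, so n may be larger or smaller.
        new_parts = [[] for _ in range(n)]
        rows = [row for part in partitions for row in part]
        for i, row in enumerate(rows):
            new_parts[i % n].append(row)
        return new_parts
    # No shuffle: parent partitions are only merged, never split,
    # so the partition count cannot grow.
    n = min(n, len(partitions))
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged

def repartition(partitions, n):
    # In this model repartition is just coalesce with the shuffle turned on.
    return coalesce(partitions, n, shuffle=True)

parts = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

print(len(coalesce(parts, 8)))     # 4: cannot increase without a shuffle
print(len(repartition(parts, 8)))  # 8
print(coalesce(parts, 2))          # [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]]
```

Without the shuffle, whole parent partitions are glued together; with it, every row is free to land anywhere, which is what makes increasing the count possible.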

User · Answer

Basically, repartition allows you to increase or decrease the number of partitions. Repartition redistributes the data from all the partitions, and this leads to a full shuffle, which is a very expensive operation.

Coalesce is the optimized version of repartition where you can only reduce the number of partitions. Since we can only reduce the number of partitions, what it does is merge some of the partitions into a single partition. By merging partitions, the movement of data across partitions is lower compared to repartition. So coalesce involves minimal data movement, but saying that coalesce does no data movement is completely wrong.

The other thing is that repartition, given the number of partitions, tries to redistribute the data uniformly across all the partitions, while with coalesce we could still have skewed data in some cases.

User · Answer

One additional point to note here is that, since the basic principle of a Spark RDD is immutability, repartition or coalesce will create a new RDD. The base RDD will continue to exist with its original number of partitions. If the use case demands persisting the RDD in cache, then the same has to be done for the newly created RDD.

    scala> pairMrkt.repartition(10)
    res16: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[11] at repartition at <console>:26

    scala> res16.partitions.length
    res17: Int = 10

    scala> pairMrkt.partitions.length
    res20: Int = 2

User · Answer

repartition: it's recommended when increasing the number of partitions, because it involves shuffling all the data.

coalesce: it's recommended when reducing the number of partitions. For example, if you have 3 partitions and you want to reduce them to 2, coalesce will move the 3rd partition's data to partitions 1 and 2. Partitions 1 and 2 will remain in the same container. On the other hand, repartition will shuffle data in all the partitions, so the network usage between the executors will be high and it will impact performance.

coalesce performs better than repartition when reducing the number of partitions.

User · Answer

Also make sure that the nodes receiving the coalesced data are well provisioned if you are dealing with huge data, because all the data will be loaded onto those nodes, which may lead to out-of-memory errors. Though repartition is costly, I prefer to use it, since it shuffles and distributes the data equally.

Choose wisely between coalesce and repartition.

User · Answer

There is a use case for repartition >> coalesce even where the partition number decreases, mentioned in @Rob's answer: writing data to a single file.

@Rob's answer hints in the right direction, but I think some further explanation is needed to understand what's going on under the hood.

If you need to filter your data before writing, then repartition is much more suitable than coalesce, since coalesce will be pushed down right before the loading operation.

For instance:

    load().map(…).filter(…).coalesce(1).save()

translates to:

    load().coalesce(1).map(…).filter(…).save()

This means that all your data will collapse into a single partition, where it will be filtered, losing all parallelism. This happens even for very simple filters like column = 'value'.

This does not happen with repartition:

    load().map(…).filter(…).repartition(1).save()

In that case, filtering happens in parallel on the original partitions.

Just to give an order of magnitude: in my case, when filtering 109M rows (~105G) with ~1000 partitions after loading from a Hive table, the runtime dropped from ~6h for coalesce(1) to ~2m for repartition(1).

The specific example is taken from this article from AirBnB, which is pretty good and covers even more aspects of repartitioning techniques in Spark.
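One way to picture the pushdown is to count how many parallel tasks each step gets. The sketch below is a simplified model in plain Python, not Spark's scheduler: the `tasks_per_op` helper and the tuple-based pipeline format are hypothetical, and it assumes that only repartition starts a new stage while a non-shuffling coalesce stays in the same stage as the steps before it.

```python
# Simplified model of stage boundaries; not Spark's actual scheduler.

def tasks_per_op(pipeline):
    """Return [(op_name, task_count)] for a linear pipeline of (name, arg) steps."""
    # Split into stages: in this model only 'repartition' adds a shuffle boundary.
    stages = [[]]
    for name, arg in pipeline:
        if name == "repartition":
            stages.append([])
        stages[-1].append((name, arg))

    result = []
    partitions = 0
    for stage in stages:
        for name, arg in stage:
            if name in ("load", "repartition"):
                partitions = arg
            elif name == "coalesce":
                partitions = min(partitions, arg)
        # Every op in a stage runs with the stage's final partition count,
        # which is why a trailing coalesce(1) throttles the ops before it.
        result.extend((name, partitions) for name, _ in stage)
    return result

with_coalesce = [("load", 1000), ("map", None), ("filter", None),
                 ("coalesce", 1), ("save", None)]
with_repartition = [("load", 1000), ("map", None), ("filter", None),
                    ("repartition", 1), ("save", None)]

print(dict(tasks_per_op(with_coalesce))["filter"])     # 1
print(dict(tasks_per_op(with_repartition))["filter"])  # 1000
```

In the coalesce pipeline the filter runs as a single task, while in the repartition pipeline it keeps all 1000 tasks and only the final write is single-threaded.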

User · Answer

All the answers are adding some great knowledge to this very often asked question.

So, going by the tradition of this question's timeline, here are my 2 cents.

I found repartition to be faster than coalesce in a very specific case.

In my application, when the number of files that we estimate is lower than a certain threshold, repartition works faster.

Here is what I mean:

    if (numFiles > 20)
        df.coalesce(numFiles).write.mode(SaveMode.Overwrite).parquet(dest)
    else
        df.repartition(numFiles).write.mode(SaveMode.Overwrite).parquet(dest)

In the above snippet, if my files were fewer than 20, coalesce was taking forever to finish while repartition was much faster, hence the above code.

Of course, this number (20) will depend on the number of workers and the amount of data.

Hope that helps.

User · Answer

I would like to add to Justin's and Power's answers:

repartition will ignore existing partitions and create new ones, so you can use it to fix data skew. You can specify partition keys to define the distribution. Data skew is one of the biggest problems in the "big data" problem space.

coalesce will work with existing partitions and shuffle a subset of them. It can't fix data skew as much as repartition does. Therefore, even if it is less expensive, it might not be the thing you need.

User · Answer

The repartition algorithm does a full shuffle of the data and creates equal sized partitions of data. coalesce combines existing partitions to avoid a full shuffle.

Coalesce works well for taking an RDD with a lot of partitions and combining partitions on a single worker node to produce a final RDD with fewer partitions.

Repartition will reshuffle the data in your RDD to produce the final number of partitions you request. The partitioning of DataFrames seems like a low-level implementation detail that should be managed by the framework, but it's not. When filtering large DataFrames into smaller ones, you should almost always repartition the data. You'll probably be filtering large DataFrames into smaller ones frequently, so get used to repartitioning.

Read this blog post if you'd like even more details.

User · Answer

It avoids a full shuffle. If it's known that the number is decreasing, then the executor can safely keep data on the minimum number of partitions, only moving the data off the extra nodes onto the nodes that we kept.

So, it would go something like this:

    Node 1 = 1,2,3
    Node 2 = 4,5,6
    Node 3 = 7,8,9
    Node 4 = 10,11,12

Then coalesce down to 2 partitions:

    Node 1 = 1,2,3 + (10,11,12)
    Node 3 = 7,8,9 + (4,5,6)

Notice that Node 1 and Node 3 did not require their original data to move.
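The four-node example can be replayed in a few lines of plain Python to count how many rows actually move. This is only an illustration, not Spark code: the `coalesce_onto` helper is hypothetical, and the merge plan (which node absorbs which) is passed in by hand here, whereas Spark decides the grouping itself.

```python
# Toy replay of the four-node example; not Spark code.

nodes = {
    "Node 1": [1, 2, 3],
    "Node 2": [4, 5, 6],
    "Node 3": [7, 8, 9],
    "Node 4": [10, 11, 12],
}

def coalesce_onto(nodes, merges):
    """merges maps each node being emptied to the kept node that absorbs its rows."""
    # Rows on kept nodes stay exactly where they are.
    result = {name: list(rows) for name, rows in nodes.items() if name not in merges}
    moved = 0
    for source, target in merges.items():
        result[target].extend(nodes[source])
        moved += len(nodes[source])
    return result, moved

# Hand-written merge plan matching the example above.
result, moved = coalesce_onto(nodes, {"Node 4": "Node 1", "Node 2": "Node 3"})

print(result)  # {'Node 1': [1, 2, 3, 10, 11, 12], 'Node 3': [7, 8, 9, 4, 5, 6]}
print(moved)   # 6
```

Only the 6 rows from the two emptied nodes travel; a full shuffle would instead be free to redistribute all 12.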

User · Answer

Repartition: shuffle the data into a NEW number of partitions.

E.g. the initial data frame is partitioned into 200 partitions.

    df.repartition(500)

Data will be shuffled from the 200 partitions into 500 new partitions.

Coalesce: shuffle the data into an existing number of partitions.

    df.coalesce(5)

Data will be shuffled from the remaining 195 partitions into the 5 existing partitions.
