Apache Spark map vs mapPartitions

Question

What's the difference between an RDD's map and mapPartitions method? And does flatMap behave like map or like mapPartitions? Thanks.

(edit) i.e. what is the difference (either semantically or in terms of execution) between

    def map[A, B](rdd: RDD[A], fn: (A => B))
                 (implicit a: Manifest[A], b: Manifest[B]): RDD[B] = {
      rdd.mapPartitions({ iter: Iterator[A] => for (i <- iter) yield fn(i) },
        preservesPartitioning = true)
    }

And:

    def map[A, B](rdd: RDD[A], fn: (A => B))
                 (implicit a: Manifest[A], b: Manifest[B]): RDD[B] = {
      rdd.map(fn)
    }

User · Answer

Imp. TIP :

Whenever you have heavyweight initialization that should be done once for many RDD elements rather than once per RDD element, and this initialization (such as creating objects from a third-party library) cannot be serialized so that Spark can transmit it across the cluster to the worker nodes, use mapPartitions() instead of map(). mapPartitions() lets the initialization be done once per worker task/thread/partition instead of once per RDD data element. For example:

    val newRd = myRdd.mapPartitions(partition => {
      val connection = new DbConnection // creates a db connection per partition

      val newPartition = partition.map(record => {
        readMatchingFromDB(record, connection)
      }).toList // consumes the iterator, thus calls readMatchingFromDB

      connection.close() // close the db connection here
      newPartition.iterator // create a new iterator
    })

Q2. Does flatMap behave like map or like mapPartitions?

flatMap behaves like map in that the function is applied per element, but each call may return zero or more results. See Example 2 below, which writes the same program first with mapPartitions and then with flatMap.

Q1. What's the difference between an RDD's map and mapPartitions?

map applies the function at the per-element level, while mapPartitions applies the function at the partition level.

Example scenario: if we have 100K elements in a particular RDD partition, then the function used by the mapping transformation is invoked 100K times when we use map.

Conversely, if we use mapPartitions, the function is called only once, but it receives all 100K records and returns all the results in that one call.

There can be a performance gain because map invokes the function so many times, especially if the function does something expensive on each call that it would not need to repeat if all the elements were passed in at once, as with mapPartitions.

map

Applies a transformation function on each item of the RDD and returns the result as a new RDD.

Listing Variants

    def map[U: ClassTag](f: T => U): RDD[U]

Example:

    val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
    val b = a.map(_.length)
    val c = a.zip(b)
    c.collect
    res0: Array[(String, Int)] = Array((dog,3), (salmon,6), (salmon,6), (rat,3), (elephant,8))

mapPartitions

This is a specialized map that is called only once for each partition. The entire content of the respective partition is available as a sequential stream of values via the input argument (Iterator[T]). The custom function must return yet another Iterator[U]. The combined result iterators are automatically converted into a new RDD. Please note that the tuples (3,4) and (6,7) are missing from the following result due to the partitioning we chose.

preservesPartitioning indicates whether the input function preserves the partitioner, which should be false unless this is a pair RDD and the input function doesn't modify the keys.

Listing Variants

    def mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

Example 1

    val a = sc.parallelize(1 to 9, 3)
    def myfunc[T](iter: Iterator[T]): Iterator[(T, T)] = {
      var res = List[(T, T)]()
      var pre = iter.next
      while (iter.hasNext) {
        val cur = iter.next
        res = (pre, cur) :: res // prepend the pair of neighbours
        pre = cur
      }
      res.iterator
    }
    a.mapPartitions(myfunc).collect
    res0: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))

Example 2

    val x = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 3)
    def myfunc(iter: Iterator[Int]): Iterator[Int] = {
      var res = List[Int]()
      while (iter.hasNext) {
        val cur = iter.next
        res = res ::: List.fill(scala.util.Random.nextInt(10))(cur)
      }
      res.iterator
    }
    x.mapPartitions(myfunc).collect
    // some of the numbers are not output at all, because the random number generated for them is zero
    res8: Array[Int] = Array(1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 7, 7, 7, 9, 9, 10)

The above program can also be written using flatMap as follows.

Example 2 using flatMap

    val x = sc.parallelize(1 to 10, 3)
    x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect
    res1: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10)

Conclusion :

mapPartitions can be faster than map when there is per-call overhead worth amortizing, such as expensive setup, since it calls your function once per partition, not once per element.

Further reading : foreach Vs foreachPartitions When to use What?
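The call-count argument in this answer can be checked without a cluster. What follows is a minimal sketch in plain Scala, not Spark code: partitions are modeled as a `Seq[Seq[Int]]`, and `ExpensiveClient`, `perElement` and `perPartition` are hypothetical names standing in for a costly resource (like the `DbConnection` in the tip) and the two styles of use. It only counts how often setup happens under each style.

```scala
// Plain-Scala model of the two styles; no Spark dependency.
// ExpensiveClient is a hypothetical stand-in for a costly, non-serializable resource.
object SetupCount {
  var setups = 0 // incremented each time a client is constructed

  final class ExpensiveClient {
    setups += 1
    def lookup(x: Int): Int = x * 10
  }

  // "map" style: the function runs once per element, so setup is repeated per element.
  def perElement(partitions: Seq[Seq[Int]]): Seq[Int] =
    partitions.flatMap(_.map(x => new ExpensiveClient().lookup(x)))

  // "mapPartitions" style: the function runs once per partition and receives an iterator.
  def perPartition(partitions: Seq[Seq[Int]]): Seq[Int] =
    partitions.flatMap { part =>
      val client = new ExpensiveClient()      // one setup for the whole partition
      part.iterator.map(client.lookup).toList // every element reuses the same client
    }
}
```

With three partitions of three elements each, `perElement` constructs nine clients and `perPartition` constructs three, while both return the same values.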

User · Answer

Map :

- It processes one row at a time, very similar to the map() method of MapReduce.
- You return from the transformation after every row.

MapPartitions :

- It processes the complete partition in one go.
- You can return from the function only once, after it has been handed the whole partition.
- If you materialize the results (for example with toList) rather than returning a lazy iterator, all intermediate results need to be held in memory until the whole partition is processed.
- It gives you something like the setup(), map() and cleanup() functions of MapReduce.

Map Vs mapPartitions: http://bytepadding.com/big-data/spark/spark-map-vs-mappartitions

Spark Map: http://bytepadding.com/big-data/spark/spark-map

Spark mapPartitions: http://bytepadding.com/big-data/spark/spark-mappartitions
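A note on the memory and cleanup points in this answer: the function given to mapPartitions may return a lazy iterator, in which case results are produced as they are pulled rather than buffered. The catch is that cleanup must then wait until the iterator has been consumed. The sketch below shows one way to arrange that in plain Scala, without Spark; `Resource`, `closingIterator` and `process` are hypothetical names introduced only for illustration.

```scala
// Plain-Scala sketch of setup / map / cleanup over a lazy iterator.
// Resource and closingIterator are hypothetical helpers, not Spark APIs.
object LazyCleanup {
  final class Resource {
    var closed = false
    def read(x: Int): Int = {
      require(!closed, "resource used after close")
      x + 1
    }
    def close(): Unit = { closed = true }
  }

  // Wraps an iterator so that `cleanup` runs once, when no elements remain.
  def closingIterator[A](underlying: Iterator[A])(cleanup: () => Unit): Iterator[A] =
    new Iterator[A] {
      private var done = false
      def hasNext: Boolean = {
        val more = underlying.hasNext
        if (!more && !done) { done = true; cleanup() }
        more
      }
      def next(): A = underlying.next()
    }

  // Maps lazily over the partition and closes the resource only after the last element.
  def process(partition: Iterator[Int], res: Resource): Iterator[Int] =
    closingIterator(partition.map(res.read))(() => res.close())
}
```

The resource stays open until the caller has drained the result. If a consumer stops early, the cleanup in this sketch never runs, so materializing with toList before closing remains the simpler choice when the partition fits in memory.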

User · Answer

Map:

- Map transformation.
- map works on a single Row at a time.
- map returns after each input Row.
- map doesn't hold the output result in memory.
- map has no way to know when the last row has been processed, so it cannot decide when to end a service.

    // map example
    import spark.implicits._ // needed for toDF and the Int encoder
    val dfList = (1 to 100).toList
    val df = dfList.toDF()
    val dfInt = df.map(x => x.getInt(0) + 2)
    display(dfInt) // display is available in Databricks notebooks

MapPartition:

- MapPartition transformation.
- mapPartitions works on a partition at a time.
- mapPartitions returns after processing all the rows in the partition.
- mapPartitions output is retained in memory, as it can return after processing all the rows in a particular partition.
- With mapPartitions, a service can be shut down before returning.

    // MapPartition example
    val dfList = (1 to 100).toList
    val df = dfList.toDF()
    val df1 = df.repartition(4).rdd.mapPartitions(itr => Iterator(itr.length)) // one row count per partition
    df1.collect()
    // display(df1.collect())

For more details, please refer to the Spark map vs mapPartitions transformation article. Hope this is helpful.

User · Answer

> What's the difference between an RDD's map and mapPartitions method?

The method map converts each element of the source RDD into a single element of the result RDD by applying a function. mapPartitions converts each partition of the source RDD into multiple elements of the result (possibly none).

> And does flatMap behave like map or like mapPartitions?

Neither. flatMap works on a single element (as map) and produces multiple elements of the result (as mapPartitions).
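The three output shapes described in this answer can be seen side by side in a small model. This is a plain-Scala sketch rather than Spark code: the nested `Seq` stands in for an RDD with two partitions, and the object and value names are hypothetical.

```scala
// Plain-Scala model of output cardinality; partitions are just nested sequences here.
object Cardinality {
  val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2, 3), Seq(4, 5, 6))

  // map: exactly one output per input element.
  val mapped: Seq[Int] = partitions.flatMap(_.map(_ * 2))

  // flatMap: zero or more outputs per input element
  // (x copies of each even x, nothing for odd x).
  val flatMapped: Seq[Int] =
    partitions.flatMap(_.flatMap(x => if (x % 2 == 0) List.fill(x)(x) else Nil))

  // mapPartitions: zero or more outputs per partition (here, one sum per partition).
  val perPartition: Seq[Int] = partitions.flatMap(part => Iterator(part.sum))
}
```

Here `mapped` has six elements (one per input), `flatMapped` has twelve (2 + 4 + 6 copies of the even inputs), and `perPartition` has two (one per partition).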
