Apache Spark The number of cores vs the number of executors

Question

I m trying to understand the relationship of the number of cores and the number of executors when running a Spark job on YARN   The test environment is as follows    Number of data nodes  3 Data node machine spec    CPU  Core i7-4790    of cores  4    of threads  8  RAM  32GB  8GB x 4  HDD  8TB  2TB x 4   Network  1Gb Spark version  1 0 0 Hadoop version  2 4 0  Hortonworks HDP 2 1  Spark job flow  sc textFile -  filter -  map -  filter -  mapToPair -  reduceByKey -  map -  saveAsTextFile Input data   Type  single text file Size  165GB Number of lines  454 568 833  Output   Number of lines after second filter  310 640 717 Number of lines of the result file  99 848 268 Size of the result file  41GB    The job was run with following configurations    --master yarn-client --executor-memory 19G --executor-cores 7 --num-executors 3  executors per data node  use as much as cores  --master yarn-client --executor-memory 19G --executor-cores 4 --num-executors 3    of cores reduced  --master yarn-client --executor-memory 4G --executor-cores 2 --num-executors 12  less core  more executor    Elapsed times    50 min 15 sec 55 min 48 sec 31 min 23 sec   To my surprise   3  was much faster  I thought that  1  would be faster  since there would be less inter-executor communication when shuffling  Although   of cores of  1  is fewer than  3    of cores is not the key factor since 2  did perform well    Followings were added after pwilmot s answer    For the information  the performance monitor screen capture is as follows    Ganglia data node summary for  1  - job started at 04 37       Ganglia data node summary for  3  - job started at 19 47  Please ignore the graph before that time      The graph roughly divides into 2 sections    First  from start to reduceByKey  CPU intensive  no network activity Second  after reduceByKey  CPU lowers  network I O is done    As the graph shows   1  can use as much CPU power as it was given  So  it might not be the problem of the number of the threads   How to explain this result

User · Answer

From the excellent resources available at RStudio s Sparklyr package page   SPARK DEFINITIONS  It may be useful to provide some simple definitions for the Spark nomenclature  Node  A server Worker Node  A server that is part of the cluster and are available to run Spark jobs Master Node  The server that coordinates the Worker nodes  Executor  A sort of virtual machine inside a node  One Node can have multiple Executors  Driver Node  The Node that initiates the Spark session  Typically  this will be the server where sparklyr is located  Driver  Executor   The Driver Node will also show up in the Executor list

User · Answer

I haven t played with these settings myself so this is just speculation but if we think about this issue as normal cores and threads in a distributed system then in your cluster you can use up to 12 cores  4   3 machines  and 24 threads  8   3 machines    In your first two examples you are giving your job a fair number of cores  potential computation space  but the number of threads  jobs  to run on those cores is so limited that you aren t able to use much of the processing power allocated and thus the job is slower even though there is more computation resources allocated   you mention that your concern was in the shuffle step - while it is nice to limit the overhead in the shuffle step it is generally much more important to utilize the parallelization of the cluster   Think about the extreme case - a single threaded program with zero shuffle

User · Answer

Spark Dynamic allocation gives flexibility and allocates resources dynamically  In this number of min and max executors can be given  Also the number of executors that has to be launched at the starting of the application can also be given   Read below on the same       http   spark apache org docs latest configuration html dynamic-allocation

User · Answer

To hopefully make all of this a little more concrete  here   s a worked example of configuring a Spark app to use as much of the cluster as possible  Imagine a cluster with six nodes running NodeManagers  each equipped with 16 cores and 64GB of memory  The NodeManager capacities  yarn nodemanager resource memory-mb and yarn nodemanager resource cpu-vcores  should probably be set to 63   1024   64512  megabytes  and 15 respectively  We avoid allocating 100  of the resources to YARN containers because the node needs some resources to run the OS and Hadoop daemons  In this case  we leave a gigabyte and a core for these system processes  Cloudera Manager helps by accounting for these and configuring these YARN properties automatically  The likely first impulse would be to use --num-executors 6 --executor-cores 15 --executor-memory 63G  However  this is the wrong approach because  63GB   the executor memory overhead won   t fit within the 63GB capacity of the NodeManagers  The application master will take up a core on one of the nodes  meaning that there won   t be room for a 15-core executor on that node  15 cores per executor can lead to bad HDFS I O throughput  A better option would be to use --num-executors 17 --executor-cores 5 --executor-memory 19G  Why  This config results in three executors on all nodes except for the one with the AM  which will have two executors  --executor-memory was derived as  63 3 executors per node    21   21   0 07   1 47   21     1 47   19   The explanation was given in an article in Cloudera s blog  How-to  Tune Your Apache Spark Jobs  Part 2

User · Answer

I think one of the major reasons is locality  Your input file size is 165G  the file s related blocks certainly distributed over multiple DataNodes  more executors can avoid network copy   Try to set executor num equal blocks count  i think can be faster

User · Answer

Short answer  I think tgbaggio is right  You hit HDFS throughput limits on your executors    I think the answer here may be a little simpler than some of the recommendations here   The clue for me is in the cluster network graph  For run 1 the utilization is steady at  50 M bytes s   For run 3 the steady utilization is doubled  around 100 M bytes s   From the cloudera blog post shared by DzOrd  you can see this important quote      I   ve noticed that the HDFS client has trouble with tons of concurrent threads  A rough guess is that at most five tasks per executor can achieve full write throughput  so it   s good to keep the number of cores per executor below that number    So  let s do a few calculations see what performance we expect if that is true     Run 1  19 GB  7 cores  3 executors   3 executors x 7 threads   21 threads with 7 cores per executor  we expect limited IO to HDFS  maxes out at  5 cores  effective throughput    3 executors x 5 threads   15 threads   Run 3  4 GB  2 cores  12 executors   2 executors x 12 threads   24 threads 2 cores per executor  so hdfs throughput is ok effective throughput    12 executors x 2 threads   24 threads     If the job is 100  limited by concurrency  the number of threads    We would expect runtime to be perfectly inversely correlated with the number of threads   ratio num threads   nthread job1   nthread job3   15 24   0 625 inv ratio runtime   1  duration job1   duration job3    1  50 31    31 50   0 62   So ratio num threads    inv ratio runtime  and it looks like we are network limited   This same effect explains the difference between Run 1 and Run 2     Run 2  19 GB  4 cores  3 executors   3 executors x 4 threads   12 threads with 4 cores per executor  ok IO to HDFS effective throughput    3 executors x 4 threads   12 threads     Comparing the number of effective threads and the runtime   ratio num threads   nthread job2   nthread job1   12 15   0 8 inv ratio runtime   1  duration job2   duration job1    1  55 50    50 55   0 91   It s not as perfect as the last comparison  but we still see a similar drop in performance when we lose threads   Now for the last bit  why is it the case that we get better performance with more threads  esp  more threads than the number of CPUs   A good explanation of the difference between parallelism  what we get by dividing up data onto multiple CPUs  and concurrency  what we get when we use multiple threads to do work on a single CPU  is provided in this great post by Rob Pike  Concurrency is not parallelism   The short explanation is that if a Spark job is interacting with a file system or network the CPU spends a lot of time waiting on communication with those interfaces and not spending a lot of time actually  doing work    By giving those CPUs more than 1 task to work on at a time  they are spending less time waiting and more time working  and you see better performance

User · Answer

There is a small issue in the First two configurations i think  The concepts of threads and cores like follows  The concept of threading is if the cores are ideal then use that core to process the data  So the memory is not fully utilized in first two cases  If you want to bench mark this example choose the machines which has more than 10 cores on each machine  Then do the bench mark   But dont give more than 5 cores per executor there will be bottle neck on i o performance   So the best machines to do this bench marking might be data nodes which have 10 cores     Data node machine spec  CPU  Core i7-4790    of cores  10    of threads  20  RAM  32GB  8GB x 4  HDD  8TB  2TB x 4

User · Answer

As you run your spark app on top of HDFS  according to Sandy Ryza     I   ve noticed that the HDFS client has trouble with tons of concurrent   threads  A rough guess is that at most five tasks per executor can   achieve full write throughput  so it   s good to keep the number of   cores per executor below that number    So I believe that your first configuration is slower than third one is because of bad HDFS I O throughput

[hadoop] Apache Spark: The number of cores vs. the number of executors

Examples related to hadoop

Examples related to apache-spark

Examples related to yarn