Add jars to a Spark Job - spark-submit

Question

True ... it has been discussed quite a lot.

However, there is a lot of ambiguity and some of the answers provided ... including duplicating jar references in the jars/executor/driver configuration or options.

The ambiguous and/or omitted details

The following ambiguous, unclear, and/or omitted details should be clarified for each option:

- How the ClassPath is affected
    - Driver
    - Executor (for tasks running)
    - Both
    - Not at all
- Separation character: comma, colon, semicolon
- Whether provided files are automatically distributed
    - for the tasks (to each executor)
    - for the remote Driver (if run in cluster mode)
- Type of URI accepted: local file, hdfs, http, etc.
- If copied into a common location, where that location is (hdfs, local?)

The options which it affects:

1. --jars
2. SparkContext.addJar(...) method
3. SparkContext.addFile(...) method
4. --conf spark.driver.extraClassPath=... or --driver-class-path ...
5. --conf spark.driver.extraLibraryPath=... or --driver-library-path ...
6. --conf spark.executor.extraClassPath=...
7. --conf spark.executor.extraLibraryPath=...
8. Not to forget, the last parameter of spark-submit is also a .jar file.

I am aware of where I can find the main Spark documentation, specifically about how to submit, the options available, and also the JavaDoc. However, that still left quite some holes for me, although it partially answered the question too.

I hope that it is not all that complex, and that someone can give me a clear and concise answer.

If I were to guess from the documentation, it seems that --jars and the SparkContext addJar and addFile methods are the ones that will automatically distribute files, while the other options merely modify the ClassPath.

Would it be safe to assume that, for simplicity, I can add additional application jar files using the 3 main options at the same time?

    spark-submit --jars additional1.jar,additional2.jar \
      --driver-library-path additional1.jar:additional2.jar \
      --conf spark.executor.extraLibraryPath=additional1.jar:additional2.jar \
      --class MyClass main-application.jar

I found a nice article in an answer to another posting. However, I learned nothing new from it. The poster does make a good remark on the difference between a Local driver (yarn-client) and a Remote Driver (yarn-cluster). Definitely important to keep in mind.

User · Answer

When using spark-submit with --master yarn-cluster, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. URLs supplied after --jars must be separated by commas. That list is included in the driver and executor classpaths.

Example:

    spark-submit --master yarn-cluster --jars ../lib/misc.jar,../lib/test.jar --class MainClass MainApp.jar

https://spark.apache.org/docs/latest/submitting-applications.html
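
The comma rule above is easy to break by hand: a space after a comma makes the shell split the list into two separate arguments. Below is a minimal sketch of a helper that assembles the single value passed to --jars. The function name jars_arg is hypothetical and not part of Spark; it only does string handling and does not check that the files exist.

```python
def jars_arg(paths):
    """Join jar paths into the single comma-separated value that --jars expects.

    Hypothetical helper: surrounding whitespace is stripped and empty
    entries are dropped, so the result never contains ", " sequences.
    """
    cleaned = [p.strip() for p in paths if p.strip()]
    for p in cleaned:
        if "," in p:
            # A comma inside a path would be read as a list separator.
            raise ValueError(f"jar path must not contain a comma: {p!r}")
    return ",".join(cleaned)


if __name__ == "__main__":
    # The two jars from the example above, with a stray leading space.
    print(jars_arg(["../lib/misc.jar", " ../lib/test.jar"]))
```

The printed value can be pasted after --jars as one shell word.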

User · Answer

Other configurable Spark options relating to jars and classpath, in case of yarn as deploy mode, are as follows. From the Spark documentation:

spark.yarn.jars
    List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.

spark.yarn.archive
    An archive containing needed Spark jars for distribution to the YARN cache. If set, this configuration replaces spark.yarn.jars and the archive is used in all the application's containers. The archive should contain jar files in its root directory. Like with the previous option, the archive can also be hosted on HDFS to speed up file distribution.

Users can configure this parameter to specify their jars, which in turn get included in the Spark driver's classpath.

User · Answer

There is a restriction on using --jars: if you want to specify a directory as the location of jar/xml files, it doesn't allow directory expansions. This means you need to specify the absolute path of each jar.

If you specify --driver-class-path and you are executing in yarn cluster mode, then the driver classpath doesn't get updated. We can verify whether the classpath is updated or not in the Spark UI or the Spark history server, under the Environment tab.

The option which worked for me to pass jars which contain directory expansions, and which worked in yarn cluster mode, was the --conf option. It's better to pass the driver and executor classpaths as --conf, which adds them to the Spark session object itself, and those paths are reflected in the Spark configuration. But please make sure to put the jars on the same path across the cluster.

    spark-submit \
      --master yarn \
      --queue spark_queue \
      --deploy-mode cluster \
      --num-executors 12 \
      --executor-memory 4g \
      --driver-memory 8g \
      --executor-cores 4 \
      --conf spark.ui.enabled=False \
      --conf spark.driver.extraClassPath=/usr/hdp/current/hbase-master/lib/hbase-server.jar:/usr/hdp/current/hbase-master/lib/hbase-common.jar:/usr/hdp/current/hbase-master/lib/hbase-client.jar:/usr/hdp/current/hbase-master/lib/zookeeper.jar:/usr/hdp/current/hbase-master/lib/hbase-protocol.jar:/usr/hdp/current/spark2-thriftserver/examples/jars/scopt_2.11-3.3.0.jar:/usr/hdp/current/spark2-thriftserver/examples/jars/spark-examples_2.10-1.1.0.jar:/etc/hbase/conf \
      --conf spark.hadoop.mapred.output.dir=/tmp \
      --conf spark.executor.extraClassPath=/usr/hdp/current/hbase-master/lib/hbase-server.jar:/usr/hdp/current/hbase-master/lib/hbase-common.jar:/usr/hdp/current/hbase-master/lib/hbase-client.jar:/usr/hdp/current/hbase-master/lib/zookeeper.jar:/usr/hdp/current/hbase-master/lib/hbase-protocol.jar:/usr/hdp/current/spark2-thriftserver/examples/jars/scopt_2.11-3.3.0.jar:/usr/hdp/current/spark2-thriftserver/examples/jars/spark-examples_2.10-1.1.0.jar:/etc/hbase/conf \
      --conf spark.hadoop.mapreduce.output.fileoutputformat.outputdir=/tmp
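
Since --jars does not expand directories, one workaround is to expand the directory yourself before calling spark-submit. The following is a rough sketch under that assumption; expand_jar_dir and jars_value are hypothetical helper names, and the code only lists files on the machine where it runs (it knows nothing about HDFS or the cluster nodes).

```python
import glob
import os


def expand_jar_dir(directory):
    """Return sorted absolute paths of every .jar file directly inside directory."""
    pattern = os.path.join(os.path.abspath(directory), "*.jar")
    return sorted(glob.glob(pattern))


def jars_value(directory):
    """Build the comma-separated string to pass after --jars."""
    return ",".join(expand_jar_dir(directory))


if __name__ == "__main__":
    # Lists the jars in the current directory, if any.
    print(jars_value("."))
```

Non-jar files such as XML configuration are skipped by the *.jar pattern, so they would still have to be passed separately.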

User · Answer

ClassPath:

ClassPath is affected depending on what you provide. There are a couple of ways to set something on the classpath:

- spark.driver.extraClassPath, or its alias --driver-class-path, to set extra classpaths on the node running the driver.
- spark.executor.extraClassPath to set extra classpaths on the Worker nodes.

If you want a certain JAR to take effect on both the Master and the Worker, you have to specify it separately in BOTH flags.

Separation character:

Following the same rules as the JVM:

- Linux: a colon (:)
    e.g. --conf "spark.driver.extraClassPath=/opt/prog/hadoop-aws-2.7.1.jar:/opt/prog/aws-java-sdk-1.10.50.jar"
- Windows: a semicolon (;)
    e.g. --conf "spark.driver.extraClassPath=/opt/prog/hadoop-aws-2.7.1.jar;/opt/prog/aws-java-sdk-1.10.50.jar"

File distribution:

This depends on the mode which you're running your job under:

1. Client mode - Spark fires up a Netty HTTP server which distributes the files on start up to each of the worker nodes. You can see that when you start your Spark job:

    16/05/08 17:29:12 INFO HttpFileServer: HTTP File server directory is /tmp/spark-48911afa-db63-4ffc-a298-015e8b96bc55/httpd-84ae312b-5863-4f4c-a1ea-537bfca2bc2b
    16/05/08 17:29:12 INFO HttpServer: Starting HTTP Server
    16/05/08 17:29:12 INFO Utils: Successfully started service 'HTTP file server' on port 58922.
    16/05/08 17:29:12 INFO SparkContext: Added JAR /opt/foo.jar at http://***:58922/jars/com.mycode.jar with timestamp 1462728552732
    16/05/08 17:29:12 INFO SparkContext: Added JAR /opt/aws-java-sdk-1.10.50.jar at http://***:58922/jars/aws-java-sdk-1.10.50.jar with timestamp 1462728552767

2. Cluster mode - In cluster mode Spark selects a leader Worker node to execute the Driver process on. This means the job isn't running directly from the Master node. Here, Spark will not set up an HTTP server. You have to manually make your JARs available to all the worker nodes via HDFS/S3/other sources which are available to all nodes.

Accepted URI's for files

In "Submitting Applications", the Spark documentation does a good job of explaining the accepted prefixes for files:

> When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. Spark uses the following URL scheme to allow different strategies for disseminating jars:
>
> - file: - Absolute paths and file:/ URIs are served by the driver's HTTP file server, and every executor pulls the file from the driver HTTP server.
> - hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected
> - local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
>
> Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes.

As noted, JARs are copied to the working directory for each Worker node. Where exactly is that? It is usually under /var/run/spark/work; you'll see them like this:

    drwxr-xr-x    3 spark spark   4096 May 15 06:16 app-20160515061614-0027
    drwxr-xr-x    3 spark spark   4096 May 15 07:04 app-20160515070442-0028
    drwxr-xr-x    3 spark spark   4096 May 15 07:18 app-20160515071819-0029
    drwxr-xr-x    3 spark spark   4096 May 15 07:38 app-20160515073852-0030
    drwxr-xr-x    3 spark spark   4096 May 15 08:13 app-20160515081350-0031
    drwxr-xr-x    3 spark spark   4096 May 18 17:20 app-20160518172020-0032
    drwxr-xr-x    3 spark spark   4096 May 18 17:20 app-20160518172045-0033

And when you look inside, you'll see all the JARs you deployed along:

    cd /var/run/spark/work/app-20160508173423-0014/1
    ll
    total 89988
    -rwxr-xr-x 1 spark spark   801117 May  8 17:34 awscala_2.10-0.5.5.jar
    -rwxr-xr-x 1 spark spark 29558264 May  8 17:34 aws-java-sdk-1.10.50.jar
    -rwxr-xr-x 1 spark spark 59466931 May  8 17:34 com.mycode.code.jar
    -rwxr-xr-x 1 spark spark  2308517 May  8 17:34 guava-19.0.jar
    -rw-r--r-- 1 spark spark      457 May  8 17:34 stderr
    -rw-r--r-- 1 spark spark        0 May  8 17:34 stdout

Affected options:

The most important thing to understand is priority. If you pass any property via code, it will take precedence over any option you specify via spark-submit. This is mentioned in the Spark documentation:

> Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.

So make sure you set those values in the proper places, so you won't be surprised when one takes priority over the other.

Let's analyze each option in question:

- --jars vs SparkContext.addJar: These are identical; only one is set through spark-submit and one via code. Choose the one which suits you better. One important thing to note is that using either of these options does not add the JAR to your driver/executor classpath; you'll need to explicitly add them using the extraClassPath config on both.
- SparkContext.addJar vs SparkContext.addFile: Use the former when you have a dependency that needs to be used with your code. Use the latter when you simply want to pass an arbitrary file around to your worker nodes, which isn't a run-time dependency in your code.
- --conf spark.driver.extraClassPath=... or --driver-class-path: These are aliases; it doesn't matter which one you choose.
- --conf spark.driver.extraLibraryPath=... or --driver-library-path ...: Same as above, aliases.
- --conf spark.executor.extraClassPath=...: Use this when you have a dependency which can't be included in an uber JAR (for example, because there are compile-time conflicts between library versions) and which you need to load at runtime.
- --conf spark.executor.extraLibraryPath=...: This is passed as the java.library.path option for the JVM. Use this when you need a library path visible to the JVM.

> Would it be safe to assume that for simplicity, I can add additional application jar files using the 3 main options at the same time:

You can safely assume this only for Client mode, not Cluster mode, as I've previously said. Also, the example you gave has some redundant arguments. For example, passing JARs to --driver-library-path is useless; you need to pass them to extraClassPath if you want them to be on your classpath. Ultimately, what you want to do when you deploy external JARs on both the driver and the worker is:

    spark-submit --jars additional1.jar,additional2.jar \
      --driver-class-path additional1.jar:additional2.jar \
      --conf spark.executor.extraClassPath=additional1.jar:additional2.jar \
      --class MyClass main-application.jar
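
Two rules from this answer can be modelled in a few lines: the JVM-style classpath separator, and the precedence order quoted from the Spark documentation (SparkConf in code, then spark-submit flags, then spark-defaults.conf). The sketch below is plain Python and not Spark's own code; classpath_value and effective_conf are hypothetical names, and the windows flag refers to the platform of the node where the JVM runs, which may differ from the machine you submit from.

```python
def classpath_value(jars, windows=False):
    """Join jar paths with the JVM classpath separator (':' or ';' on Windows)."""
    separator = ";" if windows else ":"
    return separator.join(jars)


def effective_conf(defaults_file, submit_flags, spark_conf):
    """Merge three property sources; each later update overrides the earlier ones."""
    merged = {}
    merged.update(defaults_file)  # lowest priority: spark-defaults.conf
    merged.update(submit_flags)   # middle: flags passed to spark-submit / spark-shell
    merged.update(spark_conf)     # highest: properties set directly on SparkConf
    return merged


if __name__ == "__main__":
    cp = classpath_value(["additional1.jar", "additional2.jar"])
    conf = effective_conf(
        {"spark.executor.extraClassPath": "from-defaults.jar"},
        {"spark.executor.extraClassPath": cp},
        {},  # nothing set in code, so the spark-submit flag wins
    )
    print(conf["spark.executor.extraClassPath"])
```

If the third dictionary also set spark.executor.extraClassPath, that value would replace the one from the flag, which is the surprise the answer warns about.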

User · Answer

Another approach in Spark 2.1.0 is to use --conf spark.driver.userClassPathFirst=true during spark-submit, which changes the priority of dependency loading, and thus the behavior of the Spark job, by giving priority to the jars the user adds to the classpath with the --jars option.

User · Answer

When submitting Spark jobs with the spark-submit utility, there is a --jars option. Using this option, we can pass jar files to Spark applications.
