Add jars to a Spark Job - spark-submit

True ... it has been discussed quite a lot.

However, there is a lot of ambiguity, and some of the answers provided ... including duplicating jar references in the jars/executor/driver configuration or options.

The ambiguous and/or omitted details

The following ambiguous, unclear, and/or omitted details should be clarified for each option:

  • How ClassPath is affected
    • Driver
    • Executor (for running tasks)
    • Both
    • Not at all
  • Separation character: comma, colon, semicolon
  • Whether provided files are automatically distributed
    • for the tasks (to each executor)
    • for the remote driver (if run in cluster mode)
  • Type of URI accepted: local file, hdfs, http, etc.
  • If copied into a common location, where that location is (hdfs, local?)

The options that this affects:

  1. --jars
  2. SparkContext.addJar(...) method
  3. SparkContext.addFile(...) method
  4. --conf spark.driver.extraClassPath=... or --driver-class-path ...
  5. --conf spark.driver.extraLibraryPath=..., or --driver-library-path ...
  6. --conf spark.executor.extraClassPath=...
  7. --conf spark.executor.extraLibraryPath=...
  8. Not to forget, the last parameter of spark-submit is also a .jar file.

I am aware of where to find the main Spark documentation, specifically about how to submit, the options available, and also the JavaDoc. However, that still left quite a few holes for me, although it did answer things partially.

I hope that it is not all that complex, and that someone can give me a clear and concise answer.

If I were to guess from the documentation, it seems that --jars and the SparkContext.addJar and SparkContext.addFile methods are the ones that automatically distribute files, while the other options merely modify the ClassPath.

Would it be safe to assume that, for simplicity, I can add additional application jar files using the three main options at the same time:

spark-submit --jars additional1.jar,additional2.jar \
  --driver-library-path additional1.jar:additional2.jar \
  --conf spark.executor.extraLibraryPath=additional1.jar:additional2.jar \
  --class MyClass main-application.jar

I found a nice article in an answer to another posting; however, I learned nothing new. The poster does make a good remark on the difference between a local driver (yarn-client) and a remote driver (yarn-cluster). Definitely important to keep in mind.

This question is related to: java, scala, apache-spark, jar, spark-submit

The answers are below.


Other configurable Spark options relating to jars and the classpath, in the case of yarn as the deploy mode, are as follows. From the Spark documentation:

spark.yarn.jars

List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.
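
For example, a minimal sketch (assuming the Spark jars have already been uploaded to a world-readable HDFS directory; the hdfs:///spark/jars/ path is hypothetical):

spark-submit \
  --master yarn \
  --conf spark.yarn.jars="hdfs:///spark/jars/*.jar" \
  --class MyClass main-application.jar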

spark.yarn.archive

An archive containing needed Spark jars for distribution to the YARN cache. If set, this configuration replaces spark.yarn.jars and the archive is used in all the application's containers. The archive should contain jar files in its root directory. Like with the previous option, the archive can also be hosted on HDFS to speed up file distribution.
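
A sketch of how this might be set up (the archive name and HDFS path are hypothetical; note that the jars must sit at the root of the archive):

# Package the locally installed Spark jars into an archive, jars at the archive root
cd $SPARK_HOME/jars && zip -q ../spark-libs.zip *.jar
# Upload it once so YARN can cache it across applications
hdfs dfs -mkdir -p /spark
hdfs dfs -put $SPARK_HOME/spark-libs.zip /spark/
# Point applications at the cached archive
spark-submit \
  --master yarn \
  --conf spark.yarn.archive=hdfs:///spark/spark-libs.zip \
  --class MyClass main-application.jar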

Users can configure these parameters to specify their jars, which in turn get included in the Spark driver's classpath.


Another approach, in Spark 2.1.0, is to use --conf spark.driver.userClassPathFirst=true during spark-submit, which changes the priority of dependency loading, and thus the behavior of the Spark job, by giving priority to the jars the user is adding to the classpath with the --jars option.
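
A minimal sketch of such an invocation (the dependency jar name is hypothetical; spark.executor.userClassPathFirst is the analogous setting for the executor side):

spark-submit \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --jars my-newer-dependency.jar \
  --class MyClass main-application.jar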


When using spark-submit with --master yarn-cluster, the application jar, along with any jars included with the --jars option, will be automatically transferred to the cluster. URLs supplied after --jars must be separated by commas. That list is included in the driver and executor classpaths.

Example:

spark-submit --master yarn-cluster --jars ../lib/misc.jar,../lib/test.jar --class MainClass MainApp.jar

https://spark.apache.org/docs/latest/submitting-applications.html


When we submit Spark jobs using the spark-submit utility, there is an option, --jars. Using this option, we can pass jar files to Spark applications.
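
For example (the paths and class name here are illustrative):

spark-submit --master yarn --jars /path/to/dep1.jar,/path/to/dep2.jar --class MyClass main-application.jar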


There is a restriction on using --jars: if you want to specify a directory as the location of the jar/xml files, it doesn't allow directory expansion. This means you need to specify the absolute path for each jar; a workaround is sketched below.
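
One common workaround (a sketch, assuming the dependency jars sit in a single local directory and no path contains spaces) is to let the shell expand the directory and join the result with commas:

# Build a comma-separated list from a directory of jars
JARS=$(echo /path/to/lib/*.jar | tr ' ' ',')
spark-submit --jars "$JARS" --class MyClass main-application.jar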

If you specify --driver-class-path and you are executing in yarn cluster mode, then the driver's classpath doesn't get updated. You can verify whether the classpath was updated in the Spark UI or the Spark History Server, under the Environment tab.

The option that worked for me for passing jars whose locations involve directory expansion, and that worked in yarn cluster mode, was the --conf option. It's better to pass the driver and executor classpaths as --conf entries, which adds them to the Spark session object itself, and those paths are reflected in the Spark configuration. But please make sure to put the jars at the same path on every node across the cluster.

spark-submit \
  --master yarn \
  --queue spark_queue \
  --deploy-mode cluster    \
  --num-executors 12 \
  --executor-memory 4g \
  --driver-memory 8g \
  --executor-cores 4 \
  --conf spark.ui.enabled=false \
  --conf spark.driver.extraClassPath=/usr/hdp/current/hbase-master/lib/hbase-server.jar:/usr/hdp/current/hbase-master/lib/hbase-common.jar:/usr/hdp/current/hbase-master/lib/hbase-client.jar:/usr/hdp/current/hbase-master/lib/zookeeper.jar:/usr/hdp/current/hbase-master/lib/hbase-protocol.jar:/usr/hdp/current/spark2-thriftserver/examples/jars/scopt_2.11-3.3.0.jar:/usr/hdp/current/spark2-thriftserver/examples/jars/spark-examples_2.10-1.1.0.jar:/etc/hbase/conf \
  --conf spark.hadoop.mapred.output.dir=/tmp \
  --conf spark.executor.extraClassPath=/usr/hdp/current/hbase-master/lib/hbase-server.jar:/usr/hdp/current/hbase-master/lib/hbase-common.jar:/usr/hdp/current/hbase-master/lib/hbase-client.jar:/usr/hdp/current/hbase-master/lib/zookeeper.jar:/usr/hdp/current/hbase-master/lib/hbase-protocol.jar:/usr/hdp/current/spark2-thriftserver/examples/jars/scopt_2.11-3.3.0.jar:/usr/hdp/current/spark2-thriftserver/examples/jars/spark-examples_2.10-1.1.0.jar:/etc/hbase/conf \
  --conf spark.hadoop.mapreduce.output.fileoutputformat.outputdir=/tmp
