True ... it has been discussed quite a lot.
However there is a lot of ambiguity and some of the answers provided ... including duplicating jar references in the jars/executor/driver configuration or options.
Following ambiguity, unclear, and/or omitted details should be clarified for each option:
--jars
SparkContext.addJar(...)
methodSparkContext.addFile(...)
method --conf spark.driver.extraClassPath=...
or --driver-class-path ...
--conf spark.driver.extraLibraryPath=...
, or --driver-library-path ...
--conf spark.executor.extraClassPath=...
--conf spark.executor.extraLibraryPath=...
I am aware where I can find the main spark documentation, and specifically about how to submit, the options available, and also the JavaDoc. However that left for me still quite some holes, although it answered partially too.
I hope that it is not all that complex, and that someone can give me a clear and concise answer.
If I were to guess from documentation, it seems that --jars
, and the SparkContext
addJar
and addFile
methods are the ones that will automatically distribute files, while the other options merely modify the ClassPath.
Would it be safe to assume that for simplicity, I can add additional application jar files using the 3 main options at the same time:
spark-submit --jar additional1.jar,additional2.jar \
--driver-library-path additional1.jar:additional2.jar \
--conf spark.executor.extraLibraryPath=additional1.jar:additional2.jar \
--class MyClass main-application.jar
Found a nice article on an answer to another posting. However nothing new learned. The poster does make a good remark on the difference between Local driver (yarn-client) and Remote Driver (yarn-cluster). Definitely important to keep in mind.
This question is related to
java
scala
apache-spark
jar
spark-submit
Other configurable Spark option relating to jars and classpath, in case of yarn
as deploy mode are as follows
From the spark documentation,
spark.yarn.jars
List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.
spark.yarn.archive
An archive containing needed Spark jars for distribution to the YARN cache. If set, this configuration replaces spark.yarn.jars and the archive is used in all the application's containers. The archive should contain jar files in its root directory. Like with the previous option, the archive can also be hosted on HDFS to speed up file distribution.
Users can configure this parameter to specify their jars, which inturn gets included in Spark driver's classpath.
Another approach in spark 2.1.0
is to use --conf spark.driver.userClassPathFirst=true
during spark-submit which changes the priority of dependency load, and thus the behavior of the spark-job, by giving priority to the jars the user is adding to the class-path with the --jars
option.
When using spark-submit with --master yarn-cluster, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. URLs supplied after --jars must be separated by commas. That list is included in the driver and executor classpaths
Example :
spark-submit --master yarn-cluster --jars ../lib/misc.jar, ../lib/test.jar --class MainClass MainApp.jar
https://spark.apache.org/docs/latest/submitting-applications.html
While we submit spark jobs using spark-submit utility, there is an option --jars
. Using this option, we can pass jar file to spark applications.
There is restriction on using --jars
: if you want to specify a directory for location of jar/xml
file, it doesn't allow directory expansions. This means if you need to specify absolute path for each jar.
If you specify --driver-class-path
and you are executing in yarn cluster mode, then driver class doesn't get updated. We can verify if class path is updated or not under spark ui or spark history server under tab environment.
Option which worked for me to pass jars which contain directory expansions and which worked in yarn cluster mode was --conf
option. It's better to pass driver and executor class paths as --conf
, which adds them to spark session object itself and those paths are reflected on Spark Configuration. But Please make sure to put jars on the same path across the cluster.
spark-submit \
--master yarn \
--queue spark_queue \
--deploy-mode cluster \
--num-executors 12 \
--executor-memory 4g \
--driver-memory 8g \
--executor-cores 4 \
--conf spark.ui.enabled=False \
--conf spark.driver.extraClassPath=/usr/hdp/current/hbase-master/lib/hbase-server.jar:/usr/hdp/current/hbase-master/lib/hbase-common.jar:/usr/hdp/current/hbase-master/lib/hbase-client.jar:/usr/hdp/current/hbase-master/lib/zookeeper.jar:/usr/hdp/current/hbase-master/lib/hbase-protocol.jar:/usr/hdp/current/spark2-thriftserver/examples/jars/scopt_2.11-3.3.0.jar:/usr/hdp/current/spark2-thriftserver/examples/jars/spark-examples_2.10-1.1.0.jar:/etc/hbase/conf \
--conf spark.hadoop.mapred.output.dir=/tmp \
--conf spark.executor.extraClassPath=/usr/hdp/current/hbase-master/lib/hbase-server.jar:/usr/hdp/current/hbase-master/lib/hbase-common.jar:/usr/hdp/current/hbase-master/lib/hbase-client.jar:/usr/hdp/current/hbase-master/lib/zookeeper.jar:/usr/hdp/current/hbase-master/lib/hbase-protocol.jar:/usr/hdp/current/spark2-thriftserver/examples/jars/scopt_2.11-3.3.0.jar:/usr/hdp/current/spark2-thriftserver/examples/jars/spark-examples_2.10-1.1.0.jar:/etc/hbase/conf \
--conf spark.hadoop.mapreduce.output.fileoutputformat.outputdir=/tmp
Source: Stackoverflow.com