[apache-spark] Is it possible to get the current spark context settings in PySpark?

I'm trying to get the path to spark.worker.dir for the current SparkContext.

If I explicitly set it as a config param, I can read it back out of SparkConf, but is there any way to access the complete config (including all defaults) using PySpark?

This question is related to: apache-spark, config, pyspark



Just for the record, the analogous Java version:

Tuple2<String, String>[] entries = sparkConf.getAll();
for (Tuple2<String, String> entry : entries) {
    System.out.println(entry);
}

For Spark 2+ you can also use the following in Scala:

spark.conf.getAll // spark is the SparkSession

Simply running

sc.getConf().getAll()

should give you a list with all settings.
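
For example, a minimal sketch that lists everything and then reads the property from the question (the "<not set>" fallback is only illustrative, since properties that were never set do not appear in getAll()):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Print every explicitly configured setting.
for key, value in sc.getConf().getAll():
    print(key, "=", value)

# Look up a single property; "<not set>" is just an illustrative fallback.
print(sc.getConf().get("spark.worker.dir", "<not set>"))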


Not sure if you can get all the default settings easily, but specifically for the worker dir, it's quite straightforward:

from pyspark import SparkFiles
print(SparkFiles.getRootDirectory())

You can use the following (Scala):

sc.sparkContext.getConf.getAll

For example, I often have the following at the top of my Spark programs:

logger.info(sc.sparkContext.getConf.getAll.mkString("\n"))

Suppose I want to increase the driver memory at runtime using a Spark Session:

s2 = SparkSession.builder.config("spark.driver.memory", "29g").getOrCreate()

Now I want to view the updated settings:

s2.conf.get("spark.driver.memory")

To get all the settings, you can make use of spark.sparkContext._conf.getAll()
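
Putting those pieces together, a minimal PySpark sketch (the 29g value is simply the example above):

from pyspark.sql import SparkSession

# Request more driver memory when building/reusing the session; note that
# driver memory generally only takes effect if the driver JVM is not already running.
s2 = SparkSession.builder.config("spark.driver.memory", "29g").getOrCreate()

# Read a single value back through the runtime configuration ...
print(s2.conf.get("spark.driver.memory"))

# ... or list everything via the underlying SparkConf.
for key, value in s2.sparkContext._conf.getAll():
    print(key, "=", value)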


Hope this helps


Unfortunately, no, the Spark platform as of version 2.3.1 does not provide any way to programmatically access the value of every property at run time. It provides several methods to access the values of properties that were explicitly set through a configuration file (like spark-defaults.conf), set through the SparkConf object when you created the session, or set through the command line when you submitted the job, but none of these methods will show the default value for a property that was not explicitly set. For completeness, the best options are:

  • The Spark application’s web UI, usually at http://<driver>:4040, has an “Environment” tab with a property value table.
  • The SparkContext keeps a hidden reference to its configuration in PySpark, and the configuration provides a getAll method: spark.sparkContext._conf.getAll().
  • Spark SQL provides the SET command that will return a table of property values: spark.sql("SET").toPandas(). You can also use SET -v to include a column with the property’s description.

(These three methods all return the same data on my cluster.)
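
A short PySpark sketch of the two programmatic options (spark here is the SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Option 2: the SparkConf kept by the SparkContext (explicitly set values only).
for key, value in spark.sparkContext._conf.getAll():
    print(key, "=", value)

# Option 3: Spark SQL's SET command; use "SET -v" for an extra description column.
props = spark.sql("SET").toPandas()
print(props)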


For a complete overview of your Spark environment and configuration, I found the following code snippets useful:

SparkContext:

for item in sorted(sc._conf.getAll()): print(item)

Hadoop Configuration:

hadoopConf = {}
iterator = sc._jsc.hadoopConfiguration().iterator()
while iterator.hasNext():
    prop = iterator.next()
    hadoopConf[prop.getKey()] = prop.getValue()
for item in sorted(hadoopConf.items()): print(item)

Environment variables:

import os
for item in sorted(os.environ.items()): print(item)

Spark 2.1+

spark.sparkContext.getConf().getAll(), where spark is your SparkSession (gives you a list of (key, value) tuples with all configured settings)
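
If you prefer a dict, a small sketch that wraps that list of tuples (the spark.app.name lookup is just an example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# getAll() returns a list of (key, value) tuples; dict() makes lookups easier.
conf_dict = dict(spark.sparkContext.getConf().getAll())
print(conf_dict.get("spark.app.name"))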


Updating the configuration in Spark 2.3.1

To change the default Spark configuration you can follow these steps:

Import the required classes

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

Get the default configurations

spark.sparkContext._conf.getAll()

Update the default configurations

conf = spark.sparkContext._conf.setAll([
    ('spark.executor.memory', '4g'),
    ('spark.app.name', 'Spark Updated Conf'),
    ('spark.executor.cores', '4'),
    ('spark.cores.max', '4'),
    ('spark.driver.memory', '4g'),
])

Stop the current Spark Session

spark.sparkContext.stop()

Create a Spark Session

spark = SparkSession.builder.config(conf=conf).getOrCreate()
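
Put together, the whole sequence looks roughly like this (same values as in the steps above), with a final line to confirm the change:

from pyspark.sql import SparkSession

# Start from an existing (or new) session.
spark = SparkSession.builder.getOrCreate()

# Update the defaults with the values from the steps above.
conf = spark.sparkContext._conf.setAll([
    ('spark.executor.memory', '4g'),
    ('spark.app.name', 'Spark Updated Conf'),
    ('spark.executor.cores', '4'),
    ('spark.cores.max', '4'),
    ('spark.driver.memory', '4g'),
])

# Stop the current context and rebuild the session with the updated conf.
spark.sparkContext.stop()
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# Confirm that the new value is in place.
print(spark.sparkContext.getConf().get('spark.executor.memory'))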

I would suggest you try the method below in order to get the current Spark context settings: call getAll() on the SparkConf held by the SparkContext, which in PySpark is accessible as sc._conf:

sc._conf.getAll()

Get the default configurations specifically for Spark 2.1+

spark.sparkContext.getConf().getAll() 

Stop the current Spark Session

spark.sparkContext.stop()

Create a Spark Session

spark = SparkSession.builder.config(conf=conf).getOrCreate()

If you want to see the configuration in Databricks, use the command below:

spark.sparkContext._conf.getAll()

Spark 1.6+ (Scala):

sc.getConf.getAll.foreach(println)
