I'm using Spark 1.4.0-rc2 so I can use Python 3 with Spark. If I add export PYSPARK_PYTHON=python3 to my .bashrc file, I can run Spark interactively with Python 3. However, if I want to run a standalone program in local mode, I get an error:
Exception: Python in worker has different version 3.4 than that in driver 2.7, PySpark cannot run with different minor versions
How can I specify the version of Python for the driver? Setting export PYSPARK_DRIVER_PYTHON=python3 didn't work.
This question relates to the tags apache-spark and pyspark, and to "Exception: Python in worker has different version 2.6 than that in driver 2.7, PySpark cannot run with different minor versions".
Edit this file: /opt/cloudera/parcels/cdh5.5.4.p0.9/lib/spark/conf/spark-env.sh
Add these lines:
export PYSPARK_PYTHON=/usr/bin/python
export PYSPARK_DRIVER_PYTHON=python
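After saving the file, a new session should pick up the change; as a quick sanity check (the interpreter path is the one from this answer, and the pyspark shell's startup banner shows the Python version in use):
/usr/bin/python --version
pyspark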
I had the same problem, just forgot to activate my virtual environment. For anyone out there who also had a mental blank.
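For reference, a minimal sketch of that fix, assuming a virtualenv under ./venv (the path is hypothetical):
source ./venv/bin/activate   # activate the environment first
pyspark                      # now runs with the venv's Python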
If you're running Spark in a larger organization and are unable to update the /spark-env.sh file, exporting the environment variables may not work.
You can add the specific Spark settings through the --conf option when submitting the job at run time.
pyspark --master yarn --[other settings] \
    --conf "spark.pyspark.python=/your/python/loc/bin/python" \
    --conf "spark.pyspark.driver.python=/your/python/loc/bin/python"
Run:
ls -l /usr/local/bin/python*
The output lists the installed python binaries and symlinks, including the python3 symlink. To set it as the default python symlink, run the following:
ln -s -f /usr/local/bin/python3 /usr/local/bin/python
then reload your shell.
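To confirm the change took effect (readlink shows what the symlink now points to):
readlink /usr/local/bin/python   # should print /usr/local/bin/python3
python --version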
In case you only want to change the python version for current task, you can use following pyspark start command:
PYSPARK_DRIVER_PYTHON=/home/user1/anaconda2/bin/python PYSPARK_PYTHON=/usr/local/anaconda2/bin/python pyspark --master ..
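The same one-off override works for a batch job; a sketch, with my_job.py as a placeholder:
PYSPARK_DRIVER_PYTHON=/home/user1/anaconda2/bin/python \
PYSPARK_PYTHON=/usr/local/anaconda2/bin/python \
spark-submit --master yarn my_job.py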
I got the same issue on standalone Spark on Windows. My version of the fix is like this: I had my environment variables set as below:
PYSPARK_SUBMIT_ARGS="pyspark-shell"
PYSPARK_DRIVER_PYTHON=jupyter
PYSPARK_DRIVER_PYTHON_OPTS='notebook' pyspark
With this setting I executed an Action on pyspark and got the following exception:
Python in worker has different version 3.6 than that in driver 3.5, PySpark cannot run with different minor versions.
Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
To check which Python version my Spark worker is using, run the following in the command prompt:
python --version
Python 3.6.3
This showed me Python 3.6.3, so clearly my Spark worker is using the system Python, which is v3.6.3.
Now, as I set my Spark driver to run Jupyter by setting PYSPARK_DRIVER_PYTHON=jupyter, I need to check the Python version Jupyter is using. To do this check, open the Anaconda Prompt and run:
python --version
Python 3.5.X :: Anaconda, Inc.
So the Python that Jupyter is using is v3.5.x. You can also check this version in any notebook (Help -> About).
Now I need to update the Jupyter Python to version v3.6.3. To do that, open up the Anaconda Prompt and run:
conda search python
This will give you a list of available python versions in Anaconda. Install your desired one with
conda install python=3.6.3
Now that both Python installations are the same version, 3.6.3, Spark should not complain, and indeed it didn't when I ran an action on the Spark driver. The exception is gone. Happy coding ...
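Once the two interpreters match, a quick way to double-check from both the regular command prompt and the Anaconda Prompt:
python -c "import sys; print(sys.version)"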
I just faced the same issue, and these are the steps I followed to specify the Python version. I wanted to run my PySpark jobs with Python 2.7 instead of 2.6.
Go to the folder where $SPARK_HOME is pointing to (in my case /home/cloudera/spark-2.1.0-bin-hadoop2.7/).
Under the folder conf, there is a file called spark-env.sh. In case you only have a file called spark-env.sh.template, you will need to copy that file to a new file called spark-env.sh.
Edit the file and add the following three lines:
export PYSPARK_PYTHON=/usr/local/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python2.7
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/usr/local/bin/python2.7"
Save it and launch your application again :)
In that way, if you download a new standalone Spark version, you can set the Python version you want to run PySpark with.
I was running it in IPython (as described in this link by Jacek Wasilewski) and was getting this exception; I added PYSPARK_PYTHON to the IPython kernel file, ran it with Jupyter Notebook, and it started working.
vi ~/.ipython/kernels/pyspark/kernel.json
{
 "display_name": "pySpark (Spark 1.4.0)",
 "language": "python",
 "argv": [
  "/usr/bin/python2",
  "-m",
  "IPython.kernel",
  "--profile=pyspark",
  "-f",
  "{connection_file}"
 ],
 "env": {
  "SPARK_HOME": "/usr/local/spark-1.6.1-bin-hadoop2.6/",
  "PYTHONPATH": "/usr/local/spark-1.6.1-bin-hadoop2.6/python/:/usr/local/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip",
  "PYTHONSTARTUP": "/usr/local/spark-1.6.1-bin-hadoop2.6/python/pyspark/shell.py",
  "PYSPARK_SUBMIT_ARGS": "--master spark://127.0.0.1:7077 pyspark-shell",
  "PYSPARK_DRIVER_PYTHON": "ipython2",
  "PYSPARK_PYTHON": "python2"
 }
}
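After saving the kernel spec, restart Jupyter and select the kernel by its display name; for example:
jupyter notebook   # then pick "pySpark (Spark 1.4.0)" when creating a notebook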
Setting both PYSPARK_PYTHON=python3 and PYSPARK_DRIVER_PYTHON=python3 works for me. I did this using export in my .bashrc. In the end, these are the variables I create:
export SPARK_HOME="$HOME/Downloads/spark-1.4.0-bin-hadoop2.4"
export IPYTHON=1
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=ipython3
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
I also followed this tutorial to make it work from within an IPython3 notebook: http://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/
Please look at the below snippet:
# Setting environment variables for PySpark on Linux/Ubuntu
# Go to /usr/local/spark/conf
# Create a new file named spark-env.sh and copy all content of spark-env.sh.template to it
# Then add the lines below to it, with the path to your Python
PYSPARK_PYTHON="/usr/bin/python3"
PYSPARK_DRIVER_PYTHON="/usr/bin/python3"
PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser"
# I was running Python 3.6; run 'which python' in a terminal to find the path of your Python
If you are working on a Mac, use the following commands:
export SPARK_HOME=`brew info apache-spark | grep /usr | tail -n 1 | cut -f 1 -d " "`/libexec
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export HADOOP_HOME=`brew info hadoop | grep /usr | head -n 1 | cut -f 1 -d " "`/libexec
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/:$LD_LIBRARY_PATH
export PYSPARK_PYTHON=python3
If you are using another OS, check the following link: https://github.com/GalvanizeDataScience/spark-install
Helped in my case (note that these need to be set before the SparkContext is created):
import os
os.environ["SPARK_HOME"] = "/usr/local/Cellar/apache-spark/1.5.1/"
os.environ["PYSPARK_PYTHON"]="/usr/local/bin/python3"
I came across the same error message, and I have tried the three ways mentioned above. I list the results as a complementary reference for others.
1. Setting the PYTHON_SPARK and PYTHON_DRIVER_SPARK values in spark-env.sh does not work for me.
2. Setting os.environ["PYSPARK_PYTHON"]="/usr/bin/python3.5" and os.environ["PYSPARK_DRIVER_PYTHON"]="/usr/bin/python3.5" does not work for me.
3. Exporting the variables in ~/.bashrc works like a charm.
You can specify the version of Python for the driver by setting the appropriate environment variables in the ./conf/spark-env.sh file. If it doesn't already exist, you can use the spark-env.sh.template file provided, which also includes lots of other variables.
Here is a simple example of a spark-env.sh file to set the relevant Python environment variables:
#!/usr/bin/env bash
# This file is sourced when running various Spark programs.
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/ipython
In this case it sets the version of Python used by the workers/executors to Python 3, and the driver's Python to IPython for a nicer shell to work in.
If you don't already have a spark-env.sh file and don't need to set any other variables, this one should do what you want, assuming that the paths to the relevant Python binaries are correct (verify with which). I had a similar problem and this fixed it.
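For the paths used in this example, that which check would be (the expected output in the comments assumes the binaries live where the snippet says):
which python3   # expect /usr/bin/python3
which ipython   # expect /usr/bin/ipython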
I am using the following environment
$ python --version; ipython --version; jupyter --version
Python 3.5.2+
5.3.0
5.0.0
and the following aliases work well for me
alias pyspark="PYSPARK_PYTHON=/usr/local/bin/python3 PYSPARK_DRIVER_PYTHON=ipython ~/spark-2.1.1-bin-hadoop2.7/bin/pyspark --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11"
alias pysparknotebook="PYSPARK_PYTHON=/usr/bin/python3 PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook' ~/spark-2.1.1-bin-hadoop2.7/bin/pyspark --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11"
In the notebook, I set up the environment as follows
from pyspark.context import SparkContext
sc = SparkContext.getOrCreate()
In my case (Ubuntu 18.04), I ran this command in the terminal:
sudo vim ~/.bashrc
and then edited SPARK_HOME as follows:
export SPARK_HOME=/home/muser/programs/anaconda2019/lib/python3.7/site-packages/pyspark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
By doing so, my SPARK_HOME will refer to the pyspark package I installed in site-packages.
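The edit only affects new shells; a quick way to apply it to the current one:
source ~/.bashrc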
To learn how to use vim, go to this link.
Ran into this today at work. An admin thought it prudent to hard-code Python 2.7 as the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON in $SPARK_HOME/conf/spark-env.sh. Needless to say, this broke all of our jobs that utilize any other Python versions or environments (which is > 90% of our jobs). @PhillipStich points out correctly that you may not always have write permissions for this file, as is our case. While setting the configuration in the spark-submit call is an option, another alternative (when running in yarn/cluster mode) is to set the SPARK_CONF_DIR environment variable to point to another configuration script. There you can set your PYSPARK_PYTHON and any other options you may need. A template can be found in the spark-env.sh source code on GitHub.
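A minimal sketch of that workaround; the config directory and job script names are hypothetical:
mkdir -p ~/my-spark-conf
cp $SPARK_HOME/conf/spark-env.sh.template ~/my-spark-conf/spark-env.sh
# edit ~/my-spark-conf/spark-env.sh and add your PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON exports
export SPARK_CONF_DIR=~/my-spark-conf
spark-submit --master yarn --deploy-mode cluster my_job.py
Note that Spark then reads all of its configuration (e.g. spark-defaults.conf) from that directory, so copy over anything else your jobs rely on.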
Source: Stackoverflow.com