[windows] How to set up Spark on Windows?

Here are seven steps to install Spark on Windows 10 and run it from Python:

Step 1: Download the Spark 2.2.0 tar (tape archive) gz file to any folder F from this link - https://spark.apache.org/downloads.html. Unzip it and copy the unzipped folder to the desired folder A. Rename the spark-2.2.0-bin-hadoop2.7 folder to spark.

Let the path to the spark folder be C:\Users\Desktop\A\spark

Step 2: Download the Hadoop 2.7.3 tar gz file to the same folder F from this link - https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz. Unzip it and copy the unzipped folder to the same folder A. Rename the folder from hadoop-2.7.3 to hadoop. Let the path to the hadoop folder be C:\Users\Desktop\A\hadoop

Step 3: Create a new Notepad text file. Save this empty Notepad file as winutils.exe (with Save as type: All files). Copy this 0 KB winutils.exe file to your bin folder in spark - C:\Users\Desktop\A\spark\bin
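If you want to double-check that the file landed where Spark expects it, a quick sanity check from Python (using the example path above) is:

import os
# should print True if winutils.exe is in the spark bin folder
print(os.path.exists(r"C:\Users\Desktop\A\spark\bin\winutils.exe"))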

Step 4: Now, we have to add these folders to the system environment variables.

4a: Create a system variable (not a user variable, as a user variable will inherit all the properties of the system variable)

Variable name: SPARK_HOME
Variable value: C:\Users\Desktop\A\spark

Find the Path system variable and click Edit. You will see multiple paths. Do not delete any of them. Add this variable value - ;C:\Users\Desktop\A\spark\bin

4b: Create a system variable

Variable name: HADOOP_HOME
Variable value: C:\Users\Desktop\A\hadoop

Find the Path system variable and click Edit. Add this variable value - ;C:\Users\Desktop\A\hadoop\bin

4c: Create a system variable

Variable name: JAVA_HOME

To find the value, search for Java in Windows, right-click it, and click Open file location. You will have to right-click again on one of the Java files and click Open file location. You will be using the path of this folder. Alternatively, you can look in C:\Program Files\Java. The Java version installed on my system is jre1.8.0_131.

Variable value: C:\Program Files\Java\jre1.8.0_131\bin

Find the Path system variable and click Edit. Add this variable value - ;C:\Program Files\Java\jre1.8.0_131\bin
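Before moving on, you can verify the three variables from a fresh Python session (open a new Command Prompt first so the changes are picked up). This is just a sanity check using the example paths above:

import os
# each of these should print the folder you set in Step 4; None means the variable was not picked up
print(os.environ.get("SPARK_HOME"))   # e.g. C:\Users\Desktop\A\spark
print(os.environ.get("HADOOP_HOME"))  # e.g. C:\Users\Desktop\A\hadoop
print(os.environ.get("JAVA_HOME"))    # e.g. C:\Program Files\Java\jre1.8.0_131\bin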

Step 5: Open Command Prompt and go to your spark bin folder (type cd C:\Users\Desktop\A\spark\bin). Type spark-shell.

C:\Users\Desktop\A\spark\bin>spark-shell

It may take some time and show some warnings. Finally, it will show Welcome to Spark version 2.2.0.

Step 6: Type :quit to leave the Scala shell, or restart the Command Prompt, and go to the spark bin folder again. Type pyspark:

C:\Users\Desktop\A\spark\bin>pyspark

It will show some warnings and errors, but you can ignore them. It works.
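To confirm the shell is really working, you can run a tiny job. The pyspark shell already provides a SparkContext as sc, so something like this should print 45:

# inside the pyspark shell; sc is created for you
rdd = sc.parallelize(range(10))
print(rdd.sum())  # 0 + 1 + ... + 9 = 45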

Step 7: Your installation is complete. If you want to run Spark directly from the Python shell, go to the Scripts folder in your Python installation and type

pip install findspark

in Command Prompt.

In the Python shell:

import findspark
findspark.init()

Import the necessary modules:

from pyspark import SparkContext
from pyspark import SparkConf
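Putting it together, a minimal script run from the regular Python shell (not the pyspark shell) could look like the sketch below. The app name and the word-count data are just illustrative examples:

import findspark
findspark.init()  # uses the SPARK_HOME variable set in Step 4

from pyspark import SparkConf, SparkContext

# run Spark locally on all cores; "test-app" is an arbitrary example name
conf = SparkConf().setMaster("local[*]").setAppName("test-app")
sc = SparkContext(conf=conf)

# trivial job: count words in a small in-memory list
words = sc.parallelize(["spark", "hadoop", "spark", "python"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())  # e.g. [('spark', 2), ('hadoop', 1), ('python', 1)]

sc.stop()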

If you would like to skip the steps for importing findspark and initializing it, please follow the procedure given in importing pyspark in python shell.
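For reference, what findspark.init() essentially does is put Spark's Python packages on sys.path. A rough sketch of doing that by hand (assuming the SPARK_HOME variable from Step 4, and locating the bundled py4j zip with a glob rather than hardcoding its version) looks like this:

import glob
import os
import sys

spark_home = os.environ["SPARK_HOME"]  # e.g. C:\Users\Desktop\A\spark
sys.path.insert(0, os.path.join(spark_home, "python"))
# the py4j zip name depends on the Spark release, so find it instead of hardcoding it
sys.path.insert(0, glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))[0])

from pyspark import SparkContext  # should now import without findspark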