How to set up Spark on Windows

Question

I am trying to set up Apache Spark on Windows. After searching a bit, I understand that standalone mode is what I want. Which binaries do I download in order to run Apache Spark on Windows? I see distributions with Hadoop and CDH on the Spark download page. I can't find references on the web for this. A step-by-step guide would be highly appreciated.

User · Accepted Answer

I found the easiest solution on Windows is to build from source.

You can pretty much follow this guide: http://spark.apache.org/docs/latest/building-spark.html

Download and install Maven, and set MAVEN_OPTS to the value specified in the guide.

But if you're just playing around with Spark, and don't actually need it to run on Windows for any reason other than that your own machine is running Windows, I'd strongly suggest you install Spark on a Linux virtual machine. The simplest way to get started is probably to download the ready-made images made by Cloudera or Hortonworks, and either use the bundled version of Spark, or install your own from source or from the compiled binaries you can get from the Spark website.

User · Answer

You can download Spark from here: http://spark.apache.org/downloads.html

I recommend this version: Hadoop 2 (HDP2, CDH5).

Since version 1.0.0 there are .cmd scripts to run Spark on Windows.

Unpack it using 7zip or similar.

To start, you can execute /bin/spark-shell.cmd --master local[2]

To configure your instance, you can follow this link: http://spark.apache.org/docs/latest/
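The `--master local[2]` argument asks Spark to run locally with two worker threads. As a rough illustration of how such a master string can be read, here is a small sketch; `local_thread_count` is a hypothetical helper written for this answer, not part of Spark, and it only handles the three simple forms `local`, `local[N]` and `local[*]`.

```python
import os
import re


def local_thread_count(master: str) -> int:
    """Return how many worker threads a simple local master URL asks for.

    Hypothetical helper for illustration only; Spark parses this itself.
    """
    if master == "local":
        return 1  # plain "local" means a single worker thread
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"not a simple local master URL: {master!r}")
    if match.group(1) == "*":
        # "*" means one thread per logical core on this machine
        return os.cpu_count() or 1
    return int(match.group(1))
```

For example, `local_thread_count("local[2]")` gives the thread count requested by the command above.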

User · Answer

Trying to work with spark-2.x.x, building the Spark source code didn't work for me.

1. So, although I'm not going to use Hadoop, I downloaded the pre-built Spark with Hadoop embedded: spark-2.0.0-bin-hadoop2.7.tar.gz

2. Point SPARK_HOME to the extracted directory, then add %SPARK_HOME%\bin to PATH.

3. Download the winutils executable from the Hortonworks repository, or from the Amazon AWS platform winutils.

4. Create a directory for the winutils.exe executable, for example C:\SparkDev\x64, and place the executable in a bin subdirectory of it, since the commands below expect it at %HADOOP_HOME%\bin\winutils.exe. Add the environment variable HADOOP_HOME, which points to this directory, then add %HADOOP_HOME%\bin to PATH.

5. Using the command line, create the directory:

    mkdir C:\tmp\hive

6. Using the executable that you downloaded, give full permissions to the directory you created, but using the Unix-style path notation:

    %HADOOP_HOME%\bin\winutils.exe chmod 777 /tmp/hive

7. Type the following command line:

    %SPARK_HOME%\bin\spark-shell

The Scala command-line prompt should appear automatically.

Remark: You don't need to configure Scala separately. It's built in too.
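The variable and permission steps above can be sketched in Python as follows. This is only an illustration under assumptions: the function names are made up for this answer, nothing is executed, and the paths are the example ones from the steps.

```python
import ntpath
import os


def winutils_chmod_command(hadoop_home: str, target: str = "/tmp/hive") -> list:
    """Build (but do not run) the winutils chmod command from step 6."""
    # ntpath joins with backslashes regardless of the platform running this
    winutils = ntpath.join(hadoop_home, "bin", "winutils.exe")
    return [winutils, "chmod", "777", target]


def missing_variables(env=None) -> list:
    """List which of SPARK_HOME and HADOOP_HOME are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in ("SPARK_HOME", "HADOOP_HOME") if not env.get(name)]
```

On Windows, the list returned by `winutils_chmod_command(os.environ["HADOOP_HOME"])` could be handed to `subprocess.run` once `missing_variables()` comes back empty.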

User · Answer

Here are seven steps to install Spark on Windows 10 and run it from Python.

Step 1: Download the Spark 2.2.0 tar (tape archive) gz file to any folder F from this link: https://spark.apache.org/downloads.html. Unzip it and copy the unzipped folder to the desired folder A. Rename the spark-2.2.0-bin-hadoop2.7 folder to spark. Let the path to the spark folder be C:\Users\Desktop\A\spark.

Step 2: Download the Hadoop 2.7.3 tar gz file to the same folder F from this link: https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz. Unzip it and copy the unzipped folder to the same folder A. Rename the folder from Hadoop-2.7.3.tar to hadoop. Let the path to the hadoop folder be C:\Users\Desktop\A\hadoop.

Step 3: Create a new Notepad text file. Save this empty Notepad file as winutils.exe (with "Save as type: All files"). Copy this 0 KB winutils.exe file to your bin folder in spark: C:\Users\Desktop\A\spark\bin.

Step 4: Now we have to add these folders to the system environment.

4a: Create a system variable (not a user variable, as a user variable will inherit all the properties of the system variable). Variable name: SPARK_HOME. Variable value: C:\Users\Desktop\A\spark. Find the Path system variable and click Edit. You will see multiple paths. Do not delete any of the paths. Add this value: ;C:\Users\Desktop\A\spark\bin

4b: Create a system variable. Variable name: HADOOP_HOME. Variable value: C:\Users\Desktop\A\hadoop. Find the Path system variable and click Edit. Add this value: ;C:\Users\Desktop\A\hadoop\bin

4c: Create a system variable. Variable name: JAVA_HOME. Search for Java in Windows, right-click it and click "Open file location". You will have to right-click again on any one of the Java files and click "Open file location". You will be using the path of this folder. Or you can look in C:\Program Files\Java. The Java version installed on my system is jre1.8.0_131. Variable value: C:\Program Files\Java\jre1.8.0_131\bin. Find the Path system variable and click Edit. Add this value: ;C:\Program Files\Java\jre1.8.0_131\bin

Step 5: Open a command prompt and go to your spark bin folder (type cd C:\Users\Desktop\A\spark\bin). Type spark-shell.

    C:\Users\Desktop\A\spark\bin>spark-shell

It may take time and give some warnings. Finally, it will show "Welcome to Spark version 2.2.0".

Step 6: Type exit() or restart the command prompt and go to the spark bin folder again. Type pyspark.

    C:\Users\Desktop\A\spark\bin>pyspark

It will show some warnings and errors, but ignore them. It works.

Step 7: Your download is complete. If you want to run Spark directly from the Python shell, go to Scripts in your Python folder and type this in the command prompt:

    pip install findspark

In the Python shell:

    import findspark
    findspark.init()

Import the necessary modules:

    from pyspark import SparkContext
    from pyspark import SparkConf

If you would like to skip the steps for importing findspark and initializing it, then please follow the procedure given in "importing pyspark in python shell".
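As a compact summary of step 4, the sketch below derives the three variables and the Path additions from folder A and the Java folder. The helper name `spark_environment` and the `PATH_ADDITIONS` key are made up for illustration; the function only computes strings and does not change any system settings.

```python
import ntpath


def spark_environment(base_dir: str, java_dir: str) -> dict:
    """Compute the variable values described in steps 4a to 4c.

    base_dir is folder A; java_dir is the Java folder chosen in step 4c.
    """
    spark_home = ntpath.join(base_dir, "spark")
    hadoop_home = ntpath.join(base_dir, "hadoop")
    return {
        "SPARK_HOME": spark_home,
        "HADOOP_HOME": hadoop_home,
        "JAVA_HOME": java_dir,
        # entries to append to the existing Path value, never to replace it
        "PATH_ADDITIONS": [
            ntpath.join(spark_home, "bin"),
            ntpath.join(hadoop_home, "bin"),
            java_dir,
        ],
    }
```

Printing the result for your own folder A gives a checklist to compare against the values entered in the System environment dialog.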

User · Answer

The guide by Ani Menon (thx!) almost worked for me on Windows 10; I just had to get a newer winutils.exe off that git (currently hadoop-2.8.1): https://github.com/steveloughran/winutils

User · Answer

You can use the following ways to set up Spark:

Building from source
Using a prebuilt release

There are various ways to build Spark from source. I first tried building the Spark source with SBT, but that requires Hadoop. To avoid those issues, I used a pre-built release.

Instead of the source, I downloaded the prebuilt release for Hadoop 2.x and ran it. For this you need to install Scala as a prerequisite.

I have collated all the steps here: How to run Apache Spark on Windows7 in standalone mode

Hope it'll help you.

User · Answer

Steps to install Spark in local mode:

1. Install Java 7 or later. To test that the Java installation is complete, open a command prompt, type java and hit Enter. If you receive the message 'Java' is not recognized as an internal or external command, you need to configure your environment variables JAVA_HOME and PATH to point to the path of the JDK.

2. Download and install Scala. Set SCALA_HOME in Control Panel\System and Security\System: go to "Adv System settings" and add %SCALA_HOME%\bin to the PATH variable in the environment variables.

3. Install Python 2.6 or later from the Python download link.

4. Download SBT. Install it and set SBT_HOME as an environment variable with the value <<SBT PATH>>.

5. Download winutils.exe from the HortonWorks repo or the git repo. Since we don't have a local Hadoop installation on Windows, we have to download winutils.exe and place it in a bin directory under a created Hadoop home directory. Set HADOOP_HOME = <<Hadoop home directory>> in the environment variables.

6. We will be using a pre-built Spark package, so choose a Spark package pre-built for Hadoop from the Spark download page. Download and extract it. Set SPARK_HOME and add %SPARK_HOME%\bin to the PATH variable in the environment variables.

7. Run the command: spark-shell

8. Open http://localhost:4040/ in a browser to see the SparkContext web UI.
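Before running spark-shell, it can help to confirm that the tools from the steps above are actually reachable through PATH. A minimal sketch, assuming the command names `java`, `scala`, `python`, `sbt` and `spark-shell`; the helper `find_tools` is made up for this answer and relies only on the standard library's `shutil.which`.

```python
import shutil


def find_tools(names=("java", "scala", "python", "sbt", "spark-shell"),
               which=shutil.which) -> dict:
    """Map each command name to its resolved path, or None if not on PATH."""
    return {name: which(name) for name in names}


if __name__ == "__main__":
    for tool, location in find_tools().items():
        print(f"{tool}: {location or 'NOT FOUND on PATH'}")
```

Any tool reported as not found points to a PATH entry that is still missing or needs a new command prompt to take effect.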

User · Answer

The Cloudera and Hortonworks distributions are the best tools to start with HDFS on Microsoft Windows. You can also use VMware or VirtualBox to start a virtual machine and build your HDFS, Spark, Hive, HBase, Pig and Hadoop environment there, with Scala, R, Java or Python.

User · Answer

Here are the fixes to get it to run on Windows without rebuilding everything, such as if you do not have a recent version of MS-VS. (You will need a Win32 C++ compiler, but you can install MS VS Community Edition for free.)

I've tried this with Spark 1.2.2 and Mahout 0.10.2 as well as with the latest versions in November 2015. There are a number of problems, including the fact that the Scala code tries to run a bash script (mahout/bin/mahout), which of course does not work, the sbin scripts have not been ported to Windows, and the winutils are missing if Hadoop is not installed.

(1) Install Scala, then unzip spark/hadoop/mahout into the root of C: under their respective product names.

(2) Rename \mahout\bin\mahout to mahout.sh.was (we will not need it).

(3) Compile the following Win32 C++ program and copy the executable to a file named C:\mahout\bin\mahout (that's right, no .exe suffix, like a Linux executable).

    #include "stdafx.h"
    #include <windows.h>  // GetEnvironmentVariable, DWORD, LPTSTR
    #include <tchar.h>    // _tmain, _tprintf, TEXT
    #include <stdlib.h>   // malloc
    #define BUFSIZE 4096
    #define VARNAME TEXT("MAHOUT_CP")

    // Print the value of MAHOUT_CP, as the original bash script would.
    int _tmain(int argc, _TCHAR* argv[]) {
        DWORD dwLength;
        LPTSTR pszBuffer;
        pszBuffer = (LPTSTR)malloc(BUFSIZE * sizeof(TCHAR));
        dwLength = GetEnvironmentVariable(VARNAME, pszBuffer, BUFSIZE);
        if (dwLength > 0) { _tprintf(TEXT("%s\n"), pszBuffer); return 0; }
        return 1;
    }

(4) Create the script \mahout\bin\mahout.bat and paste in the content below, although the exact names of the jars in the _CP class paths will depend on the versions of Spark and Mahout. Update any paths per your installation. Use 8.3 path names without spaces in them. Note that you cannot use wildcards (asterisks) in the classpaths here.

    set SCALA_HOME=C:\Progra~2\scala
    set SPARK_HOME=C:\spark
    set HADOOP_HOME=C:\hadoop
    set MAHOUT_HOME=C:\mahout
    set SPARK_SCALA_VERSION=2.10
    set MASTER=local[2]
    set MAHOUT_LOCAL=true
    set path=%SCALA_HOME%\bin;%SPARK_HOME%\bin;%PATH%
    cd /D %SPARK_HOME%
    set SPARK_CP=%SPARK_HOME%\conf\;%SPARK_HOME%\lib\xxx.jar;...other jars...
    set MAHOUT_CP=%MAHOUT_HOME%\lib\xxx.jar;...other jars...;%MAHOUT_HOME%\xxx.jar;...other jars...;%SPARK_CP%;%MAHOUT_HOME%\lib\spark\xxx.jar;%MAHOUT_HOME%\lib\hadoop\xxx.jar;%MAHOUT_HOME%\src\conf;%JAVA_HOME%\lib\tools.jar
    start "master0" "%JAVA_HOME%\bin\java" -cp "%SPARK_CP%" -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip localhost --port 7077 --webui-port 8082 >>out-master0.log 2>>out-master0.err
    start "worker1" "%JAVA_HOME%\bin\java" -cp "%SPARK_CP%" -Xms1g -Xmx1g org.apache.spark.deploy.worker.Worker spark://localhost:7077 --webui-port 8083 >>out-worker1.log 2>>out-worker1.err
    ...you may add more workers here...
    cd /D %MAHOUT_HOME%
    "%JAVA_HOME%\bin\java" -Xmx4g -classpath "%MAHOUT_CP%" "org.apache.mahout.sparkbindings.shell.Main"

The name of the variable MAHOUT_CP should not be changed, as it is referenced in the C++ code.

Of course you can comment out the code that launches the Spark master and worker, because Mahout will run Spark as needed; I just put it in the batch job to show you how to launch it if you wanted to use Spark without Mahout.

(5) The following tutorial is a good place to begin:

https://mahout.apache.org/users/sparkbindings/play-with-shell.html

You can bring up the Mahout Spark instance at:

    "C:\Program Files (x86)\Google\Chrome\Application\chrome" --disable-web-security http://localhost:4040
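Because the classpaths in the batch script cannot use wildcards, every jar has to be listed by name. One way to generate such a list is sketched below in Python; `explicit_classpath` is a hypothetical helper, and the directories passed to it are whatever lib folders your own Spark and Mahout versions contain.

```python
import glob
import os


def explicit_classpath(directories, separator=";"):
    """Expand each directory into an explicit, sorted list of its .jar files.

    Returns a single string joined with the Windows classpath separator.
    """
    entries = []
    for directory in directories:
        # glob.escape guards against special characters in the folder name
        pattern = os.path.join(glob.escape(directory), "*.jar")
        entries.extend(sorted(glob.glob(pattern)))
    return separator.join(entries)
```

The returned string can be pasted after `set SPARK_CP=` or `set MAHOUT_CP=`, alongside the non-jar entries such as the conf directories.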

User · Answer

Here is a simple minimal script to run from any Python console. It assumes that you have extracted the Spark libraries that you downloaded into "C:\Apache\spark-1.6.1".

This works on Windows without building anything and solves problems where Spark would complain about recursive pickling.

    import sys
    import os

    # Raw string so the backslashes in the Windows path are kept as-is
    spark_home = r'C:\Apache\spark-1.6.1'

    # Make the bundled PySpark and Py4J libraries importable
    sys.path.insert(0, os.path.join(spark_home, 'python'))
    sys.path.insert(0, os.path.join(spark_home, r'python\lib\pyspark.zip'))
    sys.path.insert(0, os.path.join(spark_home, r'python\lib\py4j-0.9-src.zip'))

    # Must come after the sys.path changes above
    import pyspark

    # Start a Spark context:
    sc = pyspark.SparkContext()

    # Read the README and show the first line mentioning Python
    lines = sc.textFile(os.path.join(spark_home, "README.md"))
    pythonLines = lines.filter(lambda line: "Python" in line)
    pythonLines.first()
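The Py4J zip name in the script above is tied to one Spark release. A possible variation, sketched here under the assumption that the zip sits in python\lib and is named py4j-<version>-src.zip, looks the file up instead of hard-coding it; `pyspark_path_entries` is a made-up helper name.

```python
import glob
import os


def pyspark_path_entries(spark_home):
    """Return the sys.path entries needed to import pyspark from spark_home."""
    python_dir = os.path.join(spark_home, "python")
    lib_dir = os.path.join(python_dir, "lib")
    # Find the bundled Py4J zip whatever its version number is
    pattern = os.path.join(glob.escape(lib_dir), "py4j-*-src.zip")
    py4j_zips = sorted(glob.glob(pattern))
    if not py4j_zips:
        raise FileNotFoundError("no py4j-*-src.zip found under " + lib_dir)
    return [python_dir, os.path.join(lib_dir, "pyspark.zip"), py4j_zips[-1]]
```

Each returned entry would then be passed to `sys.path.insert(0, ...)` before `import pyspark`, in the same way as in the script.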
