Install Spark (PySpark) to run in Jupyter Notebook on Windows

sunny savita Oct 13 2020 · 4 min read

Apache Spark

Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It builds on the MapReduce model and extends it to efficiently support more types of computations, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.

Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming. By supporting all of these workloads in a single system, it reduces the management burden of maintaining separate tools.

Features of Apache Spark

Apache Spark has the following features.

  • Speed − Spark helps run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. It achieves this by reducing the number of read/write operations to disk and by storing intermediate processing data in memory.
  • Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so you can write applications in different languages. Spark also ships with over 80 high-level operators for interactive querying.
  • Advanced Analytics − Spark supports not only 'map' and 'reduce' but also SQL queries, streaming data, machine learning (ML), and graph algorithms.
Components of Spark

    Spark is made up of the following components.

    Apache Spark Core

    Spark Core is the underlying general execution engine for the Spark platform upon which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems.

    Spark SQL

    Spark SQL is a component on top of Spark Core that introduces a data abstraction called SchemaRDD (the forerunner of today's DataFrame), which provides support for structured and semi-structured data.
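    For illustration, here is a minimal sketch of querying structured data through Spark SQL from Python (it assumes the PySpark setup described later in this guide; the table name people, the app name, and the sample rows are made up for the example):

    from pyspark.sql import SparkSession

    # Start a local SparkSession ("sql-demo" is just an arbitrary app name).
    spark = SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()

    # Build a small DataFrame (the modern successor of SchemaRDD) from in-memory rows.
    people = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

    # Register it as a temporary view and query it with plain SQL.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 40").show()

    spark.stop()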

    Spark Streaming

    Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations on those mini-batches of data.

    MLlib (Machine Learning Library)

    MLlib is a distributed machine learning framework built on top of Spark, taking advantage of Spark's distributed memory-based architecture. According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).

    GraphX

    GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computations that can model user-defined graphs using the Pregel abstraction, and it provides an optimized runtime for this abstraction.

    PySpark – Overview

    Apache Spark is written in the Scala programming language. To support Python with Spark, the Apache Spark community released a tool called PySpark. Using PySpark, you can work with RDDs in the Python programming language as well; this is made possible by a library called Py4j.

    PySpark offers the PySpark shell, which links the Python API to the Spark core and initializes the Spark context. The majority of data scientists and analytics experts today use Python because of its rich library set, so integrating Python with Spark is a boon to them.
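    As a quick illustration, here is a minimal sketch of working with an RDD from Python, assuming a running PySpark shell (set up below), where sc is the pre-created SparkContext:

    # Distribute a small Python list as an RDD, transform it, and collect the result.
    rdd = sc.parallelize([1, 2, 3, 4, 5])
    squares = rdd.map(lambda x: x * x)
    print(squares.collect())   # [1, 4, 9, 16, 25]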

    PySpark Installation and setup

    1. Install Java 

    Before you can start with Spark and Hadoop, you need to make sure Java is installed (version 8 or above). Go to Java's official download website, accept the Oracle license, and download the Java JDK 8 installer suitable for your system.

    This will take you to the Java downloads page. Accept the Oracle License Agreement and select the download option for your operating system.

    After downloading, install Java on your system. By default, Java will be installed in:

    C:\Program Files\Java\jdk1.8.0_201

    Add the following environment variable:

    JAVA_HOME = C:\Program Files\Java\jdk1.8.0_201

    Add the following directory to the PATH variable:

    C:\Program Files\Java\jdk1.8.0_201\bin

    To set these system variables, open the System Properties dialog.

    Click on the Advanced tab and then click on Environment Variables. The Environment Variables window will show up.

    Click New; a New Variable window will pop up. Create the JAVA_HOME variable there with the value shown above.

    After adding JAVA_HOME to the system variables, add C:\Program Files\Java\jdk1.8.0_201\bin to the Path variable.
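    To confirm the Java setup, open a new command prompt and run:

    java -version
    echo %JAVA_HOME%

    java -version should report a 1.8.x build, and echo %JAVA_HOME% should print the JDK path configured above.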

    2. Download and Install Spark

    Go to the Spark downloads page and download the .tgz file for version 3.0.1 (02 Sep 2020), the latest version of Spark at the time of writing. Choose a package type that is pre-built for Apache Hadoop 2.7 (the paths in this guide use an older spark-2.3.2-bin-hadoop2.7 package, so adjust the folder names to whatever you actually download).

    Extract the file to your chosen directory (7-Zip can open .tgz files). In my case, it was C:\spark. The archive contains another compressed directory inside; extract that one into the same location as well.

    Set up the following environment variables:

    SPARK_HOME = C:\spark\spark-2.3.2-bin-hadoop2.7
    HADOOP_HOME = C:\spark\spark-2.3.2-bin-hadoop2.7

    Add the SPARK_HOME (and HADOOP_HOME) variables in the same way as JAVA_HOME; if you extracted a different Spark version, adjust the folder names accordingly.
    Then add the following directory to the PATH environment variable:

    C:\spark\spark-2.3.2-bin-hadoop2.7\bin
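    Once these variables are saved, you can verify them from a new command prompt:

    echo %SPARK_HOME%
    echo %HADOOP_HOME%

    Both should print the Spark directory you extracted; if they come back empty, re-open the command prompt so it picks up the new variables.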

    3. Spark: Some more stuff (winutils)

  • Download winutils.exe from here: https://github.com/steveloughran/winutils
  • Choose the same version as the package type you chose for the Spark .tgz file in section 2 "Download and Install Spark" (in my case: hadoop-2.7.1)
  • You need to navigate inside the hadoop-X.X.X folder, and inside the bin folder you will find winutils.exe
  • If you chose the same version as me (hadoop-2.7.1) here is the direct link: https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe
  • Move the winutils.exe file into the bin folder of your Spark installation (a quick check follows this list).

  • In my case: C:\spark\spark-2.3.2-bin-hadoop2.7\bin (the bin folder inside SPARK_HOME)
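    After copying the file, a quick check from the command prompt confirms that winutils.exe is where Spark expects it (this assumes HADOOP_HOME was set as in step 2):

    dir %HADOOP_HOME%\bin\winutils.exe

    If the file is listed, the winutils setup is done.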
    4. Install Anaconda framework

    We need to install the Anaconda distribution on our system. Spark can also run without Anaconda, but in this guide we will use the Anaconda framework.

    After successfully installing Anaconda, verify it from the command prompt with the conda command. Then, from the Anaconda prompt, execute the following command to create a virtual environment. Once the environment is created, it is important to activate it.

    conda create --name environment_name python=3.6.9

    Here environment_name can be whatever name you want to give, and the Python version should be greater than 3.5.

    conda activate environment_name           

    Then, after activating the environment, run pip install jupyter notebook to install Jupyter Notebook in it.
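    Optionally, double-check the environment before moving on:

    conda env list
    python --version

    The active environment is marked with an asterisk, and the Python version should match the one requested when the environment was created.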

    5. Check PySpark installation

    In your Anaconda prompt, type pyspark to enter the PySpark shell. To be prepared, it is best to check this in the Python environment from which you will run Jupyter Notebook. You are supposed to see something like the following:

    (Screenshot: the PySpark shell running in the Anaconda prompt.)
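    Inside the shell, spark (a SparkSession) and sc (a SparkContext) are already created for you, so a couple of one-liners are enough to confirm that everything works, for example:

    sc.version                  # prints the Spark version
    spark.range(100).count()    # should return 100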

    6. PySpark with Jupyter notebook

    Install findspark to access the Spark instance from a Jupyter notebook. You can check the current version of the package on Anaconda Cloud.

    conda install -c conda-forge findspark
    or

    pip install findspark

    Open a Python Jupyter notebook and write the following inside:

    import findspark
    findspark.init()    # adds PySpark to sys.path using the SPARK_HOME variable

    import pyspark
    findspark.find()    # returns the Spark installation path that was found
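    As a final sanity check (a minimal sketch; the app name and sample rows are arbitrary), create a SparkSession in the notebook and run a trivial job:

    from pyspark.sql import SparkSession

    # Start (or reuse) a local SparkSession.
    spark = SparkSession.builder.master("local[*]").appName("check").getOrCreate()

    # Create a tiny DataFrame and display it to confirm the installation works.
    df = spark.createDataFrame([(1, "spark"), (2, "pyspark")], ["id", "name"])
    df.show()

    spark.stop()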

    Troubleshooting

  • Anaconda pyspark: Anaconda has its own pyspark package. In my case, the Apache PySpark installation and the Anaconda one did not coexist well, so I had to uninstall the Anaconda pyspark package.
  • The code will not work if you have more than one Spark or spark-shell instance open.
  • Print the environment variables inside your Jupyter notebook to verify that they are set correctly:

    import os
    print(os.environ['SPARK_HOME'])
    print(os.environ['JAVA_HOME'])
    print(os.environ['PATH'])

  • If you need more explanation on how to manage system variables, the command prompt, etc., it's all covered here: basic-window-tools-for-installations
  • The set command in cmd prints out all environment variables and their values, so use it to check that your changes took place.
  • As always, re-opening cmd, or even rebooting, can solve problems.