Apache Spark is a cluster computing framework designed for fast computation. It builds on the Hadoop MapReduce model and extends it to efficiently support more types of computation, including interactive queries and stream processing. Its main feature is in-memory cluster computing, which increases the processing speed of an application.
Spark is designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries, and streaming. By supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.
Features of Apache Spark
Apache Spark has the following features.
Components of Spark
The following illustration depicts the different components of Spark.
Apache Spark Core
Spark Core is the underlying general execution engine for the Spark platform, on which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems.
Spark SQL is a component on top of Spark Core that introduced a data abstraction called SchemaRDD (renamed DataFrame in Spark 1.3), which provides support for structured and semi-structured data.
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations on those mini-batches of data.
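The mini-batch idea can be illustrated without Spark at all. The following is a minimal plain-Python sketch, not the Spark Streaming API: the helper names mini_batches and process_stream are hypothetical, and batches are cut by item count here, whereas Spark Streaming cuts them by time interval.

```python
from itertools import islice


def mini_batches(stream, batch_size):
    """Yield successive lists of at most batch_size items from an iterable."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def process_stream(stream, batch_size, transform):
    """Apply the same batch-level transformation to every mini-batch."""
    return [transform(batch) for batch in mini_batches(stream, batch_size)]


def count_words(batch):
    """Per-batch word count, standing in for an RDD transformation."""
    return {word: batch.count(word) for word in set(batch)}


events = ["a", "b", "a", "c", "b", "a", "a"]
counts_per_batch = process_stream(events, 3, count_words)
print(counts_per_batch)
```

Each mini-batch is processed independently with the same transformation, which is the property that lets Spark Streaming reuse the batch engine for streaming data.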
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework on top of Spark that benefits from Spark's distributed memory-based architecture. According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computations that can model user-defined graphs by using the Pregel abstraction API. It also provides an optimized runtime for this abstraction.
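To give a feel for the Pregel abstraction, here is a small single-machine sketch in plain Python. It is not the GraphX API; the function name pregel_sssp and the adjacency-dict graph format are assumptions made for illustration. Each vertex receives messages, updates its state if a message improves it, and sends new messages to its neighbours; the loop stops when no messages remain. Non-negative edge weights are assumed.

```python
import math


def pregel_sssp(edges, source):
    """Single-source shortest paths in a Pregel-like, vertex-centric style.

    edges maps each vertex to a list of (neighbour, weight) pairs.
    """
    vertices = set(edges)
    for neighbours in edges.values():
        for dst, _ in neighbours:
            vertices.add(dst)

    dist = {v: math.inf for v in vertices}
    inbox = {source: [0.0]}  # initial message to the source vertex

    # One loop iteration corresponds to one superstep.
    while inbox:
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < dist[vertex]:
                dist[vertex] = best
                # Only vertices whose state changed send messages.
                for dst, weight in edges.get(vertex, []):
                    outbox.setdefault(dst, []).append(best + weight)
        inbox = outbox
    return dist


graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
print(pregel_sssp(graph, "a"))
```

In a distributed runtime the vertices of one superstep are processed in parallel across the cluster; the sketch only shows the message-passing logic.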
PySpark – Overview
Apache Spark is written in the Scala programming language. To support Python with Spark, the Apache Spark community released a tool called PySpark. Using PySpark, you can also work with RDDs in the Python programming language. This is made possible by a library called Py4J.
PySpark offers the PySpark shell, which links the Python API to Spark Core and initializes the Spark context. Many data scientists and analytics experts today use Python because of its rich library set, so integrating Python with Spark is a boon to them.
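As a minimal sketch of working with an RDD from Python: the PySpark shell exposes the initialized Spark context as sc, and the function below takes that context as a parameter. The function name even_squares is hypothetical; parallelize, map, filter, and collect are standard RDD methods.

```python
def even_squares(sc, n=10):
    """Return the even squares of 0..n-1, computed through an RDD."""
    numbers = sc.parallelize(range(n))            # distribute a local collection as an RDD
    squares = numbers.map(lambda x: x * x)        # transformation, evaluated lazily
    evens = squares.filter(lambda x: x % 2 == 0)  # another lazy transformation
    return evens.collect()                        # action: runs the job and returns a list
```

Inside the PySpark shell, calling even_squares(sc) should return [0, 4, 16, 36, 64].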
PySpark Installation and setup
1. Install Java
Before you can start with Spark and Hadoop, make sure you have Java installed (version 8 or above). Go to Java's official download website, accept the Oracle license, and download the Java JDK 8 build suitable for your system.
This will take you to Java downloads. Scroll down until you see the section below and click on the Download button.
This will take you to the download page. Scroll down to the section shown below and accept the License Agreement and select the download option for your operating system.
After downloading, install Java on your system. By default, Java will be installed in:
Add the following environment variable:
JAVA_HOME = C:\Program Files\Java\jdk1.8.0_201
Add the following directory to the PATH variable:
To set the path in the system variables, see the images below.
Click on the Advanced tab and then click on Environment Variables. The following should show up.
A New User Variable window will pop up. Now create the JAVA_HOME variable, as shown in the image below.
After adding JAVA_HOME to the system variables, add C:\Program Files\Java\jdk1.8.0_201\bin to the PATH variable.
2. Download and Install Spark
Go to the Spark home page and download the .tgz file for version 3.0.1 (02 Sep 2020), which is the latest version of Spark at the time of writing. After that, choose the package shown in the image.
Extract the file to your chosen directory (7z can open tgz). In my case, it was C:\spark. There is another compressed directory in the tar; extract it (into here) as well.
Set up the environment variables (adjust the folder name to match the package you extracted):
SPARK_HOME = C:\spark\spark-2.3.2-bin-hadoop2.7
HADOOP_HOME = C:\spark\spark-2.3.2-bin-hadoop2.7
Add the following path to the PATH environment variable:
3. Spark: Some more stuff (winutils)
2. Move the winutils.exe file into the bin folder inside the HADOOP_HOME directory.
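Once the variables are set and winutils.exe is in place, a quick sanity check from Python can catch a missing variable before Spark fails with a less obvious error. This is a minimal sketch using only the standard library; the function name check_spark_env is hypothetical, and the exists parameter is there only so the check can be exercised without touching the disk.

```python
import os

REQUIRED_VARS = ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME")


def check_spark_env(env=None, exists=os.path.exists):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = []

    # Every variable from the steps above must be set and non-empty.
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")

    # winutils.exe is expected in the bin folder under HADOOP_HOME.
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home:
        winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
        if not exists(winutils):
            problems.append(f"winutils.exe not found at {winutils}")
    return problems


for problem in check_spark_env():
    print(problem)
```

Open a new command prompt before running it, since environment variable changes are not picked up by prompts that were already open.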
4. Install Anaconda framework
We need to install the Anaconda framework on our system. Spark can also run without Anaconda, but in our case we will use it.
After installing Anaconda successfully, verify it by running the “conda” command in the command prompt. Then run the following command in the Anaconda prompt to create a virtual environment. Once the environment is set up, it is important to activate it. In the command below, environment_name can be any name you choose, and the Python version should be greater than 3.5.
conda create --name environment_name python=3.6.9
conda activate environment_name
After activating the environment, run pip install jupyter notebook
5. Check PySpark installation
In your Anaconda prompt, type pyspark to enter the PySpark shell. To be safe, it is best to check it in the Python environment from which you run Jupyter Notebook. You should see the following:
6. PySpark with Jupyter notebook
Install findspark to access the Spark instance from a Jupyter notebook. Check the current installation command on Anaconda Cloud. Either of the following commands works:
conda install -c conda-forge findspark
pip install findspark
Open your Python Jupyter notebook and write the following inside:
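The original notebook cell is not preserved here, so the following is only a sketch of what such a cell typically looks like, assuming findspark and Spark are installed as described above and SPARK_HOME is set. The function name start_session and the application name are hypothetical; findspark.init() and SparkSession.builder are the real entry points.

```python
def start_session(app_name="findspark-check"):
    """Locate the Spark installation and return a SparkSession."""
    import findspark

    # Uses SPARK_HOME to put Spark's Python libraries on sys.path.
    findspark.init()

    # Imported after findspark.init(), since pyspark may not be importable before it runs.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()
```

In a later cell, spark = start_session() followed by spark.range(5).count() should return 5 if the notebook can reach the Spark installation.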