Spark: The General Unified Engine

#bigdata #spark #map-reduce

shriyansh jain Jun 30 2021 · 2 min read

History of Spark

In 2009, the AMPLab at UC Berkeley was building a resource manager called Mesos. To test whether it worked well, they created a framework similar to MapReduce, called Spark. The key difference was that Spark could keep intermediate data in memory instead of writing it to disk between steps. About a year later, people started exploring Spark because programs written in it ran faster than their MapReduce equivalents. In 2013 it became an Apache project.

MapReduce

The basic idea of MapReduce is divide and conquer. It is a framework, not a language. The main concept is that we bring the program to the data: the map phase processes each chunk of data where it is stored, and the reduce phase collects and aggregates the output. MapReduce has been around since 2004, but it only does batch processing and it is very slow. Back then it was the only way to deal with huge data, and it is still very useful: many products, such as MongoDB and Splunk, still rely on a MapReduce-style framework in the backend.
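To make the map and reduce phases concrete, here is a minimal single-machine sketch of a word count in plain Python. It only imitates the idea: a real framework would run the map step on many nodes next to the data and do the grouping over the network. The function names (map_phase, shuffle, reduce_phase, word_count) are made up for this illustration and are not part of Hadoop or Spark.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, imitating what the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Aggregate all values collected for one key."""
    return key, sum(values)


def word_count(lines):
    # Map: each line is processed independently (in a cluster, on the node holding it).
    mapped = [pair for line in lines for pair in map_phase(line)]
    # Shuffle: bring all pairs with the same word together.
    grouped = shuffle(mapped)
    # Reduce: collapse each group into a single count.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["spark is fast", "map reduce is batch", "spark is unified"]
    print(word_count(sample))
```

Because each line is mapped independently, the map step can be spread over as many machines as there are chunks of data, which is the divide and conquer part.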

Need for more than MapReduce

Different use cases need different tools to manage the workload. In the real world we need more than batch jobs: we need SQL, machine learning, and real-time data analysis on Hadoop, and MapReduce cannot do all of that alone. In the meantime various technologies came up. For SQL-like querying on Hadoop we have Impala, Drill, Hive, and Pig. For machine learning we have Mahout. For real-time analysis or real-time streaming we have Apache Storm.

So there is a library for each need, but the problem is that to start working in Hadoop you have to learn all of these tools. For example, to do real-time analysis we have to learn Storm, then learn how to integrate Storm with Hadoop, and then pick one SQL tool to look into the data.

Spark

When Spark arrived in 2014, its biggest benefit was that Spark alone can do what these 50+ tools are doing. That is the real advantage of Spark. Speed is a byproduct, not the real reason why we use it.

The primary reason people use Spark is that it is a general unified engine. The second reason is that it is faster, and the third is ease of programming.

Spark Ecosystem

Figure: Spark Ecosystem

In this diagram, the first layer shows that Spark supports four languages: Python, Scala, Java, and R. Scala was the fastest of these in Spark 1.6. From 2.0 onward, Python and Scala take roughly the same time to execute, at least for DataFrame-based code, thanks to optimization.

The second layer shows the libraries available in Spark. For executing SQL we have Spark SQL, which comes integrated with Hive. For machine learning we can use MLlib. GraphX is a framework for representing data as a graph, which is useful for social media platforms, among others. For real-time data analysis, Spark has Spark Streaming.

The third layer shows the modes in which Spark can run. The first is local mode, on a single local system, used for development and testing. The second is standalone mode: when there is no other cluster manager, Spark provides its own. Spark can also run on Mesos, where it was originally created, but this is rarely used. Most commonly it runs on YARN, since most organizations already have Hadoop and it makes sense to run Spark on YARN.

On top of every mode we can see the logo of a man standing with a stick. This is the logo of ZooKeeper. ZooKeeper can make the YARN, Mesos, and standalone modes highly available, or in other words, ZooKeeper ensures high availability.
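In practice, the run mode is chosen through the master URL passed to spark-submit with the --master option. The sketch below is a small plain-Python helper that builds that command line for each of the modes described above. The helper names (master_url, spark_submit_command), the host name master-node, and the script name wordcount.py are hypothetical. The URL formats and the default ports 7077 and 5050 are the commonly documented ones, so check them against your own cluster.

```python
# Master URL formats for the run modes discussed above.
MASTER_URL_TEMPLATES = {
    "local": "local[*]",                    # one machine, all available cores
    "standalone": "spark://{host}:{port}",  # Spark's own cluster manager
    "mesos": "mesos://{host}:{port}",       # Mesos cluster manager
    "yarn": "yarn",                         # cluster located via Hadoop config
}

# Commonly used default ports (assumption: the cluster keeps the defaults).
DEFAULT_PORTS = {"standalone": 7077, "mesos": 5050}


def master_url(mode, host=None, port=None):
    """Return the --master value for the given run mode."""
    template = MASTER_URL_TEMPLATES[mode]
    if "{host}" not in template:
        return template
    if host is None:
        raise ValueError(f"mode {mode!r} needs a host")
    return template.format(host=host, port=port or DEFAULT_PORTS[mode])


def spark_submit_command(mode, app, host=None, port=None):
    """Build the spark-submit argument list for one application."""
    return ["spark-submit", "--master", master_url(mode, host, port), app]


if __name__ == "__main__":
    for mode in ("local", "standalone", "mesos", "yarn"):
        print(" ".join(spark_submit_command(mode, "wordcount.py", host="master-node")))
```

The application script stays the same in every case. Only the master URL changes, which is why moving from local testing to a YARN cluster usually needs no code changes.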

References:

https://spark.apache.org/

https://www.youtube.com/watch?v=zC9cnh8rJd0&t=2806s
