What is Apache Spark & How to Install Spark?
Blog: GestiSoft
Most industries use Hadoop to analyze large data sets; we've already discussed Hadoop in detail in our previous posts. In this post we'll look at Apache Spark from several angles: what it is, how it came into existence, its features, and finally how to install Spark successfully. Spark was introduced by the Apache Software Foundation to speed up Hadoop's computational processing.
Definition of Apache Spark
There are many definitions of Spark floating around the internet. You might have heard "Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics," or "Spark is a VERY fast in-memory data-processing framework – like lightning fast. 100x faster than Hadoop fast." All of these statements point to one thing: Spark reduces the time spent between queries and waiting for programs to run, which lets it run much faster than Hadoop. Many people think that Spark is an extension or modified version of Hadoop, which is not true at all. Spark has its own cluster management. Spark can use Hadoop in two possible ways, storage and processing, but since it manages its own clusters, Spark typically uses Hadoop for storage only.
That should clear the air about what Apache Spark is. Let's look at how it came into existence and whether or not it is a replacement for Hadoop MapReduce.
Evolution of Apache Spark
Spark was developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia. It was open sourced in 2010 under a BSD license and later became an Apache project. Now the question is: why did Apache Spark come into existence? Let's understand this.
Hadoop has been processing large data sets for around 10 years now and is considered one of the best big data processing technologies. Its data processing workflow has two phases, a Map phase and a Reduce phase, and you need to convert any use case into the MapReduce pattern to leverage this solution. MapReduce has proven to be a good solution for one-pass computations, but it is much less effective for use cases that need multi-pass computations.
In this approach, before a new step can start, the data from the previous step has to be stored in the distributed file system, which leads to other problems such as slow processing due to replication and disk I/O. Also, to execute something complex, a series of MapReduce jobs has to be strung together. Because these jobs execute in sequence, the next job cannot start until the previous job completes.
This is the reason Spark came into existence. Spark uses a DAG (Directed Acyclic Graph) execution pattern, which allows multi-pass computations. Different jobs can also work on the same data, since this approach supports in-memory data sharing across DAGs. Spark should be seen as an alternative to, not a replacement for, Hadoop MapReduce: the goal is to solve some of the problems that occur with the MapReduce approach.
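The difference between writing intermediate results to disk between steps and pipelining them in memory can be loosely illustrated with ordinary shell commands. This is only a rough analogy, not actual MapReduce or Spark code:

```shell
# MapReduce-style chaining: each stage writes its full output to disk
# before the next stage starts (plain files stand in for HDFS here).
printf 'spark\nhadoop\nspark\n' > stage1.txt
sort stage1.txt > stage2.txt          # the next "job" reads the previous output from disk
uniq -c stage2.txt

# Spark-style pipelining: stages are wired together and data flows
# through without touching disk in between.
printf 'spark\nhadoop\nspark\n' | sort | uniq -c

rm -f stage1.txt stage2.txt           # clean up the intermediate files
```

Both pipelines produce the same word counts; the point is that the second form never materializes intermediate results on disk.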
Features of Apache Spark
- Fast: With capabilities like in-memory computation and near real-time processing, Spark is much faster than Hadoop MapReduce: about 100 times faster in memory and about 10 times faster when running on disk.
- Multiple Programming Languages: Spark is written in Scala and runs on the Java Virtual Machine, but it supports multiple programming languages such as Scala, Java, Python, Clojure, and R.
- In-Memory Computing: Spark supports in-memory data sharing across DAGs, which allows different jobs to work on the same data.
- DAG Pattern: Spark uses a DAG (Directed Acyclic Graph) execution pattern, which allows multi-pass computations.
- Enhanced Analytics: Beyond MapReduce-style operations, Spark also supports SQL queries, machine learning (ML), streaming data, and graph algorithms.
How to Install Spark?
Now the question is how to install Apache Spark. We have broken down the entire procedure into four simple steps. Follow the installation steps below exactly.
Step 1: Install Java
First you need to install Java. Run the following commands:
- sudo apt-add-repository ppa:webupd8team/java
- sudo apt-get update
- sudo apt-get install oracle-java7-installer
To check whether Java is successfully installed, run the following command:
- java -version
If Java is installed, you will see output like the following:
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
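If you want to script this check rather than read the output by eye, the version number can be pulled out of the first line with `sed`. This is a small convenience sketch; the quoted string below is just the sample output from above (in a real script you would capture it with `line=$(java -version 2>&1 | head -n 1)`):

```shell
# Sample first line of `java -version` output (hard-coded for illustration).
line='java version "1.7.0_71"'

# Extract whatever sits between the double quotes.
version=$(echo "$line" | sed -n 's/.*"\([^"]*\)".*/\1/p')
echo "$version"    # prints 1.7.0_71
```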
Step 2: Install Scala
Step two is to download the latest version of Scala from www.scala-lang.org and then run the following commands:
- sudo mkdir /usr/local/src/scala
- sudo tar -xvf scala-2.11.7.tgz -C /usr/local/src/scala/
Then open your .bashrc file (for example with nano .bashrc) and append the following lines:
- export SCALA_HOME=/usr/local/src/scala/scala-2.11.7
- export PATH=$SCALA_HOME/bin:$PATH
Reload the file so the changes take effect:
- . .bashrc
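The two export lines above simply prepend Scala's bin directory to your search path. You can sanity-check the result like this (the directory shown assumes the extraction location used in this step):

```shell
# Same assignments as in .bashrc (path is the assumed install location from above).
SCALA_HOME=/usr/local/src/scala/scala-2.11.7
PATH=$SCALA_HOME/bin:$PATH

# The first PATH entry should now be Scala's bin directory.
echo "$PATH" | cut -d: -f1
```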
To check whether Scala is successfully installed, run the following command:
- scala -version
If Scala is successfully installed, you will see output like:
Scala code runner version 2.11.7 -- Copyright 2002-2013, LAMP/EPFL
Step 3: Install Git
The next step is to install Git, using the command below:
- sudo apt-get install git
Step 4: Build Spark
The final step is to download the latest version of Spark from www.spark.apache.org, then extract and build it:
- tar -xvf spark-1.4.1.tgz
- sbt/sbt assembly
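Once the build finishes, a quick way to confirm it worked is to check for the spark-shell launcher that the build produces. This is a hedged sketch; `spark-1.4.1` is assumed to be the directory the tarball above unpacked into:

```shell
# Assumed extraction directory from the tar command above.
SPARK_HOME=./spark-1.4.1

# The spark-shell launcher should exist and be executable after a successful build.
if [ -x "$SPARK_HOME/bin/spark-shell" ]; then
  echo "build looks complete: spark-shell launcher found"
else
  echo "spark-shell launcher not found; check the sbt/sbt assembly output"
fi
```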
Conclusion:
The purpose of this post is to help you understand the basics of Apache Spark. We've studied how Apache Spark came into existence, what the definition of Spark is, and how it improves on the MapReduce approach. We also looked at how to install Spark successfully. Hopefully this post serves its purpose by giving you what you were looking for. Your feedback is always welcome; if you have any questions or suggestions, write to us in the comment box.