
What is Apache Spark & How to Install Spark?


To analyze large data sets, many industries use Hadoop. We've already discussed Hadoop in detail in our previous posts. In this post we are going to look at various aspects of Apache Spark: what it is, how it came into existence, its features, and finally how to install Spark successfully. Spark was introduced by the Apache Software Foundation to speed up Hadoop's computational processing.

Definition of Apache Spark

There are many definitions of Spark floating around the internet. You might have heard that "Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics," or that "Spark is a VERY fast in-memory data-processing framework – like lightning fast. 100x faster than Hadoop fast." All these descriptions signify one common thing: Spark reduces the time between queries and the waiting time to run a program, which enables it to run much faster than Hadoop MapReduce. Many people think that Spark is an extension or a modified version of Hadoop, which is not true at all. Spark has its own cluster management and computation engine. Spark can use Hadoop in two possible ways, for storage and for processing, but since it has its own cluster management, it typically uses Hadoop for storage only.

This should clear the air about what Apache Spark is. Let's now study how it came into existence and whether or not it is a replacement for Hadoop MapReduce.

Evolution of Apache Spark

Spark was developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia. It was open sourced in 2010 under a BSD license and later became a top-level Apache project. Now the question is: why did Apache Spark come into existence? Let's understand this.

Hadoop has been processing large data sets for around 10 years now and is considered one of the best big data processing technologies. Its data-processing workflow has two phases, a Map phase and a Reduce phase, and you need to convert any use case into the MapReduce pattern to leverage this solution. MapReduce has proven to be a good solution for one-pass computations, but it is not very effective for use cases that need multi-pass computations.

In this approach, before a new step can start, the output of the previous step has to be written to the distributed file system, which leads to other problems such as slow processing due to replication and disk I/O. Also, to execute something complex, a series of MapReduce jobs has to be strung together. Since these jobs execute in sequence, the next job cannot start until the previous job has completed.

This is the reason Spark came into existence. Spark uses a DAG (Directed Acyclic Graph) execution pattern, which allows multi-pass computations. Also, different jobs can work on the same data, because this approach supports in-memory data sharing across DAGs. Spark should be seen as an alternative to, not a replacement for, Hadoop MapReduce; the purpose behind this approach is to solve a few of the problems that occur with the MapReduce approach.
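To make the in-memory, multi-pass idea concrete, here is a minimal sketch, assuming Spark has already been installed (see the steps below) and that you run it from the Spark directory, which ships with a README.md file. The data set is loaded and cached once, and two different computations then reuse it from memory instead of re-reading it from disk between jobs, as chained MapReduce steps would have to.

$ ./bin/spark-shell <<'EOF'
val lines = sc.textFile("README.md").cache()   // read the file once, then keep the RDD in memory
println("total lines: " + lines.count())       // first pass over the data
println("lines mentioning Spark: " + lines.filter(_.contains("Spark")).count())   // second pass, served from the cache
EOF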

Features of Apache Spark

Some of the key features that make Spark stand out are:

Speed: thanks to in-memory computation, Spark applications can run much faster than equivalent MapReduce jobs.

Ease of use: Spark offers built-in APIs in Scala, Java, Python and R, so you can write applications in the language you are most comfortable with.

Advanced analytics: in addition to map and reduce, Spark supports SQL queries (Spark SQL), streaming data (Spark Streaming), machine learning (MLlib) and graph processing (GraphX).

How to Install Spark?

Now the question is how to install Apache Spark. We have broken the entire procedure down into four simple steps. Follow the installation steps explained below carefully.

Step 1: Install Java

First you need to install Java on your system, and then run the commands below.
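The exact command depends on your operating system and on which JDK you prefer. As one hedged example, on a Debian/Ubuntu system you can install a JDK straight from the package manager (installing Oracle's JDK from their download page works just as well):

$ sudo apt-get update
$ sudo apt-get install default-jdk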

To check whether Java was installed successfully, run the following command:
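$ java -version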

If Java is installed, you will see output similar to the following:

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

Step 2: Install Scala

Step two is to download the latest version of Scala from www.scala-lang.org and then run the following commands.
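As an example, assuming you downloaded the scala-2.11.6.tgz archive (adjust the file name to whatever version you actually downloaded), a typical sequence is to extract it, move it to a system directory, and put its bin directory on your PATH:

$ tar xvf scala-2.11.6.tgz
$ sudo mv scala-2.11.6 /usr/local/scala
$ export PATH=$PATH:/usr/local/scala/bin   # add this line to ~/.bashrc to make it permanent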

To check whether Scala was installed successfully, run the following command:
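$ scala -version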

If Scala is installed successfully, you will see output similar to:

Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

 

Step 3: Install Git

The next step is to install Git. The command is given below.
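Git is needed here if you want to clone and build Spark from source in the next step. On a Debian/Ubuntu system, for example, it can be installed from the package manager (other distributions have an equivalent package):

$ sudo apt-get install git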

 

Step 4: Build Spark

The final step is to download the latest version of Spark from spark.apache.org and build it.
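The exact commands depend on the Spark version and on whether you build from source or use a pre-built package. As a rough sketch, one common route that makes use of the Git installation from Step 3 is to clone the source repository and build it with the bundled sbt script (the build can take quite a while):

$ git clone https://github.com/apache/spark.git
$ cd spark
$ ./build/sbt assembly    # builds Spark from source; expect this to take some time
$ ./bin/spark-shell       # quick smoke test: launches the interactive Spark shell

Alternatively, you can download a pre-built package from spark.apache.org/downloads.html, extract it, and add its bin directory to your PATH; no build step is needed in that case.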

For a more detailed walkthrough, you can check out the video.

 

Conclusion:

The purpose of sharing this post is to help you understand the basics of Apache Spark. Through this post, we've studied how Apache Spark came into existence, what Spark is, and how it improves on the MapReduce approach. We also looked at how to install Spark successfully. Hopefully, this post serves its purpose by providing you with what you were looking for. Your feedback is always welcome. If you have any questions or suggestions, what are you waiting for? Write to us through the comment box.

