Blog Posts

End-To-End ML Pipeline using Pyspark and Databricks (Part I)

Blog: Indium Software - Big Data

Primary questions to answer in this article are:

1. What is Databricks?

2. Why Databricks?

3. Initial Setup of Databricks?

What is Databricks?

DataBricks is an organization and big data processing platform founded by the creators of Apache Spark.

DataBricks was founded to provide an alternative to the MapReduce system and provides a just-in-time cloud-based platform for big data processing clients.

DataBricks was created for data scientists, engineers and analysts to help users integrate the fields of data science, engineering and the business behind them across the machine learning lifecycle. This integration helps to ease the processes from data preparation to experimentation and machine learning application deployment.

According to the company, the DataBricks platform is a hundred times faster than the open source Apache Spark. By unifying the pipeline involved with developing machine learning tools, DataBricks is said to accelerate development and innovation and increase security. Data processing clusters can be configured and deployed with just a few clicks. The platform includes varied built-in data visualization features to graph data.

Databricks is integrated with Microsoft Azure, Amazon Web Services, and Google Cloud Platform, making it easy for businesses to manage a colossal amount of data and carry out Machine Learning tasks.

DataBricks is headquartered in San Francisco, California and was founded by Ali Ghodsi, Andy Konwinshi, Scott Shenker, Ion Stoica, Patrick Wendell, Reynold Xin and Matei Zaharia.

Why Databricks?

After getting to know What is Databricks, let us also get started with some of its key features. Below are a few benefits of Databricks:

Initial Setup of Databricks

Step 1. Search for Databricks in Google Market place and subscribe for 14 day free trial.

Step 2. After starting the trial subscription, you will receive a link from the Databricks menu item in Google Cloud Platform. This is to manage setup on the Databricks hosted account management page.

Step 3. After this step, you must create a Workspace which is the environment in Databricks to access your assets. For this, you need an external Databricks Web Application

Step 4: To create a workspace, you need three nodes Kubernetes clusters in your Google Cloud Platform project using GKE to host the Databricks Runtime, which is your Data plane.

Step 5: We need to create a cluster to start working with Spark/Scala, a cluster is a combination of computation resources and configurations on which you can run jobs and notebooks. Some of the workloads that you can run on a Databricks Cluster include Streaming Analytics, ETL Pipelines, Machine Learning, and Ad-hoc analytics.

Step 6: We need to analyze the data on which we are going to develop a machine learning model. Upload the data, this creates a new table in the workspace.

The table can be created in two ways one through code and two through the UI, in this case we are using UI where we get options to change datatypes, Inferschema, first row as header, delmiter, etc.

File Location is under DBFS filepath : /FileStore/tables/<file_name>. This filepath will be used to read the data.


This blog gives a brief introduction about databricks and features but there are more features which are not covered in this, which would be covered in Part II of this blog.

After understanding completely What is Databricks, what are you waiting for! Get started! Companies need to understand what the insights of the data are, and how the data is stored and managed in Databricks.

In upcoming articles we would be covering reading data using pySpark, analyzing the data, EDA and model building using pyspark.

Please see the part 2 : The End-To-End ML Pipeline using Pyspark and Databricks (Part II)

The post End-To-End ML Pipeline using Pyspark and Databricks (Part I) appeared first on Indium Software.

Leave a Comment

Get the BPI Web Feed

Using the HTML code below, you can display this Business Process Incubator page content with the current filter and sorting inside your web site for FREE.

Copy/Paste this code in your website html code:

<iframe src="" frameborder="0" scrolling="auto" width="100%" height="700">

Customizing your BPI Web Feed

You can click on the Get the BPI Web Feed link on any of our page to create the best possible feed for your site. Here are a few tips to customize your BPI Web Feed.

Customizing the Content Filter
On any page, you can add filter criteria using the MORE FILTERS interface:

Customizing the Content Filter

Customizing the Content Sorting
Clicking on the sorting options will also change the way your BPI Web Feed will be ordered on your site:

Get the BPI Web Feed

Some integration examples