Blog Posts

Bootstrap a Modern Data Stack in 5 minutes with Terraform


What is a Modern Data Stack, and how do you deploy one? This guide will motivate you to start on this journey, with setup instructions for Airbyte, BigQuery, dbt, Metabase, and everything else you need, all provisioned using Terraform.

What is a Modern Data Stack

A Modern Data Stack (MDS) is a stack of technologies that makes a modern data warehouse perform 10–10,000x better than a legacy data warehouse. Ultimately, an MDS saves time, money, and effort. The four pillars of an MDS are a data connector, a cloud data warehouse, a data transformer, and a BI & data exploration tool.

Easy integration is made possible by managed and open-source tools that ship hundreds of pre-built, ready-to-use connectors. Pipelines that used to take a team of data engineers to build and maintain can now, for simple use cases, be replaced with a single tool. Managed solutions like Stitch and Fivetran, together with open-source solutions like Airbyte and Meltano, are making this happen.

Using a cloud-based columnar data warehouse has been the trend recently due to its high performance and cost-effectiveness. Instead of paying $100K per year for an on-premise MPP (massively parallel processing) database, you can start paying from $100 (or less) per month. Cloud-native data warehouses are said to be 10–10,000x faster than a traditional OLTP database. Popular options in this category are BigQuery, Snowflake, and Redshift.

In the old days, processing data inside the data warehouse was the bottleneck due to the technology’s limitations. As a result, companies had to favor ETL instead of ELT to reduce the workload of the data warehouse. With the advancement of cloud-native data warehouses, however, many in-data-warehouse transformation tools are becoming popular. Most notable in this category are dbt (data build tool) and Dataform.

BI tools used to take care of some transformations, too, to reduce the workload on legacy data warehouses. With the modern data stack, however, BI tools' focus has shifted (in my opinion) toward democratizing data access, self-service, and data discovery. Some tools that I think are heading in the right direction are Looker, Metabase, and Superset.

Our architecture

Getting started with the Modern Data Stack can be daunting as many different tools and processes are involved. This article aims to help you get started on this journey as seamlessly as possible. There are many preparation steps, but it only takes five minutes to spin up all resources once you are done.

We will be using Terraform, an open-source infrastructure-as-code tool, to provision everything in Google Cloud. If you follow the instructions below, you will end up with a Google Cloud project containing BigQuery datasets, a service account for each tool, and GCE instances running Airbyte, Airflow, and Metabase.

Get Started

Create a Google Cloud account and enable billing

The Terraform code in this project will interact with the Google Cloud Platform. Therefore, our first step is to create a Google account and enable billing. Note the billing ID on the Billing page; it has the following format: ######-######-######. You will need this value in the next step.
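If you already have the gcloud CLI installed (see the next step), you can also look up your billing account ID from the terminal. Depending on your gcloud version, this command may require the beta component:

# Lists billing accounts; the ACCOUNT_ID column has the ######-######-###### format
gcloud billing accounts list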

Install Google Cloud CLI

Install the Google Cloud SDK by following the instructions here for your OS. After you have the gcloud CLI installed, run the following command in a terminal window and follow the instructions. This will let Terraform use your application default credentials for authentication.

gcloud auth application-default login

Install Terraform

Follow the instructions here to install the Terraform CLI locally. Run the following command afterward to check your installation:

terraform -v

You should see something like this:

Terraform v1.0.0
on darwin_amd64
+ provider registry.terraform.io/hashicorp/google v3.71.0

Fork or clone this repo locally

You can fork this repo to your account or clone it to your local machine. To clone the repo, run the following:

git clone https://github.com/tuanchris/modern-data-stack
cd modern-data-stack

Create a terraform.tfvars file

Create a terraform.tfvars file with the following content:

# The billing ID from the first step (quoted, since HCL values are strings)
billing_id = "######-######-######"
# The folder ID of where you want your project to be under
# Leave this blank if you use a personal account
folder_id  = ""
# The organization ID of where you want your project to be under
# Leave this blank if you use a personal account
org_id     = ""
# The ID of the project to create
project_id = ""

Warning: These are considered sensitive values. Do not commit this file and the *.tfstate files to a public repo.
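One simple safeguard is to ignore these files in git. A minimal .gitignore sketch (the repo may already include something like this):

# Keep secrets and local Terraform state out of version control
*.tfvars
*.tfstate
*.tfstate.*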

Customize the values in variables.tf

The variables in variables.tf control the configuration of the resources.


You can customize the machine type for each service by changing the variables. If you don't want a particular service, comment it out in the gce.tf file.

You can also create separate datasets for your source systems by adding them to the source datasets map, as sketched below.
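As a rough illustration, the variables might look something like this. The names and defaults here are hypothetical, so check variables.tf in the repo for the actual definitions:

# Hypothetical sketch — the real names live in variables.tf
variable "airbyte_machine_type" {
  type    = string
  default = "e2-medium" # resize to taste
}

# One BigQuery dataset is created per entry in this map
variable "source_datasets" {
  type    = map(string)
  default = {
    app_db    = "Raw data replicated from the application database"
    marketing = "Raw data from marketing tools"
  }
}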

Create a modern data stack

Finally, to provision all these resources on Google Cloud, run the following command:

terraform apply


Study the output in the terminal to make sure that all resource settings are what you want them to be. Type yes and hit enter.
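If you would like to review the execution plan without being prompted to apply anything, you can run Terraform's plan command first; this is standard Terraform behavior, not specific to this repo:

terraform plan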

Terraform will create a Google Cloud project with our modern data stack. Provisioning takes about 2–3 minutes, and the services take another 2–3 minutes to finish installing on the VM instances, so the whole process takes five minutes or less.

Using the modern data stack

Retrieve service accounts for different services


Google recommends using a separate service account for each service. The Terraform code in this project already creates one for each tool. To retrieve the key for a particular service, run the following command:

terraform output [service_name]_sa_key

The default permission for all these accounts is roles/bigquery.admin. You can customize this in the iam.tf file.
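For reference, granting such a role in Terraform typically looks like the following. This is a generic sketch of the google_project_iam_member resource, not the exact code from iam.tf:

# Generic sketch — see iam.tf in the repo for the real definitions
resource "google_project_iam_member" "airbyte_bq_admin" {
  project = var.project_id
  role    = "roles/bigquery.admin"
  member  = "serviceAccount:${google_service_account.airbyte.email}"
}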

The value you get back is base64 encoded. To turn it back into JSON, run the following command:

echo "[value from the previous command]" | base64 -d

You can use the decoded JSON service account key to authenticate each service's access to your project's resources, as sketched below.
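For example, a minimal sketch that writes the decoded key to a file and points Google client libraries at it (assuming the airbyte_sa_key output from above, and Terraform 0.15+ for the -raw flag):

# Decode the key and save it to a file (keep this file out of version control)
terraform output -raw airbyte_sa_key | base64 -d > airbyte-sa.json
# Most Google client libraries pick up credentials from this variable
export GOOGLE_APPLICATION_CREDENTIALS="$PWD/airbyte-sa.json"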

Warning: Anyone with this service account key can access your project.

Ingest data with Airbyte

Airbyte is an excellent open-source data integration tool. To access the Airbyte UI, first get the gcloud SSH command from the Google Cloud console (on the VM instances page, open the SSH dropdown next to the instance and choose "View gcloud command").


You will get a command similar to this:

gcloud beta compute ssh --zone "asia-southeast1-a" "tf-airbyte-demo-airbyte" --project "tf-airbyte-demo"

Next, add the following to the command to port-forward the Airbyte UI locally:

-- -L 8000:localhost:8000 -L 8001:localhost:8001 -N -f

Your final command will look like this:

gcloud beta compute ssh --zone "asia-southeast1-a" "tf-airbyte-demo-airbyte"  --project "tf-airbyte-demo" -- -L 8000:localhost:8000 -L 8001:localhost:8001 -N -f

Note: Be sure to delete the newline character after copying from the GCP UI.

If the Airbyte instance has finished starting up, you can access it in your browser at localhost:8000. If not, wait five minutes for the instance to complete the installation.
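To check from the terminal whether the UI is up yet (assuming the port-forward from the previous step is active), something like this works:

# Prints 200 once the Airbyte UI is responding on the forwarded port
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000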


Now you can integrate your sources, add a BigQuery destination using the airbyte_sa_key, and have your data in BigQuery in no time.

You can access the Airbyte installation at /airbyte/ inside the VM.

Model data with dbt

dbt (data build tool) is a powerful open-source data transformation tool that uses SQL. It enables Data Analysts to do work previously reserved for Data Engineers, and it helped create an entirely new role called Analytics Engineer, a hybrid of a Data Analyst and a Data Engineer. You can read more about the role in my blog here.


Unlike Airbyte, Airflow, and Metabase, you don’t need a server to run dbt. You can register for a free (forever) 1-seat account by visiting their website.
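Alternatively, if you would rather run dbt Core locally than use dbt Cloud, here is a minimal sketch (assuming a recent dbt release, 1.0+, with the BigQuery adapter; the project name is illustrative):

# Install dbt Core with the BigQuery adapter
pip install dbt-bigquery
# Scaffold a project and follow the prompts
dbt init my_dbt_project
# After configuring your BigQuery profile, run the models
dbt run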

Orchestrate workflow with Airflow

Airflow is a battle-tested workflow orchestration tool originally created at Airbnb. With a modern data stack, hopefully, you won't have to use Airflow much. However, in cases where customization is needed, Airflow can be your go-to tool.

To access the UI, get the SSH command as in the Airbyte section above. Use the following command to port-forward:

gcloud beta compute ssh --zone "asia-southeast1-a" "tf-airbyte-demo-airflow"  --project "tf-airbyte-demo" -- -L 8080:localhost:8080 -N -f

Now you can access the Airflow installation at localhost:8080. The default username & password are admin and admin.


You can access the Airflow installation at /airflow/ inside the VM.

Visualize data with Metabase

Metabase is an open-source data visualization and discovery tool. It is super user-friendly and easy to get started with.

To access the Metabase UI, get the SSH command as in the Airbyte section above. Then, use the following command to port-forward:

gcloud beta compute ssh --zone "asia-southeast1-a" "tf-airbyte-demo-metabase"  --project "tf-airbyte-demo" -- -L 3000:localhost:3000 -N -f


Clean up

To avoid unwanted costs, be sure to clean up the resources created in this project by running:

terraform destroy

Warning: This will delete any persisted data and resources in the project. Alternatively, you can stop unused GCE instances to save costs, as sketched below.
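For example, stopping a single instance looks like this. The instance and project names below are the demo ones from earlier and will differ in your setup:

# Stop (but don't delete) the Airflow VM to pause its compute charges
gcloud compute instances stop tf-airbyte-demo-airflow \
  --zone asia-southeast1-a --project tf-airbyte-demo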
