What is AWS Glue?

AWS Glue has grown in popularity as more businesses adopt managed data integration services. It is used mainly by data engineers and ETL developers to create, run, and monitor ETL workflows.

Before moving on to AWS Glue, it is worth brushing up on your ETL concepts; refer to our blog What is ETL for more details. We’ll go through the following topics in depth in this AWS Glue blog:


What is AWS Glue?

AWS Glue is a serverless data integration and ETL service that makes it simple to discover, prepare, and combine data for analytics, machine learning, and application development. To make the data integration process smoother, Glue offers both visual and code-based tools.

AWS Glue consists of three components: the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a configurable scheduler that handles dependency resolution, job monitoring, and retries.

The Glue Data Catalog allows users to quickly locate and retrieve data. The Glue service also lets you customize, orchestrate, and monitor complex data streams.


AWS Glue Pricing

AWS Glue pricing starts at $0.44. There are four distinct plans available here:

There is no free plan for running jobs on the Glue service in AWS. It costs about $0.44 per DPU (data processing unit) per hour. At that rate, keeping two DPUs running around the clock would cost about $21 per day (2 × 24 × $0.44 ≈ $21.12). However, pricing can vary by region.
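
The arithmetic behind such an estimate is easy to script. The snippet below is a minimal sketch that assumes a flat $0.44 per DPU-hour rate, as quoted in this post; the function name `estimate_cost` is made up for illustration, and the sketch ignores regional differences and any billing minimums.

```python
# Rough AWS Glue cost estimate based on the per-DPU-hour rate quoted in this post.
# Assumption: a flat rate of 0.44 USD per DPU-hour; real rates vary by region.

DPU_HOUR_RATE_USD = 0.44


def estimate_cost(dpus: int, hours: float, rate: float = DPU_HOUR_RATE_USD) -> float:
    """Return the estimated cost in USD for running `dpus` DPUs for `hours` hours."""
    if dpus < 0 or hours < 0:
        raise ValueError("dpus and hours must be non-negative")
    return round(dpus * hours * rate, 2)


if __name__ == "__main__":
    print(estimate_cost(dpus=2, hours=24))    # two DPUs running for a full day
    print(estimate_cost(dpus=10, hours=0.5))  # a 30-minute job on ten DPUs
```

Plugging in your own DPU count and expected run time gives a quick first estimate before you check the official pricing for your region.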

 

When to Use AWS Glue?

Knowing what AWS Glue is is not enough; you should also know where to use it. Here are some AWS Glue use cases you need to consider.

 

Features of AWS Glue

AWS Glue offers all of the features you’ll need for data integration, so that you can obtain insights and put them to use in minutes rather than months. The following are some features you need to know.

 

AWS Glue Components

Before understanding the architecture of Glue, we need to know about a few components. To design and maintain your ETL workflow, AWS Glue relies on the interaction of multiple components. The following are the key components of Glue architecture.

AWS Glue Data Catalog
The Glue Data Catalog is the persistent metadata store of AWS Glue. It holds table definitions, job definitions, and other control information needed to manage your Glue environment. Each AWS account has one Glue Data Catalog per region.

Classifier
A classifier determines the schema of your data. AWS Glue provides classifiers for common relational database management systems and file types, such as CSV, JSON, AVRO, XML, and others.

Connection
An AWS Glue connection is the Data Catalog object that holds the properties needed to connect to a particular data store.

Crawler
A crawler is a program that connects to one or more data stores in a single run. It determines the schema of your data using a prioritized list of classifiers and then creates metadata tables in the Glue Data Catalog.
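
The idea of a prioritized list of classifiers can be sketched in plain Python. This is a simplified illustration and not the AWS Glue API: the classifier functions, the `crawl` helper, and the sample data are all hypothetical, and real classifiers are more elaborate than a first-match rule.

```python
import csv
import io
import json
from typing import Callable, Dict, List, Optional

Schema = Dict[str, str]  # column name -> type name
Classifier = Callable[[str], Optional[Schema]]


def json_classifier(sample: str) -> Optional[Schema]:
    """Recognize a single JSON object and infer a type name per key."""
    try:
        record = json.loads(sample)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    return {key: type(value).__name__ for key, value in record.items()}


def csv_classifier(sample: str) -> Optional[Schema]:
    """Recognize comma-separated text with a header row; treat columns as strings."""
    rows = list(csv.reader(io.StringIO(sample)))
    if len(rows) < 2 or len(rows[0]) < 2:
        return None
    return {name: "string" for name in rows[0]}


def crawl(sample: str, classifiers: List[Classifier]) -> Optional[Schema]:
    """Try classifiers in priority order; the first one that recognizes the data wins."""
    for classify in classifiers:
        schema = classify(sample)
        if schema is not None:
            return schema
    return None


# A dict stands in for the metadata tables a crawler would write to the catalog.
catalog = {
    "orders": crawl("id,amount\n1,9.99\n", [json_classifier, csv_classifier]),
    "events": crawl('{"user": "ann", "clicks": 3}', [json_classifier, csv_classifier]),
}
print(catalog)
```

The order of the list matters: a sample that several classifiers could accept is described by whichever one comes first.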

Database
A database is a set of associated Data Catalog table definitions organized into a logical group.

Data Store
A data store is a repository where your data is stored persistently. Relational databases and Amazon S3 buckets are two examples.

Data Source
A data source is a data store that is used as input to a process or transformation.

Data Target
A data target is a data store to which the job writes the transformed data.

Transform
A transform is the code logic used to manipulate your data into a different format.

Development Endpoint
A development endpoint is an environment you can use to develop and test your AWS Glue ETL scripts.

Dynamic Frame
A DynamicFrame is similar to a Spark DataFrame, except that each record is self-describing, so no schema is required up front. In addition, DynamicFrames come with a suite of advanced transformations for data cleansing and ETL.
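
To see why self-describing records are useful, consider data in which the same field shows up with different types. The sketch below uses plain Python dictionaries rather than the `awsglue` library, so the helper names `infer_schema` and `cast_field` are hypothetical; in a real Glue job, a DynamicFrame method such as `resolveChoice` plays a comparable role.

```python
from typing import Any, Dict, List, Set

Record = Dict[str, Any]


def infer_schema(records: List[Record]) -> Dict[str, Set[str]]:
    """Collect every type observed for each field across all records."""
    schema: Dict[str, Set[str]] = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


def cast_field(records: List[Record], field: str, to_type: type) -> List[Record]:
    """Return new records with `field` cast to `to_type` wherever it is present."""
    return [
        {**r, field: to_type(r[field])} if field in r else dict(r)
        for r in records
    ]


records = [
    {"id": 1, "price": 9.99},
    {"id": 2, "price": "12.50"},           # same field, different type
    {"id": 3, "price": 7, "coupon": "X"},  # extra field only on this record
]

print(infer_schema(records))   # "price" is seen with several types
cleaned = cast_field(records, "price", float)
print(infer_schema(cleaned))   # "price" now has a single type
```

No schema was declared before loading the records; it was derived from the data and then cleaned up in a separate step.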

Job
A job is the business logic required to perform ETL work. It consists of a transformation script, data sources, and data targets.

Trigger
A trigger starts an ETL job. Triggers can be set to fire at a scheduled time or in response to an event.
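
Both kinds of trigger can be described as plain data. The sketch below only builds dictionaries and does not call AWS. The field names follow our understanding of the Glue CreateTrigger request as exposed by boto3, so verify them against the API reference before use; the trigger and job names are hypothetical.

```python
from typing import Any, Dict


def scheduled_trigger(name: str, job_name: str, cron: str) -> Dict[str, Any]:
    """Describe a trigger that starts `job_name` on a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def conditional_trigger(name: str, job_name: str, after_job: str) -> Dict[str, Any]:
    """Describe a trigger that starts `job_name` once `after_job` succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {"LogicalOperator": "EQUALS", "JobName": after_job, "State": "SUCCEEDED"}
            ]
        },
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


# Hypothetical names: run "load-orders" nightly, then "build-report" after it succeeds.
nightly = scheduled_trigger("nightly-load", "load-orders", "0 2 * * ? *")
chained = conditional_trigger("after-load", "build-report", after_job="load-orders")
print(nightly["Schedule"])
print(chained["Predicate"]["Conditions"][0]["State"])
```

With valid credentials, a dictionary like `nightly` could be passed to `boto3.client("glue").create_trigger(**nightly)`; that call is left out here so the sketch runs without an AWS account.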

Notebook Server
A notebook server is a web-based environment for running PySpark commands. On a development endpoint, a notebook lets you develop and test ETL scripts interactively.

Script
A script is a piece of code that extracts data from sources, transforms it, and loads it into targets. AWS Glue generates PySpark or Scala scripts. It also provides notebooks and Apache Zeppelin notebook servers.
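
The extract, transform, load shape of such a script can be shown in a few lines. The sketch below is plain Python rather than a generated PySpark script, and it uses in-memory CSV text and JSON Lines output as stand-ins for a real data source and data target; all names and sample values are hypothetical.

```python
import csv
import io
import json
from typing import Dict, Iterable, Iterator


def extract(source: str) -> Iterator[Dict[str, str]]:
    """Read rows from CSV text (stands in for a data source)."""
    yield from csv.DictReader(io.StringIO(source))


def transform(rows: Iterable[Dict[str, str]]) -> Iterator[Dict[str, object]]:
    """Rename the id column and cast both values to numbers."""
    for row in rows:
        yield {"order_id": int(row["id"]), "amount": float(row["amount"])}


def load(rows: Iterable[Dict[str, object]]) -> str:
    """Serialize rows as JSON Lines (stands in for writing to a data target)."""
    return "\n".join(json.dumps(row) for row in rows)


source_csv = "id,amount\n1,9.99\n2,12.50\n"
output = load(transform(extract(source_csv)))
print(output)
```

Keeping the three stages as separate functions mirrors how a job separates its sources, its transformation logic, and its targets.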

Table
A table is the metadata definition that represents the data in a data store. A table in the Data Catalog holds the names of columns, data type definitions, partition information, and other metadata about a base dataset. The data itself stays in the data store.
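
A table definition of this kind is just structured metadata. The sketch below builds one as a Python dictionary without calling AWS. The keys follow our understanding of the `TableInput` structure used by the Glue CreateTable API, so check them against the API reference; the table name, bucket, and columns are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Column = Tuple[str, str]  # (column name, data type)


def table_definition(
    name: str,
    location: str,
    columns: List[Column],
    partition_keys: List[Column],
) -> Dict[str, Any]:
    """Build metadata that describes a dataset; no data is read or copied."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Location": location,
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
        },
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partition_keys],
    }


# Hypothetical dataset: sales files in S3, partitioned by year.
sales = table_definition(
    name="sales",
    location="s3://example-bucket/sales/",
    columns=[("order_id", "bigint"), ("amount", "double")],
    partition_keys=[("year", "string")],
)
print(sales["StorageDescriptor"]["Columns"])
```

Note that the definition only points at the location of the data and describes its layout.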

Moving on, let’s see how AWS Glue works.

 

AWS Glue Architecture

The architecture of Glue is depicted in the figure below.

Architecture of AWS Glue

In AWS Glue, you define jobs that extract, transform, and load (ETL) data from a data source to a data target. The following are the steps you need to follow:


Advantages and Disadvantages of AWS Glue

Like anything else in the world of big data computing, AWS Glue has both advantages and disadvantages.

Here are some benefits of AWS Glue:

While Glue has a lot of interesting features, it also has certain drawbacks. So, we’ll look into some of the AWS Glue limitations.

 

Conclusion

In this post, we explored AWS Glue, a powerful cloud-based service for working with ETL pipelines. Working with it involves just three key phases. You begin by using crawlers to build a data catalog. Then you write the ETL code that the data pipeline requires. Finally, you schedule the ETL jobs.

We hope this blog has given you a complete understanding of AWS Glue.

If you still have any questions or concerns about this technology, please post them on the AWS Community.

The post What is AWS Glue? appeared first on Intellipaat Blog.
