
Cloud Data Lakes – Best Practices

Blog: NASSCOM Official Blog

BI tools have been the go-to for data analysts who help businesses track top-line, bottom-line and customer experience metrics. They analyze relatively small sets of relational data (a few terabytes) in a data warehouse, and their queries typically scan only a few gigabytes to execute.

But businesses are now looking beyond BI to interactive, streaming and clickstream analytics, machine learning and deep learning in order to gain a data-led advantage. For these types of analytics applications, data lakes are the preferred option. Data lakes can ingest data of any volume, variety and velocity, and stage and catalog it centrally. The data is then made available to a variety of analytics applications, at any scale, in a cost-efficient manner.

Let’s look at best practices in setting up and managing data lakes across three dimensions –

  1. Data ingestion
  2. Data layout
  3. Data governance

Cloud Data Lake – Data Ingestion Best Practices

Ingestion can be in batch or streaming form. The data lake must ensure zero data loss and write data exactly once or at least once. It must also handle schema variability, write data in the most optimized format into the right partitions, and provide the ability to re-ingest data when needed.
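As a rough illustration, here is a minimal PySpark Structured Streaming sketch of such an ingestion pipeline; the bucket paths, schema and column names are assumptions for the example, not taken from the article. The checkpointed file sink is what provides the exactly-once guarantee, and partitioning at write time lands the data in query-friendly partitions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("lake-ingestion").getOrCreate()

# Declare the expected schema explicitly so schema drift is caught at ingest time.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

# Hypothetical landing zone for raw JSON events.
raw = spark.readStream.schema(event_schema).json("s3://example-bucket/raw/events/")

# A checkpointed file sink gives end-to-end exactly-once delivery for this
# pipeline; partitioning by date writes data into the right partitions.
query = (raw
         .withColumn("event_date", col("event_time").cast("date"))
         .writeStream
         .format("parquet")
         .option("checkpointLocation", "s3://example-bucket/checkpoints/events/")
         .partitionBy("event_date")
         .outputMode("append")
         .start("s3://example-bucket/lake/events/"))

query.awaitTermination()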

Apart from batch and stream ingestion modes, data lakes must also provide for

Cloud Data Lake – Data Layout Best Practices

Data generation and data collection across semi-structured and unstructured formats are both bursty and continuous. Inspecting, exploring and analyzing these datasets in their raw form is tedious, because the analytical engines scan the entire dataset across multiple files. We recommend five ways to reduce the data scanned and cut query overheads –
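As a hedged sketch of two layout optimizations that commonly appear in such recommendations, the PySpark snippet below converts raw JSON into a partitioned, columnar (Parquet) layout and compacts small files; the paths and column names are illustrative assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

spark = SparkSession.builder.appName("lake-layout").getOrCreate()

# Hypothetical raw zone holding many small JSON files.
raw = spark.read.json("s3://example-bucket/raw/events/")

curated = (raw
           .withColumn("event_date", to_date("event_time"))
           # Compact the many small ingest files into fewer, larger files so
           # the query engine opens fewer objects per scan.
           .repartition("event_date"))

(curated.write
 .mode("overwrite")
 .partitionBy("event_date")   # enables partition pruning on date filters
 .parquet("s3://example-bucket/lake/events_curated/"))

# Engines that prune partitions and read only the referenced Parquet columns
# now scan a fraction of the dataset instead of every file.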

Managed data lakes can deliver autonomous data management capabilities to operationalize the aforementioned data layout strategy.

Cloud Data Lake – Data Governance Best Practices

With data lakes, multiple teams start accessing the data. There needs to be a strong focus on oversight, regulatory compliance and role-based access control, while still delivering a productive user experience. A single interface for configuration management, auditing, job reports and cost control is key. Here are three recommendations for data governance –

Discover Your Data

A data catalog helps users discover datasets and profile them for integrity; it enriches metadata through different mechanisms, documents datasets, and provides a search interface.
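For illustration, a minimal discovery pass can be written against Spark's catalog API when the lake tables are registered in a Hive-compatible metastore; the database, table and keyword below are assumptions.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("catalog-discovery")
         .enableHiveSupport()
         .getOrCreate())

# Walk the catalog and pull column-level metadata for each registered table.
for db in spark.catalog.listDatabases():
    for table in spark.catalog.listTables(db.name):
        columns = spark.catalog.listColumns(table.name, db.name)
        print(db.name, table.name, [c.name for c in columns])

# A very simple keyword search over table names, standing in for a richer
# search interface.
keyword = "events"
matches = [t.name for db in spark.catalog.listDatabases()
           for t in spark.catalog.listTables(db.name)
           if keyword in t.name]
print(matches)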

Regulatory And Compliance Needs

New or expanded data privacy regulations, such as GDPR and CCPA, have created new requirements around the Right to Erasure (Right to Be Forgotten). Therefore, the ability to delete specific subsets of data without disrupting a data management process is essential. In addition to the throughput of deletes themselves, you need support for special handling of PCI/PII data and auditability.
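A minimal sketch of such a targeted delete is shown below, assuming the lake tables use a format with record-level delete support such as Delta Lake (the article does not prescribe a table format); the table and column names are illustrative.

from pyspark.sql import SparkSession

# Assumes Spark is configured with Delta Lake and that lake.events is a Delta
# table; both are assumptions for this example.
spark = (SparkSession.builder
         .appName("gdpr-erasure")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Hypothetical data-subject identifier taken from an erasure request queue.
user_id = "user-123"

# Record-level delete: only the affected files are rewritten, and the table's
# transaction log provides an audit trail of the operation.
spark.sql("DELETE FROM lake.events WHERE user_id = '{}'".format(user_id))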

Permissioning And Financial Governance

Using Apache Ranger, an open source framework that provides granular table-, row- and column-level access control, architects can grant permissions against the user roles already defined in the identity and access management (IAM) solutions of cloud service providers. With wide-ranging usage, monitoring and audit capabilities are essential to detect access violations and flag adversarial queries. To give P&L owners and architects a bird's-eye view of usage, cost attribution and exploration capabilities are needed at the cluster, job and user level from a single interface.
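As a hedged example, a table- and column-level Ranger policy can be created through the Ranger Admin public REST API; the host, service name, group and credentials below are placeholder assumptions.

import requests

RANGER_URL = "http://ranger-admin.example.com:6080"   # placeholder Ranger Admin host

# Grant the "analysts" group SELECT on non-PII columns of lake.events.
policy = {
    "service": "hive_lake",                  # assumed Ranger Hive service name
    "name": "analysts-events-read",
    "resources": {
        "database": {"values": ["lake"]},
        "table": {"values": ["events"]},
        "column": {"values": ["event_id", "event_type", "event_time"]},
    },
    "policyItems": [{
        "accesses": [{"type": "select", "isAllowed": True}],
        "groups": ["analysts"],              # group mapped from the cloud IAM role
    }],
}

resp = requests.post(
    RANGER_URL + "/service/public/v2/api/policy",
    json=policy,
    auth=("admin", "changeme"),              # placeholder credentials
)
resp.raise_for_status()
print(resp.json())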

Conclusion

These data lake best practices can help you build a sustainable advantage from the data you collect. A cloud data lake can break down data silos and support multiple analytics workloads at scale and at lower cost.

P.S. – This article was first published on https://www.qubole.com/

