Modern Data Platform – How to build one?
Blog: Think Data Analytics Blog
A modern data platform is a set of cultural principles, tools and capabilities that enables organizations to fundamentally become data driven.
The mission is to create delightful customer experiences and democratize data and analytics for business outcomes. Modern Data Platform broadly solves for two kinds of customers.
These are external customers, or end users, who are enabled by personalized experiences across all touchpoints – web, mobile, email, etc. – and internal customers, who are provided frameworks, infrastructure and tools that help them access, manage, and store data.
The cultural principles require a transition to a Data Mesh with Data Domains that are enabled by federated ownership and autonomy to innovate through self-service tools.
The self-service tools are a set of capabilities, frameworks and infrastructure across operational and analytical systems, which include compute for data both in motion and at rest, and storage for transient and persistent data systems.
A modern architecture brings together the culture and capabilities and defines the interaction model of distinct data layers of Data Access Frameworks and Data Acquisition Frameworks to feed data as Entities and Events into a Lakehouse for real time big data processing.
On the two sides of the Data Platform are Data Producers and Data Consumers. On the left are data producers which generate data from customer interactions.
A well-architected digital website or product is generally decomposed into application domains which are powered by Microservices that generate domain-specific data.
Some examples of domains are “Sign-Up” which usually deals with customers signing up, “Cart” which models a shopping cart, “Payments” that captures the payment information and “Orders” which deals with shopping orders.
These interactions usually generate data which is captured in source operational and transactional data systems.
On the right are data consumers. Data is organized into Data Domains to provide autonomy, speed, and accountability. Consumers are data analysts who generate insights, data scientists who build predictive models and business users who interact with KPI dashboards.
There exists a flywheel between producers and consumers orchestrated by the Data Platform where source data flows from application domains into the Data Platform which in turn enriches it to create derived data that flows back to the application domains.
Double clicking into the Data Platform, we have three pillars, each catering to distinct personas:
1. Operational Data – which provides robust, resilient, and integrated polyglot persistence and transient ecosystem for Application Developers. Data Platform needs to solve for the complexity by abstracting the data access to the operational stores through frameworks and patterns so that the application developer can focus on business logic.
2. Data-in-motion – which provides tools to build data pipelines for Data Engineers that can accurately move data across operational data stores and big data ecosystems connecting data producers to data consumers. The data engineers are looking for self-service and cost-effective messaging, streaming and batch capabilities to process, enrich, transform data as it moves from source systems into destinations.
3. Big Data – which provides capabilities for analytics and machine learning to enable Analysts, Scientists and Machine Learning engineers so they can train, build, and deploy models for the business use cases. Analysts are looking for near real time solutions to generate insights, while data scientists want to develop and deploy models to perform batch and real time inferences.
Foundations – At the bottom is a horizontal layer that enables the three pillars. The foundational layer includes key “ilities” – availability, reliability, quality, cost, governance, compliance, privacy, and security. These are fundamental for the operation of a modern data platform.
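To make the first pillar concrete, here is a minimal sketch of what "abstracting the data access" can look like from the application developer's side. The names `Order`, `OrderStore`, `InMemoryOrderStore` and `place_order` are hypothetical and exist only for illustration; a real platform would put a SQL, NoSQL or cache client behind the same interface.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass(frozen=True)
class Order:
    order_id: str
    customer_id: str
    total_cents: int


class OrderStore(Protocol):
    """Hypothetical interface the platform exposes to application code."""

    def get(self, order_id: str) -> Optional[Order]: ...

    def put(self, order: Order) -> None: ...


class InMemoryOrderStore:
    """Stand-in backend; a real one would wrap a database client."""

    def __init__(self) -> None:
        self._rows: Dict[str, Order] = {}

    def get(self, order_id: str) -> Optional[Order]:
        return self._rows.get(order_id)

    def put(self, order: Order) -> None:
        self._rows[order.order_id] = order


def place_order(store: OrderStore, order: Order) -> Order:
    """Business logic depends only on the interface, not on the backend."""
    if order.total_cents <= 0:
        raise ValueError("order total must be positive")
    store.put(order)
    return order


store = InMemoryOrderStore()
place_order(store, Order("o-1", "c-42", 1999))
print(store.get("o-1"))
```

Because `place_order` only knows about `OrderStore`, the platform team can swap the backing store, or add partitioning and failover behind it, without touching the business logic.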
Customer Benefits – “It is always about the customer”
The big question that gets asked is what you are solving for in a modern data platform. What are the customer benefits? The true north star is to make the customer experience delightful through a greater understanding of customer behavior and friction points. Customers need to feel confident because the product understands their needs, and interactions should be fast because AI predicts their intent.
The customer benefits boil down to the following 10.
- Speed – Enable application developers, data engineers, and data scientists to innovate with speed to serve the end customers.
- Simplicity – Provide simple easy to use frameworks, capabilities, and patterns to access best of breed data services ideally agnostic of any vendor.
- Efficiency – Modernize legacy workloads to erase technical debt, realizing hybrid cloud elasticity, scale, and efficiency.
- Cost – Optimize data placement and workloads to reduce cost while ensuring high throughput.
- Open Data Access – Provide open interfaces to ingest and consume data – REST, File system, Storage or APIs.
- Virtual Data Access – Provide the ability to pull from disparate data sources across locations and infrastructures by providing management, operations, and discovery of data sets, as well as making data sets portable between infrastructures.
- Affinity – Ensure data availability for regional computing near the edge, with support for localized experiences, geographical distribution, and data sovereignty.
- Security and Compliance – Securing data is a priority. Security and compliance safeguards in conformance to the enterprise’s security and privacy policies must be embedded into the platform.
- Democratization – Democratize data through self-service tools, services and frameworks for data access, data acquisition and movement.
- Business Continuity – Enable reliability and availability of data via support for in-region and cross region disaster recovery for business continuity.
Looking at the spectrum of products in each of the three pillars:
- Data Access Frameworks – Provide transparent, optimized data access and integration with SQL and No-SQL systems and an entry point to the data ecosystem by capturing metadata and managing polyglot data systems transparently.
- SQL (RDBMS) – Referential integrity, immediate consistency, transactional patterns, and geo-distributed access.
- NoSQL – Key Value, Columnar, Document, Search, Graph Stores with tunable consistency and horizontal scaling – and NewSQL which combines strong consistency from RDBMS and scalability, global distribution and synchronous replication from NoSQL.
- Cache – High-Volume, low-latency data access in memory stores with optional durability.
- Data Acquisition Frameworks – Enable seamless data acquisition and distribution and act as an interface with the data-in-motion layer.
- Messaging – Tools and technology to address both distributed (broadcast) and transactional (point-to-point) messaging.
- Streaming – Self-service stream processing platform to support real-time analytics and data-marts.
- Schema Registry – Cross component schema registration to model key business entities and business state transition events for streaming and batch workloads.
- Data Pipelines – ETL and ELT tools to move data in real time or in batches.
- File Transfer – Tools for ad hoc/batch movement of files across geos and data centers.
- Data Quality – Frameworks to monitor, detect and resolve data quality issues.
- Data Lake (Storage) – A repository that provides storage for all structured, semi-structured and unstructured data.
- Processing – Big data processing using frameworks like Hadoop and Spark. The processing is getting decoupled from storage.
- Machine Learning – Collaborative platform for data scientists to train and deploy models.
- Data Catalog – Provides a unified view of all data assets by capturing all metadata. Enables data discovery.
- Data Query – Interactively analyze petabytes of data using SQL-like queries.
- Tools – Self-service BI and AI tools (including Notebooks) to create visualizations, generate insights and build predictive models.
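The Messaging item in the list above distinguishes distributed (broadcast) from transactional (point-to-point) delivery. The difference is easy to miss in prose, so here is a toy in-memory sketch of the two semantics. `BroadcastTopic` and `PointToPointQueue` are hypothetical names, not the API of any real broker, and the sketch ignores durability, ordering guarantees and acknowledgements.

```python
from collections import deque
from typing import Callable, Deque, List, Optional

Handler = Callable[[str], None]


class BroadcastTopic:
    """Every subscriber receives its own copy of each message."""

    def __init__(self) -> None:
        self._subscribers: List[Handler] = []

    def subscribe(self, handler: Handler) -> None:
        self._subscribers.append(handler)

    def publish(self, message: str) -> None:
        for handler in self._subscribers:
            handler(message)


class PointToPointQueue:
    """Each message is handed to exactly one receiver, then removed."""

    def __init__(self) -> None:
        self._messages: Deque[str] = deque()

    def send(self, message: str) -> None:
        self._messages.append(message)

    def receive(self) -> Optional[str]:
        return self._messages.popleft() if self._messages else None


# Broadcast: both consumers see the same event.
analytics: List[str] = []
billing: List[str] = []
topic = BroadcastTopic()
topic.subscribe(analytics.append)
topic.subscribe(billing.append)
topic.publish("order-created:o-1")

# Point-to-point: the first receiver takes the message, the second finds none.
queue = PointToPointQueue()
queue.send("charge:o-1")
first = queue.receive()
second = queue.receive()
```

Broadcast suits fan-out of business events to many consumers, while point-to-point suits work items that must be processed once.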
The modern data platform also needs to prescribe how data flows from producers to consumers. There are several challenges in the current architecture.
- Application domains usually interact directly with operational data systems. This can limit scale and infrastructure portability, and can create vendor lock-in. It also means that important problems of data partitioning, high availability (HA) and disaster recovery (DR) need to be solved multiple times.
- The data movement is through point-to-point data integrations using multiple disparate methods like change data capture, log replication, custom queues, and DB batch extracts.
- The data usually is pushed into the lake using data pipelines that are centrally managed and loosely governed with lack of schema for business entities and events. This shifts the data governance and data quality problems downstream where data needs to be cleaned before being consumed.
- Data ingestion follows the traditional data warehouse pattern, where centralized data pipelines move data from source operational systems into a monolithic enterprise lake. Once the data lands in the lake, it undergoes layers of batch processing and transformation and is then moved to a data warehouse. Consumer-specific pipelines power data scientists, analysts, and business intelligence.
The modern data platform solves for most of these challenges. You first need a cultural shift to move away from centralized ownership of data pipelines, ingestion, transformation, and storage in a monolithic lake, to a Data Mesh that supports decentralized domain-specific data views with each domain handling its own data pipelines. Data mesh is more about the culture than technology.
Data mesh at the simplest level is microservices for the data platform. Data meshes provide a solution to the shortcomings of data lakes by allowing greater autonomy and flexibility for data owners, facilitating greater data experimentation and innovation while lessening the burden on data teams to field the needs of every data consumer through a single pipeline.
Data mesh has four key principles:
- Domain-oriented decentralized data ownership – decentralizes data ownership with domain data product owners who are held accountable for providing their data as products. The domain-driven design and bounded context concepts, which have been applied in architecting Microservices in operational systems, are applied in the context of data by moving away from a centralized data lake to decentralized domains.
- Data as a product – provides high-quality data with clear service level objectives (SLO) to the consumers. This ensures that clean data flows from the source systems into the lake.
- Federated governance – provides autonomy to domain data product owners to model and evolve their domains around a standard set of metadata rules which can be programmatically executed by the platform. The rules also cover how to model Entities – with Polysemes which are data elements that cross the boundaries of multiple domains – and Events which capture business state changes. Entities and (business) Events are governed with each domain being responsible for data stewardship.
- Self-serve insights – allows analysts, business users and data scientists to consume data products autonomously. The key principles applied to the domain model are that data is discoverable, addressable, trustworthy, interoperable and secure.
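The federated governance principle above talks about metadata rules that "can be programmatically executed by the platform". A rough sketch of that idea follows: each rule is a small function that inspects a data product descriptor and returns a violation message or nothing. The `DataProduct` fields and the three rules are assumptions chosen for illustration, not a standard.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class DataProduct:
    """Hypothetical descriptor a domain publishes for each data product."""

    name: str
    domain: str
    owner: str
    schema_version: Optional[str] = None
    freshness_slo_minutes: Optional[int] = None


# A rule returns a violation message, or None when the product complies.
Rule = Callable[[DataProduct], Optional[str]]


def has_owner(product: DataProduct) -> Optional[str]:
    return None if product.owner.strip() else "owner is required"


def has_freshness_slo(product: DataProduct) -> Optional[str]:
    slo = product.freshness_slo_minutes
    return None if slo is not None and slo > 0 else "a positive freshness SLO is required"


def has_schema_version(product: DataProduct) -> Optional[str]:
    return None if product.schema_version else "a registered schema version is required"


# Rules defined once by the federation, applied to every domain's products.
GLOBAL_RULES: List[Rule] = [has_owner, has_freshness_slo, has_schema_version]


def check_product(product: DataProduct, rules: Optional[List[Rule]] = None) -> List[str]:
    """Run every rule and collect the violations."""
    violations: List[str] = []
    for rule in (GLOBAL_RULES if rules is None else rules):
        message = rule(product)
        if message is not None:
            violations.append(message)
    return violations


orders = DataProduct(
    name="orders_daily",
    domain="orders",
    owner="orders-team",
    schema_version="1",
    freshness_slo_minutes=60,
)
print(check_product(orders))
```

The domain keeps autonomy over how it models its product, while the platform can block publication of anything for which `check_product` returns violations.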
In this model, each domain is responsible for managing its pipelines, both to power business intelligence and to create feedback loops back from big data to operational systems. The data platform in turn is responsible for providing each domain with a set of capabilities to move data – manage ingestion, cleaning, enrichment, and aggregation. The data platform provides:
- Scalable polyglot big data storage.
- Schema registry to manage and version data products.
- Pipeline orchestration tools to move data.
- Batch and streaming capabilities.
- Data catalog for data governance and lineage.
- Data discovery tools to query and access data.
Once the Data Mesh paradigm is established, the modern data platform needs to focus on defining standard ways to acquire, move and process data.
As data flows from left to right:
- The Data Access Layer provides transparent hybrid multi-cloud application persistence by storing the system of record and the state change which is propagated through change data capture (CDC).
- The Data Acquisition Layer transforms the message into a canonical domain Event or Entity object, which is validated against the schema registry – Data as an Event or Entity. An Entity is usually self-contained and fully encapsulates a domain object, while an Event is generated as part of a business process to signal a state change. The schema definitions for Entities and Events are modeled and governed through a schema registry.
- A logical domain layer in the lake allows data domains to enrich, process and transform data to create derived data (see Lakehouse below). The enriched data is pushed back to the source operational serving systems to create delightful experiences while the transformed data is fed through a data catalog to power all analytics and standardized reports.
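The acquisition step in the flow above can be sketched as follows: a change record arrives from CDC, gets mapped to a canonical Event, and is validated against a versioned schema before it moves on. Everything here is an assumption for illustration: the `SchemaRegistry` class is a toy in-memory stand-in rather than a real registry client, and the CDC record shape with an `after` field is hypothetical.

```python
from typing import Any, Dict, Tuple


class SchemaRegistry:
    """Toy in-memory registry: (subject, version) -> {field name: type}."""

    def __init__(self) -> None:
        self._schemas: Dict[Tuple[str, int], Dict[str, type]] = {}

    def register(self, subject: str, schema: Dict[str, type]) -> int:
        # Versions are assigned per subject, starting at 1.
        latest = max((v for (s, v) in self._schemas if s == subject), default=0)
        self._schemas[(subject, latest + 1)] = dict(schema)
        return latest + 1

    def validate(self, subject: str, version: int, payload: Dict[str, Any]) -> None:
        schema = self._schemas[(subject, version)]  # KeyError if not registered
        missing = schema.keys() - payload.keys()
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        for name, expected in schema.items():
            if not isinstance(payload[name], expected):
                raise ValueError(f"field {name!r} must be {expected.__name__}")


def to_order_placed_event(
    registry: SchemaRegistry, cdc_record: Dict[str, Any], version: int
) -> Dict[str, Any]:
    """Map a hypothetical CDC record to a canonical OrderPlaced event."""
    row = cdc_record["after"]
    event = {
        "event_type": "OrderPlaced",
        "order_id": row["id"],
        "customer_id": row["customer_id"],
        "total_cents": row["total_cents"],
    }
    registry.validate("orders.OrderPlaced", version, event)
    return event


registry = SchemaRegistry()
v1 = registry.register(
    "orders.OrderPlaced",
    {"event_type": str, "order_id": str, "customer_id": str, "total_cents": int},
)
record = {"after": {"id": "o-1", "customer_id": "c-42", "total_cents": 1999}}
print(to_order_placed_event(registry, record, v1))
```

Validating at the point of acquisition is what keeps malformed data out of the lake, instead of pushing the cleanup to downstream consumers.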
The logical domain layer in Big Data uses a physical paradigm called Lakehouse, which is an open data management architecture that combines the scale and cost efficiency of data lakes with the data management and ACID transactions of data warehouses.
The traditional data platforms follow a two-tier architecture: data is first ingested into the lake, and then ingested again into the warehouse. This creates several challenges and adds complexity in keeping the data lake and warehouse consistent.
Storage and compute costs are duplicated, since you pay for multiple hops and copies of data in proprietary formats. Additionally, data lakes lack basic management features such as transactions (ACID) and efficient access methods such as indexes to match data warehouse performance.
Furthermore, each stage adds latency as data is moved in batches using ETL pipelines. Finally, the consumption patterns are disjointed on lake and warehouse for data engineers and analysts respectively, while data scientists have limited tools across the two tiers.
The Lakehouse paradigm solves for these challenges:
- Simple (Storage) – It eliminates the two-tier architecture by bringing together the lake and warehouse with an open data format – Parquet, ORC, Hudi, Iceberg. This can deal with all kinds of data, namely unstructured, semi-structured and structured. Zero-copy cloning eliminates dual storage and compute, as there is no need to operationalize two copies of data in both a data lake and a warehouse.
- ACID – provides a metadata-driven layer with features such as transactions, ACID compliance, data versioning and rollbacks. The metadata layer defines which objects are part of a table version to support ACID transactions. Additionally, it provides schema enforcement to ensure that the data uploaded to a table matches its schema, and sets constraints on the ingested data. Finally, it provides governance features such as access control and audit logging.
- Freshness (Streaming) – data is available in milliseconds for continuous stream processing and analytics, with growing support for running SQL on streams.
- Unified consumption – All the key personas namely data engineers, scientists, analysts, and business users are enabled through a unified consumption layer. The bronze layer is the raw enriched data, the silver layer is transformed catalog data while the gold layer enables domain data to be exposed as products to consumers to power analytics, business intelligence and reporting.
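The ACID point above says the metadata layer "defines which objects are part of a table version". A toy model helps show why that gives you atomic commits, versioned reads and rollback on top of plain files. `ToyTable` below is a deliberately simplified, hypothetical in-memory sketch; it is not how Hudi or Iceberg are implemented, and it ignores concurrency entirely.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


class ToyTable:
    """Immutable data files plus a log of which files form each version."""

    def __init__(self, schema: Dict[str, type]) -> None:
        self._schema = dict(schema)
        self._files: Dict[str, List[Row]] = {}  # file name -> rows
        self._versions: List[List[str]] = [[]]  # version number -> file names

    @property
    def current_version(self) -> int:
        return len(self._versions) - 1

    def commit(self, rows: List[Row]) -> int:
        # Schema enforcement: check the whole batch before anything is visible.
        for row in rows:
            if row.keys() != self._schema.keys():
                raise ValueError(f"columns {sorted(row)} do not match the schema")
            for column, expected in self._schema.items():
                if not isinstance(row[column], expected):
                    raise ValueError(f"column {column!r} must be {expected.__name__}")
        file_name = f"part-{len(self._files):05d}"
        self._files[file_name] = list(rows)
        # The commit is a single metadata append: old files plus the new one.
        self._versions.append(self._versions[-1] + [file_name])
        return self.current_version

    def read(self, version: Optional[int] = None) -> List[Row]:
        v = self.current_version if version is None else version
        return [row for name in self._versions[v] for row in self._files[name]]

    def rollback(self, version: int) -> int:
        # Rollback is a new version that points at an older file list.
        self._versions.append(list(self._versions[version]))
        return self.current_version


table = ToyTable({"id": int, "amount": float})
table.commit([{"id": 1, "amount": 9.5}])
table.commit([{"id": 2, "amount": 1.0}])
print(table.read())           # latest version
print(table.read(version=1))  # time travel to the first commit
```

Readers only ever see a complete version, because a batch either passes validation and is appended to the log in one step, or it is rejected and the log is untouched.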
To summarize, in order to build a modern data platform, you need to culturally move your organization towards a Data Mesh. Invest in data layers which capture clean data from the sources through federated governance of Entities and Events, using frameworks for Data Access & Acquisition.
Enable consumers by building well-defined domain data products using a Lakehouse. Finally, don’t forget to invest in the foundations, namely the key “ilities” – availability, reliability, quality, security, cost, governance, compliance and privacy.