
100 open source Big Data architecture papers for data professionals


Big Data technology has been extremely disruptive with open source playing a dominant role in shaping its evolution.

While it has been disruptive on one hand, on the other it has led to a complex ecosystem in which new frameworks, libraries, and tools are released almost daily, leaving technologists grappling with the deluge.

If you are a Big Data enthusiast or a technologist ramping up (or scratching your head), it is important to spend some serious time deeply understanding the architecture of key systems to appreciate their evolution.

Understanding the architectural components and their subtleties will also help you choose and apply the appropriate technology for your use case.

In my journey over the last few years, some literature has helped me become a better-educated data professional. My goal here is not only to share that literature but also to use the opportunity to bring some order to the labyrinth of open source systems.

One caution: most of the reference literature included is heavily skewed towards deep architectural overviews (in most cases, the original research papers).

I firmly believe that such a deep dive will fundamentally help you understand the nuances, though it will not provide any shortcuts if you want a quick, basic overview.

Jumping right in…

Key architecture layers

Architecture Evolution

The modern data architecture has evolved with the goal of reducing latency between data producers and consumers. This has led to real-time, low-latency processing, bridging the traditional batch and interactive layers into hybrid architectures like Lambda and Kappa.
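To make the Lambda idea concrete, here is a minimal sketch (all names and numbers are illustrative, not any particular system's API): the serving layer answers queries by merging a periodically recomputed batch view with a low-latency real-time view of recent events.

```python
# Lambda-style serving-layer merge (all names and numbers illustrative):
# the batch view is recomputed periodically from the master dataset; the
# real-time view covers only events since the last batch run.
batch_view = {"clicks": 10_000}   # accurate but hours old
realtime_view = {"clicks": 37}    # approximate but seconds fresh

def query(metric: str) -> int:
    """Answer a query by combining the batch and speed layers."""
    return batch_view.get(metric, 0) + realtime_view.get(metric, 0)

print(query("clicks"))  # -> 10037
```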

Before you dive into the individual layers, here are some general documents that can give you a great background on NoSQL, warehouse-scale computing, and distributed systems.

File Systems 

As the focus shifted to low-latency processing, there was a shift from traditional disk-based file systems to in-memory file systems, which drastically reduced I/O and disk serialization costs. Alluxio (Tachyon) and Spark RDDs are examples of that evolution; a small caching sketch follows.
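A minimal PySpark sketch of that idea, assuming a local Spark installation (the file path is a placeholder): caching an RDD keeps the deserialized partitions in memory, so repeated actions avoid re-reading and re-parsing the input.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "cache-demo")

# Read once and keep the deserialized partitions in memory; without
# cache(), each action below would re-read and re-parse the file.
lines = sc.textFile("data.txt").cache()  # "data.txt" is a placeholder

total = lines.count()                                   # materializes the cache
errors = lines.filter(lambda l: "ERROR" in l).count()   # served from memory
print(total, errors)
sc.stop()
```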

File systems have also seen an evolution in file formats and compression techniques. The following references give you a great background on the merits of row and column formats and the shift towards newer nested, column-oriented formats, which are highly efficient for Big Data processing. Erasure codes use some innovative techniques to reduce the overhead of triplication (three-replica) schemes without compromising data recoverability and availability; a toy example follows.
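To make the erasure-coding argument concrete, here is a toy single-parity sketch in plain Python (illustrative, not a production code): XOR-ing k data blocks yields one parity block that can rebuild any one lost block, so a (3, 1) layout stores about 33% extra versus the 200% extra of 3-way replication. Real systems use Reed-Solomon codes such as RS(6, 3), at 50% overhead, to survive multiple losses.

```python
# Toy single-parity erasure code (illustrative, not production): the XOR of
# k data blocks gives one parity block, so any ONE lost block is rebuildable.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]   # k = 3 equal-sized data blocks
parity = xor_blocks(data)            # m = 1 parity block, ~33% overhead

# Simulate losing block 1 and rebuilding it from the survivors + parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
print("recovered:", rebuilt)
```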

Data Stores

Broadly, distributed data stores are classified as ACID or BASE stores, along the continuum from strong to weak consistency respectively. BASE stores are further classified into key-value, document, column, and graph stores, depending on the underlying schema and supported data structures (a toy sketch of the four models follows). While there is a multitude of systems and offerings in this space, I have covered a few of the more prominent ones. I apologize if I have missed a significant one…
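Before the categories below, a toy illustration in plain Python (dict stand-ins, not real store APIs) of how the same fact is shaped in each BASE data model.

```python
# The same "user" fact shaped for each BASE data model (illustrative only).

# Key-value: an opaque value addressed by key (Riak/Redis style).
kv = {"user:42": b'{"name": "Ada", "city": "London"}'}

# Document: the value is a queryable, nested document (MongoDB style).
doc = {"_id": 42, "name": "Ada", "address": {"city": "London"}}

# Column-oriented: rows hold cells grouped into column families
# (BigTable/HBase style).
columns = {"user:42": {"info:name": "Ada", "info:city": "London"}}

# Graph: entities as nodes, relationships as first-class edges (Neo4j style).
graph = {"nodes": {42: {"name": "Ada"}}, "edges": [(42, "LIVES_IN", "London")]}

print(doc["address"]["city"])  # -> London
```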

BASE

Key Value Stores

Column Oriented Stores

Document Oriented Stores

Graph

ACID

I see a lot of evolution happening in the open source community to catch up with what Google has done; three of the prominent papers below are from Google, which has solved the problem of a globally distributed, consistent data store.

Resource Managers

While the first generation of the Hadoop ecosystem started with monolithic schedulers like YARN, it has evolved towards hierarchical, two-level schedulers (Mesos) that can manage distinct kinds of compute workloads side by side to achieve higher utilization and efficiency.

These are loosely coupled with schedulers, whose primary function is to schedule jobs based on scheduling policies and configuration; a toy sketch of the offer-based model follows.
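A toy simulation of the offer-based, two-level model popularized by Mesos (all names are illustrative, not the Mesos API): the resource manager offers resources to frameworks, and each framework's own scheduler decides what, if anything, to launch.

```python
# Toy two-level scheduling in the spirit of Mesos (names illustrative, not
# the Mesos API): the resource manager makes offers; each framework's own
# scheduler accepts an offer or passes it along.
offers = [{"node": "n1", "cpus": 4}, {"node": "n2", "cpus": 2}]

def spark_scheduler(offer):
    # Framework-level policy: only take nodes big enough for an executor.
    if offer["cpus"] >= 4:
        return {"task": "spark-executor", "cpus": offer["cpus"]}
    return None

def batch_scheduler(offer):
    # A lightweight batch framework takes whatever remains.
    return {"task": "batch-job", "cpus": 1}

for offer in offers:
    for framework in (spark_scheduler, batch_scheduler):
        task = framework(offer)
        if task:  # the first framework to accept gets the offer
            print(f"launching {task['task']} on {offer['node']}")
            break
```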

Schedulers

Coordination

These systems are used for coordination and state management across distributed data systems; a minimal example follows.
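A minimal sketch using kazoo, a Python client for ZooKeeper (the host address and paths are placeholders): an ephemeral znode disappears when its session dies, which is the primitive behind liveness tracking and leader election.

```python
from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (address is a placeholder).
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# An ephemeral, sequential znode vanishes when this client's session ends,
# so peers can watch /workers to detect that this worker has died.
zk.ensure_path("/workers")
zk.create("/workers/worker-", ephemeral=True, sequence=True)

print(zk.get_children("/workers"))
zk.stop()
```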

Computational Frameworks

The execution runtimes provide an environment for running distinct kinds of compute. The most common runtimes are:

Spark — its popularity and adoption are challenging the traditional Hadoop ecosystem.

Flink — very similar to the Spark ecosystem; its strength over Spark is in iterative processing.

The frameworks can broadly be classified by processing model and latency:

Batch

MapReduce — the seminal paper from Google on MapReduce (a word-count sketch of the model follows below).

MapReduce Survey — a dated yet good survey of MapReduce frameworks.
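To recall the model the paper introduces, here is a single-process word-count sketch (illustrative, not a distributed implementation): map emits (key, value) pairs, a shuffle groups them by key, and reduce folds each group.

```python
from itertools import groupby
from operator import itemgetter

# Word count in the MapReduce model, single process (illustrative):
# map emits (key, value) pairs, shuffle groups by key, reduce folds groups.
def map_fn(line):
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog"]

pairs = [kv for line in lines for kv in map_fn(line)]      # map phase
pairs.sort(key=itemgetter(0))                              # shuffle phase
for word, group in groupby(pairs, key=itemgetter(0)):      # reduce phase
    print(reduce_fn(word, (count for _, count in group)))
```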

Iterative (BSP)

Streaming

Streaming Data Architecture Overview — an O’Reilly report on the state of stream processing (a toy windowing sketch follows).
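A toy tumbling-window count in plain Python (all data illustrative) showing the basic streaming primitive: timestamped events are aggregated into fixed-size windows rather than over a finite batch.

```python
from collections import Counter

# Toy tumbling-window count (illustrative data): timestamped events are
# bucketed into fixed 10-second windows, the basic streaming primitive.
events = [(3, "click"), (7, "view"), (12, "click"), (19, "click"), (24, "view")]
WINDOW = 10  # window size in seconds

counts = Counter()
for ts, kind in events:
    window_start = (ts // WINDOW) * WINDOW
    counts[(window_start, kind)] += 1

for (start, kind), n in sorted(counts.items()):
    print(f"[{start}, {start + WINDOW}) {kind}: {n}")
```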

Interactive

RealTime

Data Analysis

The analysis tools range from declarative languages like SQL to procedural languages like Pig (the sketch below contrasts the two styles). Libraries, on the other hand, provide out-of-the-box implementations of the most common data mining and machine learning algorithms.
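A small PySpark sketch contrasting the two styles on the same question (data and names are illustrative): a declarative Spark SQL query, where the optimizer picks the plan, versus the equivalent procedural, Pig-style pipeline spelled out step by step.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("styles").getOrCreate()
df = spark.createDataFrame(
    [("ads", 100), ("search", 250), ("ads", 50)], ["team", "spend"])

# Declarative: state WHAT you want; the optimizer picks the plan.
df.createOrReplaceTempView("budgets")
spark.sql("SELECT team, SUM(spend) AS total FROM budgets GROUP BY team").show()

# Procedural: spell out HOW, step by step, in the spirit of Pig.
totals = (df.rdd
            .map(lambda row: (row.team, row.spend))
            .reduceByKey(lambda a, b: a + b))
print(totals.collect())
spark.stop()
```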

Tools

Machine Learning

Data Integration

Data integration frameworks provide good mechanisms to ingest and outgest data between Big Data systems. They range from orchestration pipelines to metadata frameworks with support for lifecycle management and governance.

Ingest/Messaging

Sqoop — a tool to move data between Hadoop and relational data stores.

Kafka — a distributed messaging system for data processing (a minimal producer/consumer sketch follows).
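A minimal sketch with the kafka-python client (broker address and topic name are placeholders): a producer appends events to a topic, and an independent consumer replays them at its own pace, which is what decouples producers from downstream processing.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: append an event to a topic (broker address is a placeholder).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b'{"user": 42, "action": "click"}')
producer.flush()

# Consumer: an independent process replays the topic from the beginning
# at its own pace, decoupled from the producer.
consumer = KafkaConsumer("events",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for msg in consumer:
    print(msg.value)
```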

ETL/Workflow

Metadata

Security

Serialization

ProtocolBuffers — language-neutral serialization format popularized by Google.

Avro — modeled around Protocol Buffers for the Hadoop ecosystem (an Avro round-trip sketch follows).
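A minimal Avro round-trip using the fastavro Python library (the schema and file name are illustrative): the schema is embedded in the file header, which is what makes the format self-describing.

```python
from fastavro import parse_schema, reader, writer

# Define and parse an Avro record schema (illustrative).
schema = parse_schema({
    "name": "User", "type": "record",
    "fields": [{"name": "id", "type": "long"},
               {"name": "name", "type": "string"}],
})

records = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Alan"}]

# Write: the schema travels in the file header, so the file is
# self-describing.
with open("users.avro", "wb") as out:
    writer(out, schema, records)

# Read back without supplying the schema.
with open("users.avro", "rb") as f:
    for rec in reader(f):
        print(rec)
```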

Operational Frameworks

Finally, the operational frameworks provide capabilities for metrics, benchmarking, and performance optimization to manage workloads.

Monitoring Frameworks

Benchmarking

Summary

I hope these papers are useful as you embark on, or strengthen, your journey. I am sure there are a few hundred more papers I may have inadvertently missed, and a whole bunch of systems I am unfamiliar with; apologies in advance, as I don't mean to offend anyone, though I am happy to be educated…


