
Big data streaming analytics case with Apache Kafka, Spark (Flink) and BI systems

Blog: Think Data Analytics Blog

Today we will consider an example of building a big data streaming analytics system based on Apache Kafka, Spark, Flink, a NoSQL DBMS, and the Tableau BI system or visualization in Kibana.

Read on to find out who should analyze Twitter posts in real time and why, how to implement this technically, how to visualize the results in BI dashboards for data-driven decision making, and what the Kappa architecture has to do with it.

Once Again about Big Data Analytics for Business: Marketing Problem Setting

Advertising and marketing are still the largest consumers of Big Data and data science technologies. Moreover, modern business seeks not only to satisfy the client's emerging needs, but also to shape them by stimulating demand or anticipating the consumer's desires.

For example, visitors to recreation parks, summer festivals and outdoor sports events are interested in fast delivery of picnic groceries or ready-to-eat meals. You can identify a potential client through online analysis of their activity on social networks.


For example, hashtags such as #rest, #parkgorky and #weekend under photos on Instagram or Twitter, together with geolocation data, indicate that a person is walking in a specific area right now and, depending on the weather, may be happy to drink hot coffee or cool green tea with a hearty burger or a healthy lunch.

This holds, of course, only if the user is not already in a cafe, i.e. the message lacks hashtags such as #cafe, #lunch or #summertime. By analyzing such posts and tweets in real time, a food tech company can significantly increase its profits through such ad hoc sales.
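The rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Post` structure, the function names and the two tag sets are hypothetical and simply reuse the hashtags from the examples in the text. A real system would receive posts from the social network API and would likely use a richer model than plain set intersection.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple

# Hypothetical tag sets, reusing the hashtags mentioned in the text.
OUTDOOR_TAGS = {"rest", "parkgorky", "weekend"}
ALREADY_EATING_TAGS = {"cafe", "lunch", "summertime"}


@dataclass
class Post:
    """A simplified social media post (hypothetical structure)."""
    text: str
    geo: Optional[Tuple[float, float]] = None  # (latitude, longitude), if shared


def extract_hashtags(text: str) -> Set[str]:
    """Return lowercase hashtags found in the text, without the leading '#'."""
    tags = set()
    for word in text.split():
        if word.startswith("#"):
            tag = word[1:].strip(".,!?;:").lower()
            if tag:
                tags.add(tag)
    return tags


def is_potential_customer(post: Post) -> bool:
    """True if the post has geodata and outdoor hashtags but no 'already eating' hashtags."""
    tags = extract_hashtags(post.text)
    outdoors = bool(tags & OUTDOOR_TAGS)
    already_eating = bool(tags & ALREADY_EATING_TAGS)
    return post.geo is not None and outdoors and not already_eating
```

With this sketch, a geotagged post containing #weekend and #parkgorky is treated as a lead, while the same post with #lunch added is not.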

Thus, the key capabilities of the big data streaming analytics system for this case will be the following:

How to implement this in practice, we will consider further.

Big Data Streaming Analytics ML System Architecture

A typical Big Data system for the need described above has a classic Kappa architecture, which allows relatively inexpensive processing of unique events in real time without in-depth historical analysis. Technically, this can be implemented as follows [1]:

In particular, the Twitter API allows you to receive data in real time, process it and transfer it further along the processing pipeline, which will look like this:

However, it is possible to implement such a system of online big data analytics not only with the help of the Big Data technologies noted in the figure. Read on to see what alternatives are possible for each of the components described.
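Before turning to the alternatives, the core idea of the Kappa architecture, a single processing path over an append-only event log, can be illustrated in plain Python. This is a toy sketch, not the pipeline from the cited source: the in-memory list stands in for a Kafka topic, and the event fields, area names and function names are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

# An append-only event log stands in for a Kafka topic (assumption for illustration).
EventLog = List[Dict[str, str]]


def read_stream(log: EventLog, from_offset: int = 0) -> Iterator[Dict[str, str]]:
    """Yield events in order, starting at the given offset."""
    for event in log[from_offset:]:
        yield event


def count_by_area(events: Iterable[Dict[str, str]]) -> Counter:
    """The single processing path: count posts per area."""
    counts: Counter = Counter()
    for event in events:
        counts[event["area"]] += 1
    return counts


# Hypothetical sample events.
log: EventLog = [
    {"user": "a", "area": "gorky_park"},
    {"user": "b", "area": "gorky_park"},
    {"user": "c", "area": "sokolniki"},
]

# Live processing and reprocessing share the same code; only the start offset differs.
live_view = count_by_area(read_stream(log, from_offset=2))
replayed_view = count_by_area(read_stream(log, from_offset=0))
```

The point of the design is that there is no separate batch layer: if the processing logic changes, the same stream job is simply rerun from an earlier offset.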

Apache Kafka And Other Implementation Technologies

The complexity of connecting the system components to each other and the availability of ready-made integration connectors can become a criterion for choosing a particular framework. 

For example, the Greenplum-Spark Connector 2.0 was released in October 2020, which we covered in a separate article.

You can also connect the Greenplum MPP DBMS to Apache Kafka using the Greenplum Stream Server (GPSS) or the Java-based PXF (Platform eXtension Framework), which we discussed in another article.

The specifics of building your own Apache Spark connector to the Tableau BI system are also covered in a separate article.

In addition, the functional and non-functional requirements for this system component can serve as criteria for choosing an analytical DBMS. For example, Elasticsearch offers near-instant indexing of new data in JSON and other semi-structured formats, with fuzzy search support and ML modules, as we have mentioned previously.
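To make the JSON indexing and fuzzy search point more concrete, here is a minimal sketch that only builds the request bodies as Python dictionaries, without contacting a cluster. The field names (`text`, `hashtags`, `location`) and the helper names are hypothetical. The query uses the `match` query with the `fuzziness` parameter from the Elasticsearch Query DSL, and the `location` object assumes the index maps that field as `geo_point`.

```python
import json
from typing import Any, Dict, Iterable


def tweet_to_document(text: str, hashtags: Iterable[str],
                      lat: float, lon: float) -> Dict[str, Any]:
    """Build a JSON-serializable document for indexing (hypothetical field names)."""
    return {
        "text": text,
        "hashtags": sorted(hashtags),
        # Assumes the index mapping declares 'location' as a geo_point field.
        "location": {"lat": lat, "lon": lon},
    }


def fuzzy_text_query(term: str) -> Dict[str, Any]:
    """Build a match query that tolerates small typos in the search term."""
    return {
        "query": {
            "match": {
                "text": {"query": term, "fuzziness": "AUTO"}
            }
        }
    }


doc = tweet_to_document("Coffee in the park", {"weekend", "rest"}, 55.73, 37.60)
doc_json = json.dumps(doc)                          # body for an index request
query_json = json.dumps(fuzzy_text_query("cofee"))  # misspelled on purpose
```

These bodies would then be sent to the cluster with whichever client the project uses; that step is left out here because it depends on the deployment.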

Built-in integration with Kibana also lets you visualize the results of data analytics, as was done in the ad conversion analysis case study.

The advantage of this solution is that there are no costs for a commercial Tableau license: instead, Apache Kafka is combined with the components of the ELK stack (Elasticsearch, Logstash, Kibana). The machine learning algorithms are implemented in PySpark code running on the Spark framework [2].

Apache Flink provides similar capabilities and can be used instead of Spark if you need fast real-time data processing.

Like Spark, Flink provides SQL modules and machine learning libraries, including the Alink set of algorithms. Flink also lets you write code in Java, Scala and Python, with improved performance thanks to the updates in release 1.13.0 of May 2021 [3]. For the similarities and differences between these distributed frameworks ("Apache Spark vs Flink"), see our separate article.

You can learn the technical details of implementing this case, and of other similar examples of streaming big data analytics based on Apache Kafka, Spark and Greenplum, in the specialized courses at our licensed training and professional development center in Moscow for developers, managers, architects, engineers, administrators, Data Scientists and Big Data analysts:

The post Big data streaming analytics case with Apache Kafka, Spark (Flink) and BI systems appeared first on ThinkDataAnalytics.
