Blog Posts Business Management

DStreams vs. DataFrames: Two Flavors of Spark Streaming

Blog: NASSCOM Official Blog

This post is a guest publication written by Yaroslav Tkachenko, a Software Architect at Activision.

Apache Spark is one of the most popular and powerful large-scale data processing frameworks. It was created as an alternative to Hadoop’s MapReduce framework for batch workloads, but now it also supports SQL, machine learning, and stream processing. Today I want to focus on Spark Streaming and show a few options available for stream processing.

Stream data processing is used when dynamic data is generated continuously, and it is often found in big data use cases. In most instances data is processed in near-real time, one record at a time, and the insights derived from the data are also used to provide alerts, render dashboards, and feed machine learning models that can react quickly to new trends within the data.

DStreams Vs. DataFrames

Spark Streaming went alpha with Spark 0.7.0. It’s based on the idea of discretized streams or DStreams. Each DStream is represented as a sequence of RDDs, so it’s easy to use if you’re coming from low-level RDD-backed batch workloads. DStreams underwent a lot of improvements over that period of time, but there were still various challenges, primarily because it’s a very low-level API.

As a solution to those challenges, Spark Structured Streaming was introduced in Spark 2.0 (and became stable in 2.2) as an extension built on top of Spark SQL. Because of that, it takes advantage of Spark SQL code and memory optimizations. Structured Streaming also gives very powerful abstractions like Dataset/DataFrame APIs as well as SQL. No more dealing with RDD directly!

Both Structured Streaming and Streaming with DStreams use micro-batching. The biggest difference is latency and message delivery guarantees: Structured Streaming offers exactly-once delivery with 100+ milliseconds latency, whereas the Streaming with DStreams approach only guarantees at-least-once delivery, but can provide millisecond latencies.

I personally prefer Spark Structured Streaming for simple use cases, but Spark Streaming with DStreams is really good for more complicated topologies because of its flexibility. That’s why below I want to show how to use Streaming with DStreams and Streaming with DataFrames (which is typically used with Spark Structured Streaming) for consuming and processing data from Apache Kafka. I’m going to use Scala, Apache Spark 2.3, and Apache Kafka 2.0.

Also, for the sake of example I will run my jobs using Apache Zeppelin notebooks provided by Qubole. Qubole is a data platform that I use daily. It manages Hadoop and Spark clusters, makes it easy to run ad hoc Hive and Presto queries, and also provides managed Zeppelin notebooks that I happily use. With Qubole I don’t need to think much about configuring and tuning Spark and Zeppelin, it’s just handled for me.

The actual use case I have is very straightforward:

Streaming With DStreams

In this approach we use DStreams, which is simply a collection of RDDs.

Streaming With DataFrames

Now we can try to combine Streaming with DataFrames API to get the best of both worlds!

Conclusion

Which approach is better? Since DStream is just a collection of RDDs, it’s typically used for low-level transformations and processing. Adding a DataFrames API on top of that provides very powerful abstractions like SQL, but requires a bit more configuration. And if you have a simple use case, Spark Structured Streaming might be a better solution in general!

The post DStreams vs. DataFrames: Two Flavors of Spark Streaming appeared first on NASSCOM Community |The Official Community of Indian IT Industry| :)).

Leave a Comment

Get the BPI Web Feed

Using the HTML code below, you can display this Business Process Incubator page content with the current filter and sorting inside your web site for FREE.

Copy/Paste this code in your website html code:

<iframe src="https://www.businessprocessincubator.com/content/dstreams-vs-dataframes-two-flavors-of-spark-streaming/?feed=html" frameborder="0" scrolling="auto" width="100%" height="700">

Customizing your BPI Web Feed

You can click on the Get the BPI Web Feed link on any of our page to create the best possible feed for your site. Here are a few tips to customize your BPI Web Feed.

Customizing the Content Filter
On any page, you can add filter criteria using the MORE FILTERS interface:

Customizing the Content Filter

Customizing the Content Sorting
Clicking on the sorting options will also change the way your BPI Web Feed will be ordered on your site:

Get the BPI Web Feed

Some integration examples

BPMN.org

XPDL.org

×