[Webinar] How to Protect Sensitive Data with CSFLE | Register Today

Stream Processing Simplified: An Inside Look at Flink for Kafka Users

Written By

There was a huge amount of buzz about Apache Flink® at this year’s Kafka Summit London. From an action-packed keynote to standing-room only breakout sessions, it's clear that the Apache Kafka® community is hungry to learn more about Flink and how the stream processing framework fits into the modern data streaming stack.

That's why we're excited to introduce our new "Inside Flink" blog series that takes a deeper look at why developers and organizations everywhere are shifting their stream processing technologies to Flink. Our first blog post explains what Flink is and how it can enhance your streaming use cases running on Kafka. Future topics will include common Flink use cases, an inside look at Flink SQL, and much more.

If you’re interested in putting these topics into practice, we encourage you to apply to our Flink Early Access Program. The program gives Confluent Cloud users a chance to get their hands on an early version of our upcoming fully managed Flink service and help shape its roadmap – sign up for Confluent Cloud and apply today!

And with that, let's start off by breaking down the current stream processing landscape.

Enhancing your Kafka use cases with stream processing

Apache Kafka has become the de-facto standard for writing, reading, and sharing data streams. Data streams let you easily share high-quality business facts across your organization, which of course has intrinsic value by itself. But data is far more valuable when you can clean and enrich it, making it usable for downstream systems while enabling you to derive insights and drive actions.

That’s where stream processing comes in. Stream processing allows you to continuously consume data streams, process them with your own business logic, and produce new data streams for others to use. Stream processing use cases cover a wide range of applications including low-latency data pipelines, real-time dashboards, materialized views, machine learning models, and event-driven apps and microservices. We’ll take a more detailed look at stream processing use cases in part two of this series.

Stream processing enriches and reshapes data in Kafka to create reusable data streams that power real-time applications and pipelines

The complexity of processing logic varies depending on the use case, ranging from simple operations like filters and aggregations to more complex tasks such as multi-way temporal joins and arbitrary event-driven logic.

Therefore, the benefits of switching from alternatives (periodic batch jobs, ELT, classical two-tiered architecture) to stream processing naturally differ. Still, in practice, I’ve usually seen at least one of the following aspects be a key driver of adoption:

  • Latency: Stream processing significantly reduces the time between the occurrence of an event and when it is reflected in your product and user experience – be that an app, dashboard, machine learning model, or user-facing analytics.

  • Innovation and reusability: Stream processing transforms data products into shareable data streams, consumed, transformed, and built upon by downstream systems and applications. Data streams form a set of reusable data primitives, providing well-defined and consistent data access for innovating and powering new products and ideas.

  • Resource efficiency and cost: Continuous data processing increases resource utilization by distributing work over time. In addition, processing data upstream (e.g., pre-aggregation, sessionization) dramatically reduces costs in downstream systems (e.g., data warehouse, real-time analytics database) while at the same time speeding up queries in these systems.

  • Expressiveness: Life doesn’t happen in batches. In contrast to periodically scheduled batch jobs, stream processing does not introduce artificial boundaries in your data which overlap into your processing logic.

Four reasons why developers choose Apache Flink

As discussed in the previous section, a stream processor is as core to the streaming data stack as the database is to the traditional, data-at-rest stack. Let’s talk about Apache Flink in particular and why it has seen such widespread adoption in the last few years. 

Apache Flink is a unified stream and batch processing framework. It has been a top-five Apache project for many years and is on its way to becoming the de-facto standard for stream processing. In fact, its growth rate is following a similar trend to Kafka from four years ago:

Flink has been tracking with Kafka’s historical growth rates in terms of monthly unique users

Flink has a strong, diverse contributor community backed by companies like Alibaba and Apple. It powers stream processing platforms at many companies, including digital natives like Uber, Netflix, and Linkedin, as well as successful enterprises like ING, Goldman Sachs, and Comcast.

Why did these companies choose Apache Flink over other stream processing frameworks? Let’s cover four reasons that companies rely on Apache Flink.

Apache Flink’s powerful runtime

First, Apache Flink has a powerful runtime that provides very high resource efficiency, massive throughput with low latency, and robust state handling. Specifically, the runtime can:

  • Sustain a throughput in the order of tens of millions of records per second

  • Provide sub-second latency at scale

  • Guarantee exactly-once processing end-to-end across system boundaries

  • Compute correct results in the presence of failures and out-of-order events

  • Manage and – in the case of faults – recover state ranging into tens of terabytes

Apache Flink is very flexible and can be tailored to the needs of a wide range of workloads. For example, you can optimize for streaming workloads, batch workloads, or a combination of both.

Apache Flink’s APIs and language support

Second, Apache Flink comes with four different APIs, each tailored to different users and use cases. Flink also provides a range of programming language support, including Python, Java, and SQL. 

Flink features layered APIs at different levels of abstraction which offers flexibility to handle both common and specialized use cases

The DataStream API, available in Java and Python, has been around since Apache Flink’s pivot to stream processing in 2014/15. The DataStream API lets you create dataflow graphs by connecting transformation functions like FlatMap, Filter, and Process. Inside those user functions, you have access to the low-level building blocks of a stateful stream processor like state, time, and events. This gives you fine-grained control around how exactly the records are flowing through the system and how they read, write and update the state of your application. If you know the Kafka Streams DSL and Kafka Processor API (↔ ProcessFunction), it will feel quite familiar. 

The Table API is Apache Flink’s next-generation declarative API. It allows you to build programs via relational operations like joins, filters, aggregations, and projections, as well as various types of user-defined functions. Like the DataStream API, the Table API is available in Java and Python. The program is optimized like Flink SQL queries, and the API shares many other aspects with SQL including the type system, built-in functions, and the validation layer. This API can be compared to Spark Structured Streaming, Spark’s DataFrame API, or the Snowpark DataFrame API, although these APIs are focused on micro-batch/batch instead of stream processing.

Sharing the same foundation as the Table API, there is Flink SQL, an ANSI standard compliant SQL engine for processing both real-time and historical data. Flink SQL uses Apache Calcite for query planning and optimization. It supports arbitrarily nested subqueries, has broad language support including various streaming joins and pattern matching, and comes with an extensive ecosystem including JDBC Driver, catalogs, and an interactive SQL shell.

And finally, there is “Stateful Functions”, which simplifies the development of stateful, distributed event-driven applications. Stateful Functions is a sub-project of Apache Flink and distinctly different from the other Apache Flink APIs. It is most easily described as a stateful, fault-tolerant, distributed Actor system based on the Apache Flink runtime.

The breadth of API options makes Apache Flink the perfect choice for a stream processing platform. You can choose the API that works best for your language and use case, relying on a single runtime and shared architectural concepts. You can also mix APIs as your requirements and service evolve over time. In Part Three of this series, we’ll take a closer look at Flink SQL, which is the focus for the early access version of our fully managed Flink service.

Unified stream and batch processing

Third, Apache Flink unifies stream and batch processing. This means that all of Flink’s main APIs (SQL, Table API, and DataStream API) are unified and can be used to process  both bounded data sets and unbounded data streams. 

Specifically, you can run the same program in either batch processing mode or stream processing mode depending on the nature of the data that is being processed. You can even let the system choose the processing mode for you.

  • Only bounded data sources → Batch Processing Mode

  • At least one unbounded data source → Stream Processing Mode

Flink can unify stream and batch processing under the same umbrella

Achieving this has not only been an interesting technological challenge for the Apache Flink community, but comes with very real advantages for you as a user:

  • You benefit from consistent semantics across real-time and historical data processing use cases.

  • You can reuse code, logic, and infrastructure between real-time and historical data processing applications.

  • You can mix historical and real-time data processing in a single application (see my talk at last year’s Current).

Production-readiness

Fourth, Flink has been hardened in production for a long time and comes with a lot of production-readiness features, including:

  • A flexible metrics system that includes built-in reporters for popular tools like Datadog and Prometheus, but also lets you integrate with custom solutions.

  • Extensive observability, debugging and troubleshooting support in Flink’s Web UI, including back pressure monitoring, Flamegraphs, and thread dumps.

  • Savepoints, which allow you to statefully scale (i.e., while preserving exactly-once semantics), upgrade, fork, backup, and migrate your application over time. 

Apache Flink and Kafka – Better Together

Apache Flink joined the Apache Incubator in 2014, roughly 2 years after Apache Kafka graduated from it. Since its inception, Apache Kafka has been Apache Flink’s most popular connector. In many ways, Apache Kafka has paved the way for the adoption of Apache Flink, because in order to process streams, we need to store and serve the events somewhere! 

Apache Kafka was a particularly good fit for Apache Flink, because compared to other systems like ActiveMQ, RabbitMQ, or PubSub, Kafka allows consumers to persistently store data streams indefinitely. Additionally, consumers can read data streams in parallel and replay them as needed. The former matched Flink’s distributed processing model and the latter is a requirement for Flink’s fault tolerance mechanism.

Over the years, Flink has developed excellent support for building Kafka-based applications. Flink applications can use Kafka as both a source and a sink, leveraging the many services and tools of the Kafka ecosystem. Flink supports many popular formats out of the box, including Avro, JSON, and Protobuf. 

It's important to note that Flink doesn't store any data – the processing logic is operating on data that lives somewhere else. Flink acts as a compute layer for Kafka, powering real-time applications and pipelines, with Kafka providing the core streaming data storage layer.

Flink acts as the compute layer and Kafka acts as the storage layer within the data streaming stack

A table in Flink is backed by a topic in a Kafka cluster, and as events are added to the Kafka topic, they are appended to the table in Flink. Queries that reference the table are automatically updated, emitting or materializing their results. More details on this will be provided in the third blog of this series. 

Given such great synergy between Flink and Kafka, it's no surprise that many of the companies that were early adopters of Kafka are now adopting Flink. Leading innovators like Airbnb and Stripe have adopted both Kafka and Flink in their data stack, and if you’re a TikTok user, their Monolith AI recommendation system is built on Kafka and Flink.

Obviously, not everyone has access to the same resources and end-to-end capabilities that these companies have. At Confluent, we're building an integrated data streaming platform that unifies Kafka with Flink, so that users have easy access to everything needed to build a powerful data streaming architecture without taking on the operational complexity and burden. 

How can I learn more about Flink?

I’ve only scratched the surface of what there is to know about Flink and how to build stream processing applications on top of Kafka. If you want to go deeper into the details of how Flink works, we encourage you to check out our Flink 101 course on Confluent Developer.

In addition, we'll have a lot of great talks on both Flink and Kafka coming up at Current 2023, the premier data streaming conference taking place in San Jose on Sept 26-27th. Registration is still open if you’d like to hear from some of the world’s foremost Flink experts!

In the next installment of our blog series, we’ll take a look at common Flink use cases being implemented across different industries. We will also dive into what makes Flink’s extensive feature set uniquely suitable for this wide range of use cases. Read part two of the series: Flink in Practice: Stream Processing Use Cases for Kafka Users.

  • Konstantin is a member of the Apache Flink PMC, long-term contributor to the project and group product manager at Confluent. He joined the company early this year as part of the acquisition of Immerok which he had co-founded with a group of long-term community members earlier last year. Formerly, as Head of Product at Ververica, Konstantin supported multiple teams working on Apache Flink in both discovery as well as delivery. Before that he was leading the pre-sales team at Ververica, helping their clients as well as the Open Source Community to get the most out of Apache Flink.

Did you like this blog post? Share it now