わずか5日間で Kafka スキルをレベルアップ | ストリーミングシーズンに参加
There was a huge amount of buzz about Apache Flink® at this year’s Kafka Summit London. From an action-packed keynote to standing-room only breakout sessions, it's clear that the Apache Kafka® community is hungry to learn more about Flink and how the stream processing framework fits into the modern data streaming stack.
That's why we're excited to introduce our new "Inside Flink" blog series that takes a deeper look at why developers and organizations everywhere are shifting their stream processing technologies to Flink. Our first blog post explains what Flink is and how it can enhance your streaming use cases running on Kafka. Future topics will include common Flink use cases, an inside look at Flink SQL, and much more.
If you’re interested in putting these topics into practice, we encourage you to apply to our Flink Early Access Program. The program gives Confluent Cloud users a chance to get their hands on an early version of our upcoming fully managed Flink service and help shape its roadmap – sign up for Confluent Cloud and apply today!
And with that, let's start off by breaking down the current stream processing landscape.
Apache Kafka has become the de-facto standard for writing, reading, and sharing data streams. Data streams let you easily share high-quality business facts across your organization, which of course has intrinsic value by itself. But data is far more valuable when you can clean and enrich it, making it usable for downstream systems while enabling you to derive insights and drive actions.
That’s where stream processing comes in. Stream processing allows you to continuously consume data streams, process them with your own business logic, and produce new data streams for others to use. Stream processing use cases cover a wide range of applications including low-latency data pipelines, real-time dashboards, materialized views, machine learning models, and event-driven apps and microservices. We’ll take a more detailed look at stream processing use cases in part two of this series.
The complexity of processing logic varies depending on the use case, ranging from simple operations like filters and aggregations to more complex tasks such as multi-way temporal joins and arbitrary event-driven logic.
Therefore, the benefits of switching from alternatives (periodic batch jobs, ELT, classical two-tiered architecture) to stream processing naturally differ. Still, in practice, I’ve usually seen at least one of the following aspects be a key driver of adoption:
Latency: Stream processing significantly reduces the time between the occurrence of an event and when it is reflected in your product and user experience – be that an app, dashboard, machine learning model, or user-facing analytics.
Innovation and reusability: Stream processing transforms data products into shareable data streams, consumed, transformed, and built upon by downstream systems and applications. Data streams form a set of reusable data primitives, providing well-defined and consistent data access for innovating and powering new products and ideas.
Resource efficiency and cost: Continuous data processing increases resource utilization by distributing work over time. In addition, processing data upstream (e.g., pre-aggregation, sessionization) dramatically reduces costs in downstream systems (e.g., data warehouse, real-time analytics database) while at the same time speeding up queries in these systems.
Expressiveness: Life doesn’t happen in batches. In contrast to periodically scheduled batch jobs, stream processing does not introduce artificial boundaries in your data which overlap into your processing logic.
As discussed in the previous section, a stream processor is as core to the streaming data stack as the database is to the traditional, data-at-rest stack. Let’s talk about Apache Flink in particular and why it has seen such widespread adoption in the last few years.
Apache Flink is a unified stream and batch processing framework. It has been a top-five Apache project for many years and is on its way to becoming the de-facto standard for stream processing. In fact, its growth rate is following a similar trend to Kafka from four years ago:
Flink has a strong, diverse contributor community backed by companies like Alibaba and Apple. It powers stream processing platforms at many companies, including digital natives like Uber, Netflix, and Linkedin, as well as successful enterprises like ING, Goldman Sachs, and Comcast.
Why did these companies choose Apache Flink over other stream processing frameworks? Let’s cover four reasons that companies rely on Apache Flink.
First, Apache Flink has a powerful runtime that provides very high resource efficiency, massive throughput with low latency, and robust state handling. Specifically, the runtime can:
Sustain a throughput in the order of tens of millions of records per second
Provide sub-second latency at scale
Guarantee exactly-once processing end-to-end across system boundaries
Compute correct results in the presence of failures and out-of-order events
Manage and – in the case of faults – recover state ranging into tens of terabytes
Apache Flink is very flexible and can be tailored to the needs of a wide range of workloads. For example, you can optimize for streaming workloads, batch workloads, or a combination of both.
Second, Apache Flink comes with four different APIs, each tailored to different users and use cases. Flink also provides a range of programming language support, including Python, Java, and SQL.
The DataStream API, available in Java and Python, has been around since Apache Flink’s pivot to stream processing in 2014/15. The DataStream API lets you create dataflow graphs by connecting transformation functions like FlatMap, Filter, and Process. Inside those user functions, you have access to the low-level building blocks of a stateful stream processor like state, time, and events. This gives you fine-grained control around how exactly the records are flowing through the system and how they read, write and update the state of your application. If you know the Kafka Streams DSL and Kafka Processor API (↔ ProcessFunction), it will feel quite familiar.
The Table API is Apache Flink’s next-generation declarative API. It allows you to build programs via relational operations like joins, filters, aggregations, and projections, as well as various types of user-defined functions. Like the DataStream API, the Table API is available in Java and Python. The program is optimized like Flink SQL queries, and the API shares many other aspects with SQL including the type system, built-in functions, and the validation layer. This API can be compared to Spark Structured Streaming, Spark’s DataFrame API, or the Snowpark DataFrame API, although these APIs are focused on micro-batch/batch instead of stream processing.
Sharing the same foundation as the Table API, there is Flink SQL, an ANSI standard compliant SQL engine for processing both real-time and historical data. Flink SQL uses Apache Calcite for query planning and optimization. It supports arbitrarily nested subqueries, has broad language support including various streaming joins and pattern matching, and comes with an extensive ecosystem including JDBC Driver, catalogs, and an interactive SQL shell.
And finally, there is “Stateful Functions”, which simplifies the development of stateful, distributed event-driven applications. Stateful Functions is a sub-project of Apache Flink and distinctly different from the other Apache Flink APIs. It is most easily described as a stateful, fault-tolerant, distributed Actor system based on the Apache Flink runtime.
The breadth of API options makes Apache Flink the perfect choice for a stream processing platform. You can choose the API that works best for your language and use case, relying on a single runtime and shared architectural concepts. You can also mix APIs as your requirements and service evolve over time. In Part Three of this series, we’ll take a closer look at Flink SQL, which is the focus for the early access version of our fully managed Flink service.
Third, Apache Flink unifies stream and batch processing. This means that all of Flink’s main APIs (SQL, Table API, and DataStream API) are unified and can be used to process both bounded data sets and unbounded data streams.
Specifically, you can run the same program in either batch processing mode or stream processing mode depending on the nature of the data that is being processed. You can even let the system choose the processing mode for you.
Only bounded data sources → Batch Processing Mode
At least one unbounded data source → Stream Processing Mode
Achieving this has not only been an interesting technological challenge for the Apache Flink community, but comes with very real advantages for you as a user:
You benefit from consistent semantics across real-time and historical data processing use cases.
You can reuse code, logic, and infrastructure between real-time and historical data processing applications.
You can mix historical and real-time data processing in a single application (see my talk at last year’s Current).
Fourth, Flink has been hardened in production for a long time and comes with a lot of production-readiness features, including:
A flexible metrics system that includes built-in reporters for popular tools like Datadog and Prometheus, but also lets you integrate with custom solutions.
Extensive observability, debugging and troubleshooting support in Flink’s Web UI, including back pressure monitoring, Flamegraphs, and thread dumps.
Savepoints, which allow you to statefully scale (i.e., while preserving exactly-once semantics), upgrade, fork, backup, and migrate your application over time.
Apache Flink joined the Apache Incubator in 2014, roughly 2 years after Apache Kafka graduated from it. Since its inception, Apache Kafka has been Apache Flink’s most popular connector. In many ways, Apache Kafka has paved the way for the adoption of Apache Flink, because in order to process streams, we need to store and serve the events somewhere!
Apache Kafka was a particularly good fit for Apache Flink, because compared to other systems like ActiveMQ, RabbitMQ, or PubSub, Kafka allows consumers to persistently store data streams indefinitely. Additionally, consumers can read data streams in parallel and replay them as needed. The former matched Flink’s distributed processing model and the latter is a requirement for Flink’s fault tolerance mechanism.
Over the years, Flink has developed excellent support for building Kafka-based applications. Flink applications can use Kafka as both a source and a sink, leveraging the many services and tools of the Kafka ecosystem. Flink supports many popular formats out of the box, including Avro, JSON, and Protobuf.
It's important to note that Flink doesn't store any data – the processing logic is operating on data that lives somewhere else. Flink acts as a compute layer for Kafka, powering real-time applications and pipelines, with Kafka providing the core streaming data storage layer.
A table in Flink is backed by a topic in a Kafka cluster, and as events are added to the Kafka topic, they are appended to the table in Flink. Queries that reference the table are automatically updated, emitting or materializing their results. More details on this will be provided in the third blog of this series.
Given such great synergy between Flink and Kafka, it's no surprise that many of the companies that were early adopters of Kafka are now adopting Flink. Leading innovators like Airbnb and Stripe have adopted both Kafka and Flink in their data stack, and if you’re a TikTok user, their Monolith AI recommendation system is built on Kafka and Flink.
Obviously, not everyone has access to the same resources and end-to-end capabilities that these companies have. At Confluent, we're building an integrated data streaming platform that unifies Kafka with Flink, so that users have easy access to everything needed to build a powerful data streaming architecture without taking on the operational complexity and burden.
I’ve only scratched the surface of what there is to know about Flink and how to build stream processing applications on top of Kafka. If you want to go deeper into the details of how Flink works, we encourage you to check out our Flink 101 course on Confluent Developer.
In addition, we'll have a lot of great talks on both Flink and Kafka coming up at Current 2023, the premier data streaming conference taking place in San Jose on Sept 26-27th. Registration is still open if you’d like to hear from some of the world’s foremost Flink experts!
In the next installment of our blog series, we’ll take a look at common Flink use cases being implemented across different industries. We will also dive into what makes Flink’s extensive feature set uniquely suitable for this wide range of use cases. Read part two of the series: Flink in Practice: Stream Processing Use Cases for Kafka Users.
Dive into Flink SQL, a powerful data processing engine that allows you to process and analyze large volumes of data in real time. We’ll cover how Flink SQL relates to the other Flink APIs and showcase some of its built-in functions and operations with syntax examples.
As of today, Confluent Cloud for Apache Flink® is available for preview in select regions on AWS. In this post, learn how we’ve re-architected Flink as a cloud-native service on Confluent Cloud.