Level Up Your Kafka Skills in Just 5 Days | Join Season of Streaming
In Part One of our “Inside Flink” blog series, we explored the critical role of stream processing and why developers are increasingly choosing Apache Flink® over other frameworks.
In this second installment, we'll showcase how innovative teams across every industry and size are putting stream processing into practice – from streaming data pipelines to train ML models or more timely analytics to fraud detection in finance and real-time inventory management in retail. We’ll also discuss how Flink is uniquely suited to support a wide spectrum of use cases and helps teams uncover immediate insights in their data streams and react to events in real time.
If you’re interested in trying one of the following use cases yourself, be sure to enroll in the Flink 101 developer course by Confluent. This course provides a comprehensive introduction to Apache Flink and includes hands-on exercises and walkthroughs.
Flink is the ideal platform for a variety of use cases due to its versatility and extensive feature set across a number of key functions. Its layered APIs enable developers to handle streams at different levels of abstraction, catering to both common and specialized stream processing needs. Additionally, its large and active community of developers provide support through forums, mailing lists, and other channels, ensuring that Flink is constantly evolving and improving to meet the needs of different teams.
There is an endless array of use cases powered by stream processing, but we can bucket them into three main categories:
There is some nuance to how Flink and stream processing fit in each of these categories, so let’s dive into them to better understand their differences.
The first category that we’ll cover is event-driven applications, which automate responses to events as they happen. Event-driven applications are commonly used in industries such as finance, healthcare, and transportation, where a single event can drive many different asynchronous business functions. These applications are capable of processing data streams at scale and reacting to either each individual event, or an aggregation of events (often over a time window). They are also able to recognize patterns and react in a timely manner by triggering computations, state updates, and external actions.
This enables us to implement some important use cases:
Fraud detection: analyzing transaction data and triggering alerts based on suspicious activity. For example, identifying if a transaction is likely to be fraudulent when a customer pays with a credit card by comparing with transaction history and other contextual data (having a sub-second process latency in place is critical here).
Anomaly detection: detecting unusual or unexpected events that can help businesses (and customers) identify potential issues and take corrective action. For example, analyzing log activity to identify potential security breaches early to take proactive measures to prevent cyber attacks.
Alerting/notifications: immediately triggering an alert or an action when a threshold is met. For example, notifying a person or service that a stock price has reached a certain threshold.
Real-time routing: processing incoming events and routing them to the appropriate processing units or applications based on predefined rules. For example, routing an e-commerce order to the geographically closest warehouse to ensure it's fulfilled and delivered quickly.
Business process monitoring: monitoring the flow of events in real time to identify bottlenecks, inefficiencies, and other issues that may be impacting the business process. For example, one can identify workflows that have not progressed for a given time. Additionally, reminders can be sent for unfinished activities, such as an abandoned cart scenario.
IoT applications: using sensor data to trigger immediate actions, such as turning on lights or adjusting temperature based on changing local conditions. For example, stream processing helps to monitor sensor data from industrial equipment sensors for predictive maintenance.
In real-time event-driven applications, time is a critical component. Flink offers advanced windowing capabilities, giving developers fine-grained control over how time progresses and data is grouped together for processing:
Customizable window logic: Flink supports time-based and session-based windows, allowing developers to specify the time interval (e.g., every 10 ms, minute, etc.) or number of events in each window. Time-based windows enable the user to emit data at regular intervals, while session-based windows are useful for aggregating events arriving at irregular frequencies, for example to group website clicks into a single user session.
Per event, stateful processing: Flink's over aggregation in SQL, or Process functions, enables real-time processing, allowing immediate computation of each event in the context of the entire stream. For instance, detecting if the current transaction is greater than the highest transaction seen in the last 30 days for each user and triggering an alert immediately without waiting for any time window completion.
Complex event processing (CEP): CEP is a Flink library for detecting patterns of events in a stream of data. You can define complex event patterns in a simple and declarative way. For example, the following code snippet shows how to detect "double bottom” financial patterns in a few lines of very readable SQL with the library’s MATCH_RECOGNIZE
function.
Many of Flink's features are centered around time and watermarks. Flink provides a variety of advanced time and windowing capabilities that enable accurate and efficient processing of streaming data. This is critical for event-driven applications, as well as the two other use cases we describe below.
The second key category of use cases for Flink is real-time analytics, also commonly known as streaming analytics. This is the process of analyzing real-time data streams to generate important business insights that inform both operational and strategic decisions. This approach enables organizations to analyze data as soon as it arrives from the stream and make timely decisions based on the latest, up-to-date information (rather than storing it in a database/data warehouse and analyzing it later when it’s potentially stale).
Here's a great example of a Flink-powered real-time analytics dashboard for UberEats Restaurant Manager, which provides restaurant partners with additional insights about the health of their business, including real-time data on order volume, sales trends, customer feedback, popular menu items, peak ordering times, and delivery performance.
Here are some other examples of real-time analytics use cases:
Ad/campaign performance: tracking and analyzing the performance of ads or marketing campaigns in real time (e.g., click-through rates, conversion rates, engagement rates) to dynamically adjust strategies, targeting, and spend.
Content performance: tracking content performance, such as a media company monitoring the performance of their video lineups and determining which topics or formats are resonating most with their audience.
Quality monitoring of Telco networks: monitoring the quality of service of large networks in real time to enable operators to improve user experience and proactively prevent quality degradation or downtime.
Pattern analysis: identifying patterns in real time, such as social media companies identifying trending topics or influencers. The analytical results can then be used in selecting high-relevance content to show the user.
Flink is equipped with numerous advanced analytics capabilities for analyzing massive datasets in real time. This allows Flink to be used for a diverse range of analytics applications at any scale. Here are some of the key features that Flink provides for real-time analytics:
Low-latency processing: Flink is designed to process large amounts of data with very low latency (sub-second). High processing latency can lead to delayed or flawed decision-making, such as missing profitable opportunities on a stock trading platform.
Machine learning: Flink includes a library called Flink ML that provides a variety of machine learning algorithms, such as linear regression, logistic regression, and k-means clustering. These algorithms can be used to build predictive models and perform classification and clustering on the latest and greatest streaming data.
Time-series analysis: Flink's CEP library, discussed in the last section, also supports time-series analysis of data over time. For example, users can define a pattern that looks for a sudden increase in temperature over a certain period of time.
Materialized views: Flink's dynamic tables, powered by continuous queries, provide an efficient way to always get up-to-date results, which can power dashboards or applications requiring continuous updates. For example, an e-commerce site can provide an always up-to-date view of product availability.
Interactive queries: Flink SQL, which we’ll dive deeper into in our next blog post, allows you to perform advanced analytics on streaming data using familiar SQL syntax. This can be used for real-time dashboards, reports, and other types of interactive queries.
Support for both stream and batch processing: Flink provides a unified API that supports both stream and batch processing modes. This means that developers can use the same SQL queries for both batch and streaming data processing without the need for rewriting code.
In summary, Flink is a versatile and powerful platform that supports both stream and batch processing, with advanced analytics capabilities to handle large amounts of data in real time.
Our third and final category of use cases is streaming data pipelines. These pipelines continuously ingest data streams from applications and data systems, perform joins, aggregations, and transformations to create new enriched and curated streams of higher value. Downstream systems can consume these events for their own purposes, starting from the beginning of the stream, the end of the stream, or anywhere in between. With streaming, these consumer systems can reflect the organization’s current state of affairs, as opposed to stale states based on periodically updated, hourly batch jobs.
Some common patterns of streaming data pipelines include:
ML pipelines: preparing the data and streaming it to object storage (such as S3) for periodic training of ML models. Once trained, they can be continuously and incrementally updated using the same (or other) streams, refining recommendations in real time. Additionally, historical stream data can be reprocessed to extract new features for expanded model capabilities.
Data curation pipelines: ingesting streaming data, processing it in real time, and curating it for downstream use. A social media company can use these pipelines to create a materialized view that computes the most popular posts and hashtags for each user based on their current and past activity. This materialized view can then be used for a content recommendation engine.
Cloud database pipelines: using techniques such as Change Data Capture (CDC) to generate data streams from legacy systems like legacy databases. These data streams can then be processed with Flink to create materialized views or shared with clients who can take advantage of real-time feeds for timely action.
Cloud data warehouse pipelines: migrating data from traditional data warehouses to modern cloud-based data platforms. Streaming data pipelines let you leverage high-value data from legacy sources and connect them to new endpoints, such as porting data from an on-prem data cube to a modern data warehouse such as Amazon Redshift. You can gradually migrate your mission-critical use cases to the new data warehouse while keeping your current operations intact.
One of the use cases for streaming data pipelines worth highlighting in more depth is augmenting GenAI and Large Language Models (e.g., GPT). These models are pre-trained on large amounts of data, which allows them to generate responses to user queries that are highly accurate and natural sounding. However, these models can still benefit from additional context to improve their accuracy and relevance to specific use cases for an individual user or business.
To enable prompt engineering, developers can use a streaming data pipeline to continuously generate the context needed to improve the model's output. For example, if a user searches for "restaurants near me," a streaming data pipeline can use Flink to process the user's location data and search history to generate context, such as the user's preferred cuisine and price range. This context can then be injected into the GPT model using prompt engineering, which can generate more accurate and relevant responses, such as recommending specific Italian restaurants that match the user's preferences.
Check out the GPT-4 + Streaming Data = Real-Time Generative AI blog post to learn more about how stream processing can enhance your GenAI use cases.
Flink and Kafka can easily connect to and from different systems to construct streaming data pipelines, combining data from different sources or tables, and applying real-time transformations on-the-fly using a versatile set of developer APIs:
Stream transformation: Flink's APIs, which we gave an overview of in Part One of the series, provide a comprehensive set of tools to address the most common data transformation and enrichment requirements, from simple to advanced transformations.
Enriching and combining streams with JOINs or Connected Streams. Flink supports time-based JOINs, as well as regular JOINs with no time limit, which enables joins between a data stream and data at rest or between two or more data streams. Support for versioned joins, as illustrated below, ensures that data is joined based on the version available at the time of the events. This feature is useful for joining a transaction with the currency rate, for example.
Exactly-once semantics (EOS): Flink supports end-to-end exactly-once semantics, which means that each event is processed only once, avoiding the creation of duplicates even in the event of failure or restart. This guarantee is important for many scenarios like financial pipelines or any pipeline or event-driven application that triggers actions, as executing these actions multiple times is not permitted.
As a Kafka user, you may have noticed that Flink’s common use case types – event-driven applications, real-time analytics, and streaming data pipelines – are very similar to Kafka use cases. That’s because Flink and Kafka are commonly used together to support various workloads, with Flink serving as the compute layer and Kafka as the storage layer. At Confluent, we are building a complete data streaming platform that tightly integrates Kafka and Flink, as well as important governance and security features, to support the most demanding and mission-critical stream processing use cases.
We’ve provided an overview of some of the common use cases for stream processing with Flink, but there is obviously more you can learn to actually implement your own use case(s). To learn about Flink's inner workings in more detail and put the framework into practice, we highly recommend checking out our Flink 101 course on Confluent Developer.
Moreover, we have an exciting lineup of great talks on real-world use cases powered by Flink and Kafka at Current 2023, the leading data streaming conference taking place in San Jose on September 26-27th. Register now to learn from some of the world's top Flink experts!
In the next blog post in our series, we dive into the topic of building streaming applications rapidly using Flink SQL. We will also explore how it relates to the other Flink APIs and showcase some of its built-in functions and operations. Read part three here: Your Guide to Flink SQL: An In-depth Exploration.
Learn why stream processing is such a critical component of the data streaming stack, why developers are choosing Apache Flink as their stream processing framework of choice, and how to use Flink with Kafka.
As of today, Confluent Cloud for Apache Flink® is available for preview in select regions on AWS. In this post, learn how we’ve re-architected Flink as a cloud-native service on Confluent Cloud.