Streaming Updates through Complex Operations in Kafka Streams at Scale

« Kafka Summit London 2022

With Kafka Streams, you can build complex stream processing topologies, including Joins and Aggregations over data streams. Making these complex processing pipelines aware of updates (including deletions) often becomes difficult, especially for previously joined and aggregated data. Because of the sheer amount of data, re-aggregating or re-joining from scratch at some time to handle updates correctly is not desirable.

In practice, these updates play an important role in stream processing, for example, to continuously improve data quality, to ensure data privacy, or to handle late-arriving data. This talk explores how we efficiently handle these stream updates and deletions in consecutive joins with Kafka Streams. Furthermore, we present an optimization for the aggregate operation in Kafka Streams, leveraging state stores to handle updates in complex aggregates. We discuss the challenges we encountered running complex stream processing topologies on Kubernetes and explore the solution with hands-on experiments.

Finally, we demonstrate how splitting the stream processing topologies enables us to have more fine granular control over resource allocation and scalability of different consecutive processing steps. We show how this improves cost-efficiency through autoscaling and the overall manageability of such streaming pipelines.

Moderator

Victor Künstler

bakdata GmbH

Victor Künstler is a software engineer at bakdata. He works on data streaming and Big Data solutions with Apache Kafka and develops solutions combining data stream processing with ML for continuous inference. He enjoys building resource-efficient and scalable distributed systems using open source technologies.

Streaming Updates through Complex Operations in Kafka Streams at Scale

Moderator

Victor Künstler

Related Links

How Confluent Completes Apache Kafka eBook

Leverage a cloud-native service 10x better than Apache Kafka

Confluent Developer Center

Spend less on Kafka with Confluent, come see how