Log Compaction – Highlights in the Apache Kafka® and Stream Processing Community – March 2017

By:

Gwen Shapira, Engineering Manager, Confluent

Mar 9, 2017 | Read time: 5 min

Big news this month! First and foremost, Confluent Platform 3.2.0 with Apache Kafka® 0.10.2.0 was released! Read about the new features, check out all 200 bug fixes and performance improvements and then download Confluent Platform 3.2.0 and try it out.

Thanks to Ismael Juma, there is already a plan for the next release of Apache Kafka, 0.11.0.0, so you can check out the features planned for June. The big-ticket items are exactly-once and transactions, dropping support for Java 7, and disabling unclean leader election by default.

Notable KIPs this month include:

Voted:

KIP-107: Add purgeDataBefore() API in AdminClient – This KIP allows developers to request data purging from Kafka. This data cleanup is in addition to the usual cleanup policy, which is time-based and size-based. The cleanup API is especially useful for multi-step stream processing jobs, which can now remove intermediate data after it has been processed by downstream jobs.
KIP-119: Drop Support for Scala 2.10 in Kafka 0.11 – We added support for Scala 2.12 in Kafka 0.10.2.0, so now it is time to remove the older version of Scala.
KIP-121: Add KStream peek method – A new stream DSL command. Similar to map(), but intended to produce side-effects rather than modify the events in the stream. This is useful for debugging and diagnostics: peek() can be used to update a monitoring metric or to print the current record, similar to Java 8’s Stream#peek() method.
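
Since KIP-121 is described above as similar to Java 8’s Stream#peek(), here is a minimal sketch of that JDK method, which shows the idea of observing records without changing them. The class name PeekSketch and the helper process() are hypothetical names chosen for illustration, and the sketch uses only java.util.stream, not the Kafka Streams DSL, whose exact signature is not shown in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PeekSketch {

    // Runs a tiny pipeline. peek() records each element it sees in `observed`
    // (a side effect) and passes the element on unchanged; map() does the
    // actual transformation.
    static List<String> process(List<String> records, List<String> observed) {
        return records.stream()
                .peek(observed::add)        // side effect only, element is not modified
                .map(String::toUpperCase)   // the real processing step
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> observed = new ArrayList<>();
        List<String> result = process(Arrays.asList("a", "b"), observed);
        // prints: observed=[a, b] result=[A, B]
        System.out.println("observed=" + observed + " result=" + result);
    }
}
```

In a debugging scenario the side effect would more likely be a log statement or a metric update than a list append; the list is used here only so the effect is easy to inspect.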

Discussed:

KIP-129: Streams Exactly-Once Semantics – Now that adding exactly-once semantics and transactions to Kafka is in progress, it is time to add exactly-once processing semantics to Kafka’s Streams API.
KIP-122: Add Reset Consumer Group Offsets tooling – Ever had a consumer group fail on a bad record and wished you could just tell the consumer group to skip ahead a bit? So did we. Now we are discussing the best CLI to do it.
KIP-124: Request rate quotas – Right now Kafka allows limiting the bandwidth that a client may use to produce and consume, but there is still no control over how much CPU a client uses. This functionality will be very useful for anyone running a multi-tenant cluster. The discussion on how best to model the CPU consumption of clients, and how to let administrators control it via configuration, is fascinating.
KIP-125: ZookeeperConsumerConnector to KafkaConsumer Migration and Rollback – We want to deprecate the old 0.8.x consumer in favor of the new consumer, but some teams have trouble migrating because there is no support for a rolling upgrade between the two consumer types. This KIP proposes a solution to this problem, allowing us to remove the old consumer.

Notable Blog posts:

WePay shared their data pipeline architecture – a microservices architecture that streams data between MySQL, Kafka and Google’s BigQuery. It is a very popular architecture for modern data pipelines, and WePay’s implementation leverages Kafka’s Connect API and the Confluent Schema Registry. And since the blog post mentions Debezium’s MySQL Connector, take note that Debezium released their much-awaited CDC connector for Postgres.
Amis Technology Blog published a beginner-friendly step-by-step guide to getting started with Kafka’s Streams API, and then they also blogged about an advanced use-case for top-n aggregation grouped by different dimensions.
Joining streams is really important for data enrichment use-cases like customer 360 and IoT. Codecentric explain the different options the Kafka Streams API has for joins, with great visualizations and details on how to use them. And if you are wondering how Kafka’s Streams API can be used for IoT, you may want to read this blog post on Kafka’s Streams API, IoT and wearable technology.
Of course, once you take your Kafka Streams application to production, you will also want to know how to monitor Kafka Streams applications using JMX.
We keep mentioning exactly once, but why is it such a big deal? Confluent co-founder, Neha Narkhede, answers.
You know what’s nice about building a versatile framework? Seeing all the different ways people use it. Some people are using the Connect API to get data from FTP to Kafka and others get their data from Oracle. Some people write data from Kafka to Splunk and others from Kafka to Elastic.
Oracle blogged on how to use Kafka’s Connect API and Confluent REST Proxy and Schema Registry to integrate Oracle’s cloud with Apache Kafka. But getting data from a database to Kafka can get complicated if you want to preserve transaction isolation, so make sure you read about the possible issues and a suggested solution.
We published our first annual client survey. Take a look at how the community uses the various clients in the Kafka ecosystem.
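
For the JMX monitoring mentioned in the list above, the following is a rough sketch of how MBeans can be listed with the standard javax.management API. It only sees MBeans registered in the same JVM, so it would have to run inside the application being monitored. The class name JmxDomainLister is hypothetical, and the domain name "kafka.streams" in the comment is an assumption about where a Streams application registers its metrics; the default here is the JDK’s own "java.lang" domain so the sketch runs without Kafka.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import java.util.TreeSet;
import javax.management.MBeanServer;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

public class JmxDomainLister {

    // Returns the sorted canonical names of all MBeans registered under `domain`
    // in this JVM's platform MBean server.
    static Set<String> mbeanNames(String domain) throws MalformedObjectNameException {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Set<String> names = new TreeSet<>();
        // The pattern "<domain>:*" matches every MBean in that domain.
        for (ObjectName name : server.queryNames(new ObjectName(domain + ":*"), null)) {
            names.add(name.getCanonicalName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: inside a Streams application one might pass "kafka.streams" here.
        String domain = args.length > 0 ? args[0] : "java.lang";
        for (String name : mbeanNames(domain)) {
            System.out.println(name);
        }
    }
}
```

From a listed name, individual attributes can then be read with MBeanServer#getAttribute, or the same MBeans can be browsed with a tool such as JConsole.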

Our Confluent Community Slack Channel is thriving – with 500 members and lively discussions on Apache Kafka and all ecosystem projects. The community is still new, but next month we’ll share highlights from the community discussions. You are invited to join.

And most important, we announced the agenda for Kafka Summit NYC and a Kafka Summit hackathon. We look forward to seeing all of you there! Register now!

Gwen Shapira is a Software Engineer at Confluent. She has 15 years of experience working with code and customers to build scalable data architectures, integrating relational and big data technologies. She currently specialises in building real-time reliable data processing pipelines using Apache Kafka. Gwen is an Oracle Ace Director, an author of books including “Kafka: The Definitive Guide”, and a frequent presenter at data-related conferences. Gwen is also a committer on the Apache Kafka and Apache Sqoop projects.
