[Webinar] Bringing Flink to On-Prem and Private Clouds. Register Now

Turbo-Charging Confluent Cloud To Be 10x Faster Than Apache Kafka®

Écrit par

At Current 2023, we announced that Confluent Cloud is now up to 10x faster than Apache Kafka®, thanks to Kora, The Cloud-Native Kafka engine that powers Confluent Cloud. In this blog post, we will cover what that means in more depth.

Running an “apples-to-apples” performance comparison between Apache Kafka and the fully managed Confluent Cloud can be tricky. For example, Confluent Cloud has additional essential services (e.g., observability, durability auditing, billing, logging—see the award-winning VLDB paper on Kora for more details) running in the background on each node of our fully managed SaaS, keeping our 30K+ clusters durable, secure, and highly available. These functions consume resources that have no equivalent in the open source platform, so the comparison slightly penalizes Confluent Cloud. Kora is playing with a hand tied behind its back, but we saw this as just another challenge on the road to deliver better performance to our customers. 

Building and improving Kora is a continuous process, but now we are happy to share that Confluent Cloud customers can:

  • Get their data to where it’s needed up to 10x faster with more predictable p99 latency across various workload profiles

  • Keep latency reliably low even under cloud provider infrastructure or service disruptions with Kora’s automatic monitoring and mitigation (no paging or any on-call involvement needed!)

  • Enjoy continuous performance optimizations without lifting a finger through seamless upgrades and automatic tuning

Figure 1: Confluent Cloud shows more than 10x tail latency improvements over Apache Kafka at 5.6GBps aggregate throughput (1.4GBps ingress + 4.2GBps egress)

The remainder of this blog post is structured as follows. 

  1. First, we will walk through how we measure latency in Confluent Cloud and the need to improve latency for our customers both under steady state and in more real-world conditions (exposed to external factors such as workload changes, infrastructure degradation, and maintenance upgrades). 

  2. Second, we will share our learnings to improve steady-state latency and benchmarks comparing Confluent Cloud and Apache Kafka. 

  3. Finally, we will conclude with our learnings from improving latency fluctuations due to various external factors and show real customer examples of where this has shown benefit. 

Measuring latency in Confluent Cloud

The first step in reasoning about performance is deciding what to measure and how to do it. Benchmarks often measure performance in a lab environment with synthetic workloads, which may not fully reflect what customers experience in production. While truly observing customers’ experience requires client-side metrics (which we hope KIP-714 will address), we use health check services as a close approximation in Confluent Cloud. Health checks are part of Kora and rely on Apache Kafka producers and consumers to periodically produce and consume messages through the same path the user’s data would travel through. These services are used to measure end-to-end (E2E) latency—the aggregated time it takes for a producer to send a message and the consumer to read it—and to alert us when the observed latency breaches our internal SLOs. 

Figure 2: E2E latency measurement in Confluent Cloud health checks

Having run a managed cloud SaaS offering for many years observing customer requests and our internal SLOs week over week, it is abundantly clear to us that there is a distinction between optimizing for latency under steady state vs latency fluctuations caused by external factors (e.g., due to network jitters, noisy neighbors, managed cloud service degradations, zonal failures, software upgrades, workload fluctuations, etc.). While most performance benchmarks focus on showcasing improvements to steady-state latencies, in the real world it’s latency fluctuations that often cause significant disruptions to customers’ true performance experience. We invested in improving both to make sure Confluent Cloud’s latency is low, and stays consistent, in any scenario. 

Now let’s get to what you really want to see—benchmarks against Apache Kafka, and our learnings from the results. 

Steady-state benchmark results: Confluent Cloud (Kora) vs. Apache Kafka

We benchmarked many of the workload patterns we see in Confluent Cloud. Let’s showcase a few examples of the results we saw:

  • With the same hardware setup, Confluent Cloud’s Kora engine outperforms Apache Kafka by up to 16x, as measured by p99 E2E latencies, across various workload profiles with aggregate throughput (ingress+egress) from 30 MBps to 5 GBps+

  • Confluent Cloud’s latencies stay low and more stable as throughput and partition scales, while Apache Kafka’s performance degrades significantly at heavier loads

  • Confluent Cloud’s performance is more predictable even at tail percentiles

Note
Keep in mind that while benchmarks are a guiding light, your mileage may vary depending on your workload. Our commitment is to continuously improve Kora and seamlessly deliver these improvements behind the scenes focusing on what our customers run. If you encounter scenarios where Confluent Cloud isn't meeting your performance standards, we invite you to inform us so that we can work towards innovating on your behalf. 

Benchmark setup

We ran our benchmarks in Confluent Cloud’s dedicated offering which abstracts the underlying infrastructure in terms of CKUs (Confluent Unit for Kafka). CKUs provide consistent cross-cloud limits on various dimensions of Apache Kafka so that users don’t have to think of underlying hardware but instead think in terms of their workload to CKU mapping and scale up/down their CKUs as necessary. For the purpose of this benchmark, we tested the dimensions of ingress/egress and partition limits of Confluent Cloud against similarly resourced Apache Kafka clusters at 2 different scales to have workload variations across throughput and number of topic-partitions. Though we also tested different fanouts, keyed workloads, and different producer/consumer counts which can impact the latency experienced, we won’t list all of them here for brevity. In a majority of those use cases, Confluent Cloud outperforms Apache Kafka, whereas in other cases there is more work cut out for us. The limits for the different scales we tested are as follows. 

CKU

Ingress limit

Egress limit

Partitions limit

2

100 MBps

300 MBps

9000

28

1400 MBps

4200 MBps

100000 

Figure 3: Confluent Cloud’s CKU limits

To compare against Apache Kafka, we provisioned a similar hardware setup as that of Confluent Cloud (in AWS) with an equivalent number and type of broker compute, EBS storage, and client instances. The number of brokers per CKU and the size of the instance type also vary quite a bit within Confluent Cloud—we use various instance types to satisfy the limits and abstractions at various CKU counts. We also increased the disk size, IOPS, and throughput of EBS for Apache Kafka to be on par with Confluent Cloud for the tests. We also ensured that there were not any client bottlenecks by scaling up the clients along with the number of CKUs being tested. 

All tests were run on Apache Kafka 3.6 with KRaft. The KRaft controllers were running in three separate instances. To compare against higher CKU counts in Confluent Cloud, the Apache Kafka server-side configs for the number of replica fetchers, network threads, and I/O threads were increased to num.replica.fetchers=16, num.network.threads=16, num.io.threads=16. Using default configs would have made the Confluent Cloud improvements much higher. 

Unless otherwise noted, our benchmarks used the following client configurations and other specifications:

Configuration/Workload specification

Setting

acks

all

Fanout

1:3 (1 producer per topic, 3 consumer groups per topic, 1 consumer per group)

Inter-broker SSL

enabled

Message size 

2 KB

linger.ms

10ms

batch.size

16 KB (Java client default)

Warmup time

30 minutes

Test duration

1 hour

Figure 4: Default configuration and workload specification for benchmarks in this blog post

We invite you to try out these workloads and configs for any of our tests on Open Messaging Benchmark in the eu-central-2 AWS region in Confluent Cloud where this is available by default. 

Test 1: Consistent latency improvement as throughput scales

We started the benchmarking exercise by examining Confluent Cloud’s performance consistency as throughput scales at fixed partition counts. By achieving stable latency profiles across different levels of loads, we help ensure latency is consistent and predictable whether users are supporting small workloads or handling massive traffic spikes like Black Friday events.

We experimented across 2 different CKU counts, 2 and 28 CKUs, which correspond to different customer use cases we see in Confluent Cloud in their data streaming evolution. We varied the throughput up to the limit each CKU supports, keeping the total number of partitions at each CKU constant (partitions per topic were always constant at 200). The throughput varied from 10 MBps to 1.4 GBps ingress (which is 30 MBps to 5.6 GBps aggregate). We see that Confluent Cloud significantly outperforms Apache Kafka as throughput scales across the different CKUs with up to 12x improvements.

Figure 5: Confluent Cloud’s latency remains consistent as throughput scales and performance advantages expand to 12x

CKU

Throughput (ingress)

Apache Kafka p99 E2E latency

Confluent Cloud (Kora) p99 E2E latency

Percentage improvement

2 CKU 


(9000 partitions, 45 topics)

10 MBps

69 ms

42 ms

+64%

50 MBps

110 ms

60 ms

+83.3%

100 MBps

598 ms

242 ms

+147.1%


28 CKU 


(100000 partitions, 500 topics)

250 MBps

41 ms

25 ms

+64%

500 MBps

45 ms

26 ms

+73%

1 GBps

86 ms

42 ms

+104%

1.25 GBps

521 ms

53 ms

+893%

1.4 GBps

943 ms

73 ms

+1191.8%

Figure 6: p99 E2E latency comparison between Apache Kafka and Confluent Cloud with various throughput profiles with topic partitions fixed in 2 and 28 CKU (lower is better)

Test 2: Consistent latency improvement as partitions scale

We also altered the partition count per topic dimension in 2 and 28 CKU while keeping the throughput consistent to see whether partition count changes will affect our system’s performance. Confluent Cloud outperforms Apache Kafka with up to 11.5x performance improvements. 

CKU

Partitions

Apache Kafka p99 E2E latency

Confluent Cloud (Kora) p99 E2E latency

Percentage improvement

2 CKU 


(100 MBps, 45 topics)

6750

345 ms

244 ms

+41.6%*

7875

511 ms

168 ms

+204.1%*

28 CKU 


(1.4 GBps, 500 topics)

75000

426 ms

58 ms

+640.9% 

90000

777 ms

67 ms

+1059.7% 

Figure 7: p99 E2E latency comparison between Apache Kafka and Confluent Cloud as partition scales with throughput fixed in 2 and 28 CKU (lower is better)

*Note that at low partition count per CKU and high throughput, background tiering activity interferes with foreground produce/consume processing. Apache Kafka was run without KIP-405 (Kafka Tiered Storage) since that isn’t default yet, and hence this is not an apples-to-apples comparison. Though we haven’t seen this scenario in production (where all partitions tier at the same time), we are working on further improving QoS to minimize the impact of background operations on foreground processing as we evolve our architecture.

Test 3: Predictable lower tail latency

We have been using p99 E2E latency across our benchmarks because it is the most common requirement from our latency-sensitive customers. We believe that performance benchmarks are only reliable if they use metrics that reflect what users actually measure in their daily work. 

Since p95 or average latency may also be relevant for less latency-sensitive users, we prepared the results at different percentiles below using the 28 CKU (1.4 GBps/500 topics) scenario. As shown in Figure 8, Confluent Cloud’s E2E latency stays low even at tail latencies (16x more superior), showing that its performance is more stable and predictable than Apache Kafka.

Figure 8: Confluent Cloud’s performance is more stable and predictable than Apache Kafka, as its tail latency stays consistently low even at higher percentiles

Kora is a fresh take on what Apache Kafka can be in the cloud. Since we published our VLDB paper, our architecture has been evolving continuously. Kora has had some fundamental changes all across the stack for improved performance, including but not limited to increased batching and parallelism, out-of-order processing, and a new replication protocol. These improvements have happened in tandem with a continuously evolving security, durability, and reliability posture to deliver Kora at a competitive price for our customers. While steady-state improvements are great, the even harder part is delivering consistent performance in the real world. Let's talk more about that in the next section.

Beyond the Benchmarks: Optimizing for Real-world Performance

When planning to choose an Apache Kafka offering in the cloud, most users may just do a steady-state benchmark comparison like the ones above, and go with the vendor that can satisfy their requirements. However, most performance benchmarks only showcase results for “when things go well” while overlooking day two operations like network jitters, service degradations, or zonal failures which can contribute to a significant number of disruptions to latency.

Our philosophy towards performance was not just to look good on paper, but rather to stay persistent and predictable in the real world through any type of disruptions. We’ll cover some of the continuous optimizations we‘ve implemented over the years. 

Improvements to Self-Balancing

Self-Balancing Clusters have been a staple of Confluent Cloud’s elasticity. As part of the efforts examining and optimizing Kora’s brokers for latency, we saw an opportunity to further improve the speed at which our self-balancing algorithm reacts to customer workload fluctuations. For example, a customer who uses a single Confluent Cloud cluster across different internal organizations (i.e., multi-tenanted) could see their aggregated performance altered with any sudden change in any internal organization’s behavior. The original algorithm was very heavy-weight in that it tried to balance the cluster for multiple variables like partition distribution and capacity constraints (ingress, egress) in a single iteration. This was inefficient because the balancing consumed a lot of CPU, it led to large rebalances which took longer (by the time the rebalance finished the workload could have further changed), and some of the variables didn’t correlate close enough to warrant being done together (i.e, ingress/egress distribution could be different than partition distribution). Our enhanced self-balancing feature was completely re-architected to be more real-time and lightweight to ensure it reacts swiftly to workload changes.

Figure 9: How Self-Balancing Clusters work
Figure 10: Self-balancing in action. The workload-induced latency variations across different brokers come down significantly after our updates to the balancing algorithm.

Automatic infrastructure degradation detection and remediation 

Cloud infrastructure and services experience degradation all the time. To minimize the impact on latencies during these degradations, we have a robust pipeline that automatically and reliably:

  1. Detects when a broker is experiencing degraded hardware.

  2. Takes the broker out of the critical path by draining the broker of all partition leadership and in-sync replica (ISR) membership; we call this broker demotion. This prevents the degraded broker from impacting latencies as none of its partition replicas form part of the ISR (but still get replicated to asynchronously). Note that we do have safeguards in place to ensure that there is no availability/durability loss due to the loss of an AZ in this mode. 

  3. Once the underlying hardware issue is resolved, the replicas on the broker that lost leadership are restored to normal operational mode and are reelected if doing so respects the self-balancing cluster optimal leader placement.

It should be noted that this mechanism protects against latency spikes; issues that affect redundancy are treated differently by moving offline replicas to another broker to maintain the replication factor.

Figure 11: Process flow of automatic infrastructure degradation monitoring, aggregation, and mitigation

Example of handling latency spikes with infrastructure degradation monitoring in the fleet: Consider a multi-tenant Kora instance where a storage degradation caused a corresponding spike in the health-check monitoring around 10:46 a.m. in Kafka-4, the fifth broker of an 8-broker cluster. Note that Kafka-3 and Kafka-2’s latency are also affected due to them being in the ISR. This is a concern in any leader-based replication protocol as any infrastructure degradations can cause performance issues but not be enough to trigger failure detection. As a result, a more statistical notion of failure detection is needed to detect and automatically remediate these failures. Around 10:50 a.m., the process explained above detects and mitigates the problem through broker demotion and brings the latency down automatically—no paging, no alarms, no on-call involvement, completely automatic. Kafka-4 has no data points after 10:50 a.m., since it no longer hosts any partition leadership, until 11:05 a.m. when it is restored (the observability platform graphs this as a downward straight line between those two data points). This same mechanism has triggered thousands of times across our fleet in the last month alone.

Kafka health check E2E latencies (p99, ms)

Figure 12: A storage degradation causing latency spike was automatically detected and mitigated within minutes as shown on Confluent Cloud health check dashboard

Seamless updates

As a managed service we continuously update our customers’ clusters for various reasons including but not limited to security patches, new features, and better hardware. With Apache Kafka, the maintenance and upgrade process can be quite disruptive as it entails leadership movements (i.e., moving the leaders out of a broker to patch it) on the server side which requires a series of steps including processing metadata updates, truncating the log (if necessary), restarting replica fetchers, and so on. Clients also have to rediscover the new leader. This process repeats when the upgraded broker comes back up and the leadership migrates back to the upgraded broker (which could cause disruptions again). 

We have worked on improving this problem in Kora over the last year to decrease the mean time to synchronize partition leadership changes between leaders, followers, and clients by improving the metadata propagation protocol, parallelizing leadership change application, and carefully orchestrating leadership movement to warm up new brokers. We show below a benchmark of doing an upgrade (roll) in Apache Kafka vs Kora. We are continuously improving on making this better across various scenarios as this is a death-by-a-thousand-cuts kind of problem to ensure there is no need for maintenance windows in the first place and all patching is handled behind the scenes with no or minimal impact on the workload.

Simulation setup: We simulated a roll in Apache Kafka by restarting brokers in sequence as a normal patching process would proceed when a workload was running using the Open Messaging Benchmark workload. Confluent Cloud was running 2 CKUs while Apache Kafka was running 3.6 (KRaft mode) with the same configuration as explained in the steady state benchmark section. The workload here has 9000 partitions across 10 topics.

As you can see, the latency spikes up during the roll of every broker in Apache Kafka (sometimes up to multiple seconds) whereas Confluent Cloud is relatively seamless with only occasional spikes. Note that since client behavior also significantly influences the latencies observed during a roll, it is always a good idea to upgrade clients to the latest and use ones that are supported widely.

Figure 13: For Apache Kafka, upgrades caused visible latency spikes during the rolls for each broker. In comparison, upgrades’ impacts on Confluent Cloud latency were minimal.

Final

Hopefully, this gave you an interesting look into our ongoing performance journey in Confluent Cloud. All of this hard work wouldn’t be possible without the devoted support from all of our Kora engineering team, especially:

Prince Mahajan, Niket Goel, David Mao, Fred Zheng, Dimitar Dimitrov, Ashwin Sangem, Badai Aqrandista, Crispin Bernier, Chris Flood, Harish Sharma, Elena Anyusheva, Ning Shan, Shimiao Zhang, Srinivas Akhil Mallela, Yuying Li, Jason Gustafson, Josh Hanson, Yun Fu, Xavier Leaute, Adithya Chandra, Rittika Adhikari, Michael Borokhovich, Nimisha Mehta, Feng Min, Gopi Attaluri, Jose Garcia Sancio, Mayank Narula, Scott Hendricks, Vikas Singh, Lingnan Liu, Ismael Juma, Dhruvil Shah, Nikhil Bhatia, Anil Sharma, Anna Povzner, Tiffany Jianto, Yi Ding, Arun Singhal, Jon Chiu, Sameer Tejani, Harish Vishwanath, Giuseppe Spada, David Stockton, Praveen Balasubramanian. 

Engineering a data system to lower latency is not just about running profilers, optimizing code and algorithms, and running performance benchmarks to measure the improvements. These things are important but only half the story; the other half is ensuring that the software, in a production setting, also performs well with predictable latencies. Production is where hardware fails, or worse, degrades; where clusters need to get rolled for security patches and upgrades, and where we run a suite of monitoring and security software that can compete for resources. This other half doesn't appear in benchmarks and it is where operational excellence comes into play.

For us, there’s nothing better than a customer waking up to latencies being improved overnight:

“Speed and performance are one of the reasons that we chose Confluent in the first place. One time, we saw a huge improvement, 8x in fact, in latency and cluster load overnight. We actually thought something was wrong with our Confluent Cloud cluster and opened a support ticket. Turned out it was the rollout of a Kora engine performance update. We were able to get so much more out of the capacity that we were able to shrink our cluster and save some money with no impact to our applications.” - BMW Group

We are continuously evolving our architecture and innovating on our customers’ behalf to provide a better price and performance with the Kora engine. Stay tuned and we can’t wait to share all the exciting things we’re working on in 2024. 

Want to take the now turbo-charged Confluent Cloud for a spin? Sign up for free!

  • Marc Selwan is the staff product manager for the Kora Storage team at Confluent. Prior to Confluent, Marc held product and customer engineering roles at DataStax, working on storage and indexing engines for Apache Cassandra.

  • Shriram currently leads the Kafka Data Infrastructure team at Confluent.The team is responsible for delivering Apache Kafka to be the fastest and most cost efficient cloud native service for our customers. Prior to joining Confluent, he worked at AWS for seven years building a cloud native relational database service called Amazon Aurora.

Avez-vous aimé cet article de blog ? Partagez-le !