Elasticity is one of the table stakes for a mature cloud service. Elasticity enables the addition of capacity in a cloud service as needed to meet spikes in demand, and subsequent reduction during periods of low usage. This allows you to align your consumption with business demands and keep costs optimal while ensuring you have the required capacity at all times.
Confluent Cloud now supports elastic Apache Kafka® clusters, allowing you to expand and shrink dedicated clusters in terms of CKUs (Confluent Unit for Kafka)—the unit of capacity for dedicated Kafka clusters in Confluent Cloud. This blog post covers the details of this capability in Confluent Cloud.
Resizing Kafka clusters in Confluent Cloud is fully self-serve, giving you the control to manage the capacity of your clusters to align with your needs. This capability is available through Confluent Cloud’s UI, CLI, and public APIs so you can also handle this task programmatically.
Along with the ability to expand and shrink a Kafka cluster, we recently added a cluster load metric that presents the utilization of your Kafka cluster. Understanding cluster utilization is critical before you decide to resize your cluster so that you do not adversely affect performance. Monitoring the cluster load can alert you to add capacity when load exceeds a certain critical threshold, say 70%. You can choose to continue to operate the cluster at a high load, but doing so will lead to increased producer/consumer latency and throttling. Conversely, when load drops below a certain threshold, you might consider reducing the capacity of the cluster to reduce cost.
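The threshold reasoning above can be sketched as a small decision helper. This is only an illustration: the 70% expand threshold comes from the example in this post, while the 30% shrink threshold, the function name, and the rule that every sample in the window must cross the threshold are assumptions chosen for the sketch, not Confluent Cloud behavior.

```python
from enum import Enum


class Action(Enum):
    EXPAND = "expand"
    SHRINK = "shrink"
    HOLD = "hold"


def recommend_action(load_samples, expand_above=70.0, shrink_below=30.0):
    """Suggest a resize direction from recent cluster load samples (percent).

    expand_above mirrors the 70% example in the post; shrink_below is a
    hypothetical value picked only for illustration.
    """
    if not load_samples:
        # No data: do not recommend a change.
        return Action.HOLD
    # Require the whole window to be above the threshold so that a
    # temporary spike does not trigger an expansion.
    if min(load_samples) > expand_above:
        return Action.EXPAND
    # Likewise, require sustained low load before suggesting a shrink.
    if max(load_samples) < shrink_below:
        return Action.SHRINK
    return Action.HOLD
```

Requiring the entire window to cross the threshold is one simple way to separate sustained load from a short spike; an average or a percentile over the window would be a reasonable alternative.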
Before diving into the details of the user experience around these new features, there are some key points to note:
The following looks at how cluster resizing works.
You can monitor cluster load either in the Confluent Cloud UI or by using the Metrics API. For the cluster shown below, the cluster’s load is at 44%. You can also look at the historical cluster load on the cluster dashboard to verify that the current load has been sustained for some time rather than being a temporary spike.
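For the Metrics API route, a query body could be built roughly as follows. The endpoint URL, the metric name `io.confluent.kafka.server/cluster_load_percent`, and the `resource.kafka.id` filter field are assumptions based on the public Metrics API documentation and should be checked against the current reference; the helper name and the six-hour window are hypothetical. The sketch only constructs the payload and does not send it.

```python
from datetime import datetime, timedelta, timezone

# Assumed Metrics API query endpoint; verify against the current docs.
METRICS_QUERY_URL = "https://api.telemetry.confluent.cloud/v2/metrics/cloud/query"


def build_cluster_load_query(cluster_id, end, hours=6):
    """Build a query body for hourly cluster load over the last `hours`."""
    start = end - timedelta(hours=hours)
    return {
        # Metric name is an assumption, not taken from this post.
        "aggregations": [
            {"metric": "io.confluent.kafka.server/cluster_load_percent"}
        ],
        # Restrict the query to a single Kafka cluster.
        "filter": {"field": "resource.kafka.id", "op": "EQ", "value": cluster_id},
        "granularity": "PT1H",
        # ISO-8601 interval written as "start/end".
        "intervals": [f"{start.isoformat()}/{end.isoformat()}"],
    }


query = build_cluster_load_query(
    "lkc-abc123",  # hypothetical cluster ID
    datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
```

The resulting dictionary would be posted as JSON to `METRICS_QUERY_URL` with API key credentials. Looking at several hourly points rather than a single value is what lets you tell sustained load from a spike.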
If you anticipate adding some more workloads to this cluster, you might decide to expand the cluster to prepare for this increase in demand. To do so, you would navigate to the “Cluster overview → Cluster settings → Capacity” tab. This tab shows the current capacity of the cluster along with the usage for each CKU dimension, as well as the cluster load.
To expand or shrink the cluster, click on “Adjust capacity” to view the slider which allows for the addition or removal of CKUs. As you move the slider, you can see the updated capacity that would be available for each CKU dimension, before you proceed with the resize.
Once you click “Apply changes” you are presented with a confirmation screen that shows how the resize will change the associated base cost for the cluster.
Upon clicking “Continue,” the cluster expansion will commence and you will see a banner in the cluster settings screen that indicates an expansion or shrink operation is in progress. Once the expansion completes you will receive a notification.
Cluster shrink operations follow the same flow, except that safeguards are in place to prevent a shrink that would adversely affect the cluster’s performance. In that case, the final confirmation screen warns you about usage on any CKU dimension that exceeds the capacity of the post-shrink cluster, and about cluster load that is above a certain threshold.
The scenario below attempts to shrink a cluster from 2 CKUs to 1 CKU.
Here, the number of partitions on the cluster is more than 4,500, which is the maximum number of partitions available for a 1 CKU cluster. It is not possible to shrink this cluster unless the number of partitions is reduced to 4,500 or fewer.
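The partition check in this scenario can be approximated with a few lines of Python. The 4,500 figure for 1 CKU comes from this post; scaling that limit linearly with the number of CKUs and the function name are assumptions made for the sketch. The actual safeguards also cover other CKU dimensions and cluster load.

```python
# Partition limit quoted in the post for a 1 CKU cluster.
PARTITIONS_PER_CKU = 4500


def can_shrink(partition_count, target_ckus, partitions_per_cku=PARTITIONS_PER_CKU):
    """Return True if the partition count fits the post-shrink cluster.

    Assumes the limit scales linearly with CKUs (an assumption), and
    treats the limit as inclusive because the post calls it the maximum.
    """
    if target_ckus < 1:
        raise ValueError("a dedicated cluster needs at least 1 CKU")
    limit = target_ckus * partitions_per_cku
    return partition_count <= limit
```

For instance, `can_shrink(4501, 1)` is false, matching the failed shrink described here, while `can_shrink(4500, 1)` is true.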
The confirmation screen displays a warning that the number of partitions is greater than the CKUs in the post-shrink cluster will support. Upon clicking “Continue,” the shrink operation fails with the error message shown below:
From the Confluent CLI, you can expand or shrink a cluster by using the cluster update command, as shown below. Here an update is issued to expand a 2 CKU cluster to 4 CKUs.
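When scripting this step, the same update could be assembled in Python and handed to the CLI. The sketch assumes the current `confluent kafka cluster update <id> --cku <n>` form of the command (older CLI versions used a different binary name); the cluster ID and the helper name are hypothetical.

```python
import shlex


def build_resize_command(cluster_id, ckus):
    """Return the argv list for resizing a dedicated cluster via the CLI."""
    if ckus < 1:
        raise ValueError("CKU count must be at least 1")
    # Assumed command form: confluent kafka cluster update <id> --cku <n>
    return ["confluent", "kafka", "cluster", "update", cluster_id, "--cku", str(ckus)]


# "lkc-abc123" is a placeholder cluster ID, not a real cluster.
cmd = build_resize_command("lkc-abc123", 4)
print(shlex.join(cmd))  # prints: confluent kafka cluster update lkc-abc123 --cku 4
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would execute it, provided the CLI is installed and logged in. Building the command as a list avoids shell quoting problems.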
To resize a cluster using Confluent Cloud public APIs, you first need to create a Cloud API key. Once you have the Cloud API key you can use “Basic Auth” and access the cluster update API to add or remove CKUs to the cluster.
The screenshot below shows details of the expansion of a 2 CKU cluster to 4 CKUs.
You can see the API call to update the cluster from 2 to 4 CKUs and the response in the screenshot from Postman below:
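Outside Postman, a comparable request could be sketched in Python with only the standard library. The endpoint path (`/cmk/v2/clusters/{id}`), the `PATCH` method, and the request body shape are assumptions based on the current public Confluent Cloud API reference and may differ from the API version shown in the screenshot; the IDs and credentials are placeholders. The sketch builds the request without sending it.

```python
import base64
import json
import urllib.request

API_BASE = "https://api.confluent.cloud"


def build_resize_request(cluster_id, environment_id, ckus, api_key, api_secret):
    """Build a cluster update request that sets the CKU count.

    The body shape and path are assumptions; check the API reference.
    """
    body = {
        "spec": {
            "config": {"kind": "Dedicated", "cku": ckus},
            "environment": {"id": environment_id},
        }
    }
    # Basic Auth: base64("<Cloud API key>:<Cloud API secret>")
    token = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    return urllib.request.Request(
        url=f"{API_BASE}/cmk/v2/clusters/{cluster_id}",
        data=json.dumps(body).encode(),
        method="PATCH",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder IDs and credentials, for illustration only.
req = build_resize_request("lkc-abc123", "env-xyz789", 4, "KEY", "SECRET")
```

Sending it would be a matter of calling `urllib.request.urlopen(req)` and reading the JSON response. Keeping request construction separate from sending makes the payload easy to inspect before it changes a billable resource.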
This post demonstrates elasticity in action for Confluent Cloud dedicated Kafka clusters, highlighting the user experience using Confluent Cloud’s UI, CLI, and APIs as well as details of how the control plane handles cluster resize operations. It also covers certain key aspects of the overall experience from the user’s perspective. An earlier blog post demonstrated how to remove brokers and rebalance data on a Kafka cluster. Stay tuned for an upcoming blog post that will detail this from the Confluent Cloud control plane perspective. In the meantime, if you’re ready to get started with Confluent Cloud, use the promo code CL60BLOG for an additional $60 of free usage when you sign up for a free trial.
Providing the self-serve capability to resize clusters is the first step. In the future, we plan to provide functionality to enable autoscaling on dedicated Kafka clusters. For example, users would be able to set policies on their clusters and auto-scale clusters based on those policies. This would further reduce operational burden and allow users to completely offload capacity management, while ensuring they have the desired capacity when needed without having to proactively monitor and take actions to add or remove capacity.