When you encounter a problem with Apache Kafka®—for example, an exploding number of connections to your brokers or perhaps some wonky record batching—it’s easy to consider these issues as something to be solved in and of themselves. But, as you’ll soon see, more often than not, these issues are merely symptoms of a wider problem. Rather than treat individual symptoms, wouldn’t it be better to get to the root of the problem with a proper diagnosis?
If you're looking to level up your Kafka debugging game and understand common problems as well as the ailments that an individual symptom could be pointing to, then this blog series is for you.
Throughout this blog series, we’ll cover a number of common symptoms you may encounter while using Kafka, including:
Increased Request Rate, Request Response Time, and/or Broker Load
Increased Connections (this post)
These issues are common enough that, depending on how badly they’re affecting your normal operations, they might not even draw much attention to themselves. Let’s dive into each of these symptoms individually, learn more about what they are and how they make an impact, and then explore questions to ask yourself to determine the root cause.
In this post, we’ll cover…
If you’ve used Kafka for any amount of time, you’ve likely heard about connections; the most common place that they come up is in regard to clients. Sure, producer and consumer clients connect to the cluster to do their jobs, but it doesn’t stop there. Nearly all interactions across a Kafka cluster occur over connections, so they’re admittedly pretty critical.
But there’s such a thing as being too connected. Too many connections across a cluster can bog down brokers, potentially impacting requests.
Before diving into an issue caused by increased numbers of connections, it’s important to know the types of connections that are made across your cluster and when they are being made.
Every time a producer or consumer client wants to write data to or read data from a Kafka cluster, it initiates and maintains a connection to the brokers. That makes sense. Consumer clients that are part of a consumer group have an added responsibility: they maintain a connection to the ConsumerGroupCoordinator, which runs within a broker, so that they can send heartbeats and register their group membership.
Connections are made as we produce data to and consume data from Kafka. But how many connections are made? Well, it depends on a combination of the number of topics, partitions, and brokers involved as well as a bit of chance.
For both consumers and producers, the number of connections from a single client is capped by the number of topic-partitions with which the client is interacting; in practice, what matters is how many distinct brokers host the leader replicas of those partitions. Producers can potentially produce to every partition within a given topic, so a single producer may have to maintain an open connection to every broker, depending on where the leader replica of each topic-partition resides. Consumers, on the other hand, can be more efficient in their connections to brokers. Consumers can act within a consumer group and, as such, each one will only have a set number of topic-partitions from which to consume. It’s also important to note that when a consumer or producer starts up for the first time, it will connect to one of the bootstrap servers to receive necessary metadata.
As an example, consider the case where a producer is writing to a topic with 3 partitions and is operating in a cluster with 5 brokers. In this scenario, the producer won’t necessarily maintain an open connection to the brokers that don’t contain partitions for that topic. So we’d just need 3 connections for that producer.
All that being said, it’s actually possible that this producer will maintain 4 open connections, depending on which broker the producer connects to on startup. Note that this 4th connection will only be used for the initial metadata call and may not be maintained as long as other connections. This is affected by metadata.max.age.ms (default 5 minutes), which controls the interval at which metadata is refreshed, and connections.max.idle.ms (default 9 minutes), which allows idle connections to be cleaned up and dropped.
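To make the arithmetic in this example concrete, here is a minimal Python sketch. It is a simplified model, not how the client itself tracks connections: it assumes one connection per broker that hosts a partition leader, plus the bootstrap connection if that broker isn't already one of the leaders. The function name, broker IDs, and partition layout are hypothetical.

```python
def producer_connections(partition_leaders, bootstrap_broker):
    """Return (steady_state, at_startup) connection counts for one producer.

    partition_leaders maps a partition number to the id of the broker
    hosting that partition's leader replica (hypothetical layout).
    """
    leader_brokers = set(partition_leaders.values())
    steady_state = len(leader_brokers)
    # The bootstrap connection only adds to the total when the bootstrap
    # broker is not already one of the partition leaders.
    at_startup = len(leader_brokers | {bootstrap_broker})
    return steady_state, at_startup


# A topic with 3 partitions in a 5-broker cluster (broker ids 1 to 5).
leaders = {0: 1, 1: 2, 2: 3}

print(producer_connections(leaders, bootstrap_broker=2))  # (3, 3)
print(producer_connections(leaders, bootstrap_broker=5))  # (3, 4)

# The two client settings named above, at their default values in milliseconds.
client_config = {
    "metadata.max.age.ms": 5 * 60 * 1000,
    "connections.max.idle.ms": 9 * 60 * 1000,
}
```

The second call is the "bit of chance" case: the producer happened to bootstrap against a broker that holds none of the topic's leaders, so it briefly holds a 4th connection.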
Brokers connect with each other, but this depends on specific cluster settings. For example, when topics are replicated, each broker that hosts a follower replica of a given topic-partition will maintain an open connection to the broker on which the leader replica resides. The follower uses this connection to periodically fetch data from the leader and stay in sync.
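As a rough illustration of how replication translates into broker-to-broker connections, the sketch below counts follower-to-leader links for a given replica assignment. It assumes the leader is listed first in each replica list and that each follower opens one connection per fetcher thread to each leader broker it replicates from (the broker setting num.replica.fetchers, which defaults to 1). The topic name and assignment are hypothetical.

```python
from collections import defaultdict


def replication_links(assignments):
    """Map each follower broker to the set of leader brokers it fetches from.

    assignments maps a topic-partition name to its replica list, with the
    leader listed first (an assumption of this sketch).
    """
    links = defaultdict(set)
    for replicas in assignments.values():
        leader, followers = replicas[0], replicas[1:]
        for follower in followers:
            links[follower].add(leader)
    return dict(links)


def replication_connection_count(assignments, fetchers_per_source=1):
    """Estimate inter-broker replication connections for the whole cluster."""
    links = replication_links(assignments)
    return fetchers_per_source * sum(len(leaders) for leaders in links.values())


# Three partitions, replication factor 3, spread across brokers 1 to 3.
assignments = {
    "orders-0": [1, 2, 3],
    "orders-1": [2, 3, 1],
    "orders-2": [3, 1, 2],
}

print(replication_links(assignments))
print(replication_connection_count(assignments))  # 6
```

Note that adding more partitions between the same pair of brokers does not add links in this model; what grows the count is more distinct follower/leader broker pairs or more fetcher threads.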
It doesn’t end there! Depending on what kinds of applications you’re building, there are other ways that connections can be made across your cluster. For example, the AdminClient will create individual connections for each topic it attempts to create.
We’re not saying it always comes down to metrics, but it doesn’t not always come down to metrics. When it comes to the number of connections to your cluster at any given time, there are a couple of broker and client metrics to keep in mind.
kafka.server:type=socket-server-metrics,listener={listener_name},networkProcessor={#},name=connection-count, kafka.producer:type=producer-metrics,client-id=([-.\w]+),name=connection-count and kafka.consumer:type=consumer-metrics,client-id=([-.\w]+),name=connection-count: Quite simply, this is the total number of active connections to the brokers at any given time. While each of the brokers in the latest Kafka versions can handle thousands of simultaneous connections, you’ll want to keep an eye on the trend of your connection counts. Any unexplained spikes could be cause for concern. The same measurement is also conveniently available as a consumer and a producer metric so that you can see the breakdown for your clients.
kafka.server:type=socket-server-metrics,listener={listener_name},networkProcessor={#},name=connection-creation-rate, kafka.producer:type=producer-metrics,client-id=([-.\w]+),name=connection-creation-rate and kafka.consumer:type=consumer-metrics,client-id=([-.\w]+),name=connection-creation-rate: This broker-, producer-, and consumer-level metric goes hand-in-hand with connection-count, showing the number of new connections that are being created per second. It’s a good metric to alert on in order to pinpoint a connection storm as it happens and also identify which type of client (producer or consumer) could be causing the issue.
kafka.network:type=Acceptor,name=AcceptorBlockedPercent,listener={listener_name}: Internally at Confluent, this metric is crucial for identifying when connection storms are occurring through Confluent Cloud; it’s just as important for you to be aware of. Use it in conjunction with any listener, e.g., the replication listener or an external one. This metric gives you insight into the percentage of time that the listener’s acceptor is blocked from accepting new connections. As an example, for the replication listener, this value will identify any bottlenecks that might be happening in your replication process. Ideally, this value will be 0; any positive value indicates that connections are being throttled.
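If you collect connection-creation-rate into your own monitoring pipeline, one simple way to flag a connection storm is to compare each reading against a trailing average. The Python sketch below is only an illustration: the function name, window size, multiplier, and sample readings are all assumptions, and a real alerting system would likely offer an equivalent rule out of the box.

```python
from statistics import mean


def find_connection_storms(samples, window=5, factor=3.0, floor=1.0):
    """Return indexes of readings that exceed `factor` times the trailing average.

    samples: connection-creation-rate readings (connections/second), oldest first.
    window:  number of preceding readings used as the baseline.
    floor:   minimum baseline, so a nearly idle cluster does not alert on tiny changes.
    """
    storms = []
    for i in range(window, len(samples)):
        baseline = max(mean(samples[i - window:i]), floor)
        if samples[i] > factor * baseline:
            storms.append(i)
    return storms


# Hypothetical readings: steady traffic, then two sharp spikes.
rates = [2.0, 3.0, 2.5, 2.0, 3.0, 2.5, 40.0, 55.0, 3.0]
print(find_connection_storms(rates))  # [6, 7]
```

Running the same check separately on the broker, producer, and consumer versions of the metric is one way to narrow down which type of client is responsible.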
In addition to seeing an increased number of connections…
… do you see increased memory consumption? You may want to check if you erroneously created one consumer per thread within a single service instance. See the explanation in the increased rebalance time diagnosis section for more details. But the bottom line is that if you’re moving to a multi-threaded consumer model, avoid creating a consumer per thread as it can increase connections and memory consumption.
… are you witnessing an increased consumer group size and more time to rebalance? Check into your cloud-based KafkaConsumer workloads to see if they’re undersized. This has come up a few times in this blog series; it’s especially relevant in a world with cloud-based Kafka services. If you’re using cloud-based Kafka, it’s reasonable to say that one of the first things you should check when you encounter any issue is whether or not your cloud-based workloads are appropriately sized. It may just save you some time!
… have you seen an increased rate of requests? This could indicate that you’re using multiple KafkaProducer instances within a single service or process. Maybe you’ve recently migrated from another messaging technology and were trying to minimize code changes or perhaps you didn’t quite understand the thread safety of a KafkaProducer. Either way, it could be time to check into your client code.
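For the last question in particular, the usual fix is to create one producer per process and share it across threads rather than creating one per thread or per request. The Python sketch below shows the shape of that pattern using a stand-in class; StubProducer, handle_request, and the topic name are hypothetical and exist only so the example runs without a Kafka cluster. In real code you would substitute your client library's producer, after confirming that it is documented as thread-safe.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class StubProducer:
    """Hypothetical stand-in for a real, thread-safe Kafka producer client."""

    instances = 0
    _count_lock = threading.Lock()

    def __init__(self):
        # Track how many producers (and therefore connection sets) were created.
        with StubProducer._count_lock:
            StubProducer.instances += 1
        self.sent = []
        self._send_lock = threading.Lock()

    def send(self, topic, value):
        with self._send_lock:
            self.sent.append((topic, value))


def handle_request(producer, request_id):
    # Every worker thread reuses the producer it was handed.
    producer.send("events", f"request-{request_id}")


# Create the producer once, then share it with all worker threads.
producer = StubProducer()
with ThreadPoolExecutor(max_workers=8) as pool:
    for request_id in range(100):
        pool.submit(handle_request, producer, request_id)

print(StubProducer.instances, len(producer.sent))  # 1 100
```

Eight threads handled 100 requests, but only one producer, and so only one set of broker connections, was ever created.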
Given their nature, broker connections can be tough to understand and keep track of, but that doesn’t mean that you can’t have control of your Kafka cluster! With a fresh understanding of all of the connections being made across your cluster and metrics to watch, you should be able to debug and diagnose your next connection-related issue with more confidence.
To continue on in your Kafka practice, check out these other great resources to help you along the way:
Avoid potential problems entirely by diving into common Apache Kafka mistakes and pitfalls that everyone is likely to encounter.
Give Confluent Cloud a try and see what fully managed, cloud-based Kafka has to offer you.
Plug into a community: check out the Confluent Community Slack and Confluent Forum.