At New Relic, we have had many problems with a single broker causing disproportionate issues for Kafka processing, issues we never expected. We go in depth into the different scenarios that allow this to happen, the configuration we had chosen hoping for the best, which made these outages possible or made them worse, and what we did to reduce the impact while still keeping Kafka configured as desired.
The outages vary. In one case, shallow broker health checks combined with slow storage and certain producer configuration led to a 20+ minute full service outage caused by a single broker. In another, simply trying to consume data from a broker in the same availability zone (AZ) resulted in blocked processing after a broker in the consumers' AZ rebooted. We also cover how we solved routing around bad brokers when producers use a partition key, which makes it a harder problem.