In large organizations, Confluent Cloud is often accessed simultaneously by many different users and business-critical applications, potentially across different lines of business. With so many individual pieces working together, the risk of an individual outage, error, or incident affecting other services increases. An incident could be caused by a user clicking the wrong button, a misconfigured application, or simply a bug. Ideally, you want your systems and applications to run smoothly all the time, but if there is an incident, it is key to limit the damage by keeping the time to resolution to a minimum.
In most cases, you will observe a disaster (e.g., an unavailable UI) but you will not know the root cause directly. You may then call engineers or support (perhaps at night) to detect and troubleshoot the error. However, a better way would be to notify the responsible platform or application team proactively once a possible incident occurs and give them the additional information they need to fix it instantly. This is where audit logs come into play.
Audit logs offer an automatic way of tracking all of the different interactions that happen within a Confluent Cloud installation. They are emitted for many events, such as the successful authorization to change an Apache Kafka® topic configuration, a connector creation, or a role-binding association. Essentially, they can be used to analyze who is doing what in your Confluent Cloud installation. A full overview of all audit logs can be found in the Confluent Cloud documentation.
In troubleshooting conversations, we often find ourselves asking, “What do the audit logs say?”—especially after an incident has happened, like an application not being able to authenticate to Confluent Cloud anymore, or a resource being deleted without anyone noticing. So we decided to write this blog to provide some best practices for using Confluent Cloud audit logs proactively, for example, by sending notifications to detect root causes, so that incidents may be solved as quickly as possible.
This article provides a complete conceptual guide for developing your own audit log alerting service. Technically, it walks you through the creation of a fully managed pipeline that:
1. Transfers audit logs from an external Confluent Cloud cluster into your own Confluent Cloud cluster using Cluster Linking
2. Sends the data to Splunk via Kafka Connect
3. Defines alerts in Splunk on certain audit log events
4. Sends the alerts to your designated email address to let you know when there are possible issues
We focus on the individual configurations you will need to integrate Splunk with Confluent Cloud, as well as the individual Splunk search and alert parameters you should use to send the Confluent Cloud audit log events proactively. (Security aspects, performance tuning, and costs are only mentioned as side notes.)
You can see the final workflow in the video below:
Note that this guide serves as an extension of the How to Visualise Confluent Cloud Audit Log Data blog by Johnny Mirza. We recommend also checking it out, especially for additional Splunk dashboards.
For this blog post’s demo, we will use the Splunk Cloud Platform. According to Splunk documentation, you can get a 14-day free trial that allows you to insert 5 GB of data per day, which is sufficient for our setup.
Once you register, you will receive an email containing the URL (https://<abc>.splunkcloud.com) of your Splunk instance, your username, and the password. Use the URL and credentials to log into Splunk Cloud Platform.
Next, in Splunk under “Settings” → “Data inputs,” add a new “HTTP Event Collector” named “Confluent Audit Logs” that uses the default configurations. You will need the collector's token later, when we deploy the fully managed Splunk Sink Connector.
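Before wiring up the connector, it can be useful to check that the collector accepts events. The following is a minimal sketch, not part of the original demo: it builds a request against the standard HEC endpoint `/services/collector/event` using the `Authorization: Splunk <token>` header. The host, port `8088`, token, and the helper name `build_hec_request` are placeholders and assumptions; substitute the values from your own Splunk instance.

```python
import json
import urllib.request


def build_hec_request(hec_uri: str, hec_token: str, event: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for the Splunk HTTP Event Collector."""
    body = json.dumps({"event": event}).encode("utf-8")
    return urllib.request.Request(
        url=hec_uri.rstrip("/") + "/services/collector/event",
        data=body,
        headers={
            "Authorization": f"Splunk {hec_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values (assumptions): use your own HEC URI and token.
request = build_hec_request(
    "https://example.splunkcloud.com:8088",
    "00000000-0000-0000-0000-000000000000",
    {"message": "HEC smoke test"},
)

# To actually send the event, uncomment the next line:
# urllib.request.urlopen(request)
```

If the request is sent and accepted, the test event should be searchable in Splunk under the collector's source.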
According to Confluent documentation, “Audit log records in Confluent Cloud audit logs are retained for seven days on an independent Confluent Cloud cluster.” Thus, as a best practice, we will proceed by first transferring the audit logs to our cluster using Cluster Linking. Having the audit logs stored in our cluster enables us to configure a custom retention time, to process the logs (e.g., by filtering for certain events), and to send the logs to external systems like Splunk via fully managed connectors.
The documentation “Use Cluster Linking to Manage Audit Logs on Confluent Cloud” provides a great step-by-step guide for setting up the connection. In simple terms, it explains how to retrieve the required information from the external cluster, and how to set up the cluster link so that the confluent-audit-log-events topic is mirrored. Be aware that you generally need a Dedicated Confluent Cloud cluster to use Cluster Linking (later in this blog post, we share possible workarounds for this and other challenging scenarios).
Once the audit events are being produced to your Confluent Cloud cluster, you’ll need to deploy the fully managed Splunk Sink Connector. For the authentication configuration, obtain the HEC URI from the Splunk registration email and the HEC Token from the HTTP Event Collector. You can use HEC SSL Validate Certificates to enable or disable HTTPS certificate validation, which is important for a secure connection between the connector and the Splunk instance. If you enable validation, you will need to upload the Splunk HEC SSL Trust Store file (containing the certificates required to validate the SSL connection) and provide the Splunk HEC SSL Trust Store Password. For more information, we recommend checking out the corresponding Splunk documentation. In the demo, for simplicity’s sake, we’ll set HEC SSL Validate Certificates to “false.”
The full configuration looks like the following (although most configurations are just the defaults):
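As a rough sketch of such a configuration, the snippet below assembles the settings discussed above into a dictionary and prints it as JSON. The property keys are written as we recall them from the Confluent Cloud connector documentation and should be verified there before use; the connector name, the `JSON` input format, and the helper name `splunk_sink_config` are assumptions made for illustration, and the credentials are placeholders.

```python
import json


def splunk_sink_config(api_key: str, api_secret: str, hec_uri: str, hec_token: str) -> dict:
    """Assemble an illustrative config for the fully managed Splunk Sink Connector."""
    return {
        "name": "SplunkSinkAuditLogs",            # hypothetical connector name
        "connector.class": "SplunkSink",
        "kafka.auth.mode": "KAFKA_API_KEY",
        "kafka.api.key": api_key,
        "kafka.api.secret": api_secret,
        "topics": "confluent-audit-log-events",   # the mirrored audit log topic
        "input.data.format": "JSON",              # assumption: audit events are JSON
        "splunk.hec.uri": hec_uri,                # from the Splunk registration email
        "splunk.hec.token": hec_token,            # from the HTTP Event Collector
        "splunk.hec.ssl.validate.certs": "false", # disabled for the demo only
        "tasks.max": "1",
    }


config = splunk_sink_config(
    "<kafka-api-key>",
    "<kafka-api-secret>",
    "https://example.splunkcloud.com:8088",
    "<hec-token>",
)
print(json.dumps(config, indent=2))
```

Everything not listed here is left at the connector's defaults. In a production setup you would keep certificate validation enabled and supply the trust store instead.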
In some cases, it isn’t possible to follow best practices and use Cluster Linking and the fully managed Splunk Sink Connector to get data from the external cluster to Splunk. In the following table, we share some workarounds for these various scenarios:
Pipeline part | Scenario | Workaround |
---|---|---|
Transfer the audit logs from the external cluster to your cluster | • You do not have a Dedicated cluster • You cannot use Cluster Linking because of networking restrictions (see also Manage Private Networking for Cluster Linking on Confluent Cloud) | |
Transfer the audit logs from your cluster to Splunk | • You cannot use a fully managed connector (e.g., due to network restrictions) | |
Transfer the audit logs directly to Splunk | • You need to transfer the audit logs directly from the external cluster to Splunk (for any of the above reasons) | Self-managed Splunk Sink Connector with custom configurations (see How to Visualise Confluent Cloud Audit Log Data) |
You may want to begin by viewing a general Splunk-provided video tutorial about setting up alerts: Creating Alerts in Splunk Enterprise.
Next, we provide a step-by-step example for setting up an alert that triggers when a Confluent Cloud cluster is deleted. From a business point of view, this alert does not necessarily mean that something suspicious is happening, since clusters may be created and deleted for testing purposes. However, it could be critical if the deleted cluster is a production one.
In Splunk, go to “Search & Reporting,” where you can query your data. You’ll find your data under the “Confluent Audit Logs” source (the name we set for the HTTP Event Collector), with the method name “DeleteKafkaCluster”.
Next, under “Save As,” define the alert (see the image below for all example settings). Unfortunately, real-time alerts are disabled by default in Splunk Cloud, so we need to use a scheduled one. Essentially, the search runs every hour and triggers an alert if one or more Confluent Cloud clusters were deleted. We chose to send an email, but you could also configure a webhook to send the message to your preferred system, such as Slack or Microsoft Teams.
That’s it! You can inspect and edit the defined alert under “Alerts” in Splunk. From now on, you will be notified automatically whenever a Confluent Cloud cluster is deleted.
Try triggering the alert by deleting a cluster in Confluent Cloud (but not the one used for this pipeline!). At the next full hour (because the alert runs as a scheduled job), you will receive an alert email titled “CCloud Cluster deletion.”
The email will also contain the raw audit log event, so that you can further analyze which cluster was deleted, by whom, and when.
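To illustrate that analysis, here is a small sketch that pulls the "who, what, when" out of a raw event. The sample event is hypothetical and heavily simplified: `data.methodName` and `data.cloudResources[].resource.resourceId` mirror the field paths used in the Splunk searches in this post, while the `time` and `data.authenticationInfo.principal` fields, the identifiers, and the helper name `summarize_event` are assumptions. Check them against a real event from your own audit log.

```python
import json

# Hypothetical, simplified audit log event (not copied from a real cluster).
SAMPLE_EVENT = """
{
  "time": "2024-01-01T12:00:00Z",
  "data": {
    "methodName": "DeleteKafkaCluster",
    "authenticationInfo": {"principal": "User:u-example"},
    "cloudResources": [{"resource": {"resourceId": "lkc-example"}}]
  }
}
"""


def summarize_event(raw_event: str) -> dict:
    """Extract what happened, who did it, when, and which resources were affected."""
    event = json.loads(raw_event)
    data = event.get("data", {})
    return {
        "what": data.get("methodName"),
        "who": data.get("authenticationInfo", {}).get("principal"),
        "when": event.get("time"),
        "resources": [
            entry.get("resource", {}).get("resourceId")
            for entry in data.get("cloudResources", [])
        ],
    }


summary = summarize_event(SAMPLE_EVENT)
print(summary)
```

Missing fields simply come back as `None` (or an empty list), so the helper does not fail on events with a different shape.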
If you want to be notified only when certain clusters are deleted (e.g., production), you can modify your Splunk search by adding: (data.cloudResources{}.resource.resourceId="<lkc-prod1>" OR data.cloudResources{}.resource.resourceId="<lkc-prod2>").
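If you define many alerts, composing the search strings programmatically can help keep them consistent. The sketch below combines the source name and the resource filter shown above into a single search string. Note that the field path `data.methodName` is an assumption inferred from the `data.cloudResources{}` paths in this post, and `build_audit_search` is a hypothetical helper, so compare the output with the fields Splunk actually shows for your events.

```python
def build_audit_search(method_name: str, resource_ids=()) -> str:
    """Compose a Splunk search for one audit log method, optionally limited to some resources."""
    search = f'source="http:Confluent Audit Logs" data.methodName="{method_name}"'
    if resource_ids:
        # Doubled braces produce the literal "{}" that Splunk uses for JSON arrays.
        clauses = " OR ".join(
            f'data.cloudResources{{}}.resource.resourceId="{resource_id}"'
            for resource_id in resource_ids
        )
        search += f" ({clauses})"
    return search


all_deletions = build_audit_search("DeleteKafkaCluster")
prod_deletions = build_audit_search("DeleteKafkaCluster", ["lkc-prod1", "lkc-prod2"])
print(all_deletions)
print(prod_deletions)
```

The resulting string can be pasted into “Search & Reporting” to verify the results before saving it as an alert.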
Essentially, the steps described above can be modified and repeated for every other alert you’d like to define. Make sure you get the proper results via the Splunk search (a good start is to scan the fields of the corresponding audit log event you want to set the alert for), and that you define the right trigger condition.
The following table of commonly used alerts is based on discussions with colleagues and customers. Because customers have different requirements and new features are released on Confluent Cloud frequently, this table is not exhaustive. If there are additional alerts that you would like to add, we encourage you to collaborate on the corresponding GitHub repository or to reach out to me directly via LinkedIn.
Alert | Audit log | Splunk search |
---|---|---|
CC cluster deletion | DeleteKafkaCluster | |
Environment deletion | | |
Authentication failure to CC cluster | | |
Authentication failure to Schema Registry | | |
Authorization failure to CC cluster resource | | |
Authorization failure based on IP filter | | |
OrganizationAdmin role binding associated | | |
Because there are several pieces involved in your pipeline, we recommend monitoring its health:
- For Cluster Linking, you can use the Confluent Cloud Metrics API to check the cluster link count and the mirror topic offset lag.
- Fully managed Confluent Cloud connectors emit Connector events when a task or connector fails. We recommend sending these notifications (Connector in “FAILED” state) to yourself via email, or to your preferred system such as Slack or Microsoft Teams. For more information, see Notifications for Confluent Cloud.
- You might also consider adding a health alert in Splunk that checks whether at least one new audit log event has arrived in Splunk over a given time interval. The corresponding Splunk search is simply source="http:Confluent Audit Logs", and the alert triggers when the number of results is equal to 0 for the chosen interval (x hours or days, depending on your individual requirements).
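For the Cluster Linking check, a Metrics API query could look roughly like the sketch below, which only builds the request body and does not call the API. The metric name and the `resource.kafka.id` filter field are written as we recall them from the Metrics API documentation and should be verified there; the cluster ID, the one-hour lookback, and the helper name `mirror_lag_query` are assumptions for illustration.

```python
import json
from datetime import datetime, timedelta, timezone


def mirror_lag_query(cluster_id: str, end: datetime, lookback: timedelta = timedelta(hours=1)) -> dict:
    """Build a Metrics API query body for the mirror topic offset lag of one cluster."""
    start = end - lookback
    return {
        "aggregations": [
            {"metric": "io.confluent.kafka.server/cluster_link_mirror_topic_offset_lag"}
        ],
        "filter": {"field": "resource.kafka.id", "op": "EQ", "value": cluster_id},
        "granularity": "PT1M",
        # ISO-8601 interval in the form "start/end"
        "intervals": [f"{start.isoformat()}/{end.isoformat()}"],
        "limit": 25,
    }


query = mirror_lag_query("lkc-example", datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc))
print(json.dumps(query, indent=2))
```

The body would be sent as a POST to the Metrics API query endpoint with a Cloud API key. A lag that keeps growing suggests that audit log events are no longer being mirrored in time.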
We have successfully created a pipeline that proactively notifies us when certain events happen in a Confluent Cloud installation—based on audit logs.
We hope it helps you replace “Let me analyze my audit logs to see what happened” with “I proactively know when something suspicious is happening in my Confluent Cloud installation, and I instantly have the right data to analyze it further.”
You can find the code for the demo in this GitHub repository.