
What is Observability?

Observability is the ability to measure the internal states of a system by examining its outputs. Whether through application performance monitoring (APM), telemetry data, log analytics, traces, or metrics, the more real-time insight you have into your system, the more quickly you can pinpoint performance problems and mitigate risks.


What problems does Observability solve?

From Observability: A Manifesto by Charity Majors:

“Observability is about getting the right information at the right time into the hands of the people who have the ability and responsibility to do the right thing. Helping them make better technical and business decisions driven by real data, not guesses, or hunches, or shots in the dark. Time is the most precious resource you have — your own time, your engineering team’s time, your company’s time.”

Observability is important because it allows you to spot bottlenecks, resolve outages, and glean valuable insights about how your software behaves.

Telemetry: Correlating Metrics, Traces, and Logs

It is possible to observe complex distributed systems by correlating telemetry data – traces, metrics, and logs.

Traces

A span is an individual unit of work done in a distributed system
As a request kicks off a chain of execution across many distributed systems, those spans are collected into a trace
Traces work by propagating a unique trace ID throughout the system using headers
Apps must be instrumented to send trace information to a backend observability service for analysis

For example, a call to a method named requestStarted produces a single trace that can be broken down into its constituent spans, showing the flow of execution step by step.

With this information, it becomes possible to find bugs, isolate performance bottlenecks, or set intelligent alerts.
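
As a rough illustration of the trace ID propagation described above, the sketch below builds and parses a header in the W3C Trace Context `traceparent` format, which OpenTelemetry uses for propagation by default. The class and method names (`TraceContextSketch`, `randomHex`, `traceparent`, `traceIdOf`) are hypothetical and use only the Java standard library; a real application would rely on an instrumentation library rather than hand-rolled code like this.

```java
import java.security.SecureRandom;

// Hypothetical helper for illustration only; not part of any OpenTelemetry SDK.
final class TraceContextSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Returns `bytes` random bytes rendered as lowercase hex.
    static String randomHex(int bytes) {
        byte[] buf = new byte[bytes];
        RANDOM.nextBytes(buf);
        StringBuilder sb = new StringBuilder(bytes * 2);
        for (byte b : buf) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Builds a traceparent value: version "00", trace ID, span ID, flags "01" (sampled).
    static String traceparent(String traceId, String spanId) {
        return "00-" + traceId + "-" + spanId + "-01";
    }

    // Extracts the 32-hex-character trace ID from an incoming traceparent value.
    static String traceIdOf(String traceparent) {
        String[] parts = traceparent.split("-");
        if (parts.length != 4 || parts[1].length() != 32 || parts[2].length() != 16) {
            throw new IllegalArgumentException("malformed traceparent: " + traceparent);
        }
        return parts[1];
    }

    public static void main(String[] args) {
        // Service A starts the trace and its root span.
        String traceId = randomHex(16);
        String incoming = traceparent(traceId, randomHex(8));

        // Service B receives the header, keeps the trace ID, and creates its own span.
        String outgoing = traceparent(traceIdOf(incoming), randomHex(8));

        System.out.println("incoming: " + incoming);
        System.out.println("outgoing: " + outgoing);
    }
}
```

Each service that handles the request creates its own span ID but reuses the incoming trace ID, which is what lets a backend stitch the spans into one trace.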

Metrics

Metrics are measurements of application performance (e.g. memory usage)
Apps must be instrumented to send metrics information to a backend observability service for analysis
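
To make the idea of a metric concrete, here is a minimal sketch of two common kinds of measurement: a counter that only ever increases, and a gauge that reports a point-in-time value. It uses only the Java standard library; the class name `MetricsSketch` and the metric names `number_of_exec` and `heap_used_bytes` are hypothetical, and the step of exporting the values to a backend is deliberately left out.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-process metrics holder; a real app would use a metrics SDK.
final class MetricsSketch {
    // Counter: monotonically increasing number of handled requests.
    private static final AtomicLong EXECUTIONS = new AtomicLong();

    // Increments the counter and returns its new value.
    static long recordExecution() {
        return EXECUTIONS.incrementAndGet();
    }

    // Gauge: heap memory currently in use by this JVM, in bytes.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Renders the current values as a single line, as a periodic exporter might.
    static String snapshot() {
        return "number_of_exec=" + EXECUTIONS.get()
                + " heap_used_bytes=" + usedHeapBytes();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            recordExecution(); // simulate handling a request
        }
        System.out.println(snapshot());
    }
}
```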

Logs

Logs are a history of events for your application
Logs give information that is critical for debugging
Apps must be instrumented to send logs to a backend observability service for analysis
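
Logs become much easier to correlate with traces when each log event carries the trace ID of the request that produced it. The sketch below formats one such event as a single JSON line using only the Java standard library. The class name `LogSketch` and the field names (`ts`, `level`, `trace_id`, `msg`) are assumptions chosen for illustration, not a format required by any particular backend.

```java
import java.time.Instant;

// Hypothetical structured-log formatter; a real app would use a logging framework.
final class LogSketch {
    // Escapes backslashes and double quotes so the message stays valid inside a JSON string.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Renders one log event as a single JSON line that includes the trace ID.
    static String logLine(Instant ts, String level, String traceId, String message) {
        return String.format(
                "{\"ts\":\"%s\",\"level\":\"%s\",\"trace_id\":\"%s\",\"msg\":\"%s\"}",
                ts, level, traceId, escape(message));
    }

    public static void main(String[] args) {
        // Example trace ID in the 32-hex-character form used by trace headers.
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
        System.out.println(logLine(Instant.now(), "ERROR", traceId, "request failed"));
    }
}
```

With the trace ID in every line, searching the logs for one ID returns the events that belong to a single slow or failed request.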

What is OpenTelemetry?

OpenTelemetry is a Cloud Native Computing Foundation (CNCF) project dedicated to creating an open standard for application telemetry instrumentation. OpenTelemetry is emerging as the preeminent telemetry standard in the observability industry.

OpenTelemetry Protocol (OTLP) provides a standard way to transmit metrics and traces
- NOTE: As of writing, OTLP's support for logs is considered experimental. See OpenTelemetry Project Status for the current status of the project. It is recommended to use other logging solutions (e.g. Elastic Filebeat or Loki) in conjunction with OpenTelemetry for full observability.
API -- a specification for public interfaces to be used by libraries to instrument apps to expose telemetry data
SDK -- actual implementations of the API for different programming languages
Use SDK libraries to manually instrument your app code to expose metrics and traces over OTLP to different backend analytics systems
Collector -- the OpenTelemetry Collector receives telemetry data and exports it using different protocols

Popular Observability Tools and Platforms

Sometimes metrics go wonky and make you raise an eyebrow, at which time you’ll check the traces. Sometimes traces show increased latency, which makes you check on the metrics to see what they might tell you. The application may be emitting logs that provide more information about what was going on at the time of degraded performance. In this way, we start to see how analyzing metrics, traces, and logs together using an observability platform makes it much easier to understand and troubleshoot your complex distributed systems. Here are some popular application performance monitoring (APM) observability platforms:

AppDynamics
Datadog
Dynatrace
Elastic Observability
Grafana Labs
Honeycomb
Lightstep
New Relic
Splunk

Here are some popular open source tools that do not use OTLP natively:

Jaeger -- open source tracing solution (no metrics, no logging)
Prometheus -- very popular open source metrics database (no tracing, no logging)
- Grafana is often used as a visualization tool on top of Prometheus
Loki -- open source log aggregation system

Several of these companies use Confluent to move and process telemetry data at high throughput and low latency, and you can use Confluent’s fully managed connectors to more easily integrate with the observability backend of your choice. Confluent has fully managed sink connectors for Datadog, Splunk, and Elastic, with many more being added all the time.

Confluent also offers a metrics API to give users observability into their own Confluent Cloud usage, with first-class integrations to Datadog and Grafana Cloud.

What is Data Observability?

Logs, metrics, and traces pertain to observability as it relates to application performance monitoring (APM). However, businesses are also highly interested in observing how business data flows end-to-end. This is called data observability and is often spoken about in the context of “data governance”.

For example, Confluent Cloud offers a powerful Stream Lineage interface to observe data as it flows throughout a business.

See the Stream Lineage Documentation for more info

The purpose of this lab is to explore a working example of how OpenTelemetry enables metrics and traces in Java using the OpenTelemetry Java agent. The Java agent is convenient because it sends telemetry data automatically: you only need to instantiate Meter and Tracer objects and set a few environment variables.

The lab uses a Spring Boot Java application that exposes an endpoint at http://localhost:8888/hello. App metrics and request traces are sent via OTLP (OpenTelemetry Protocol) to an observability backend (Elastic Observability APM in this case).

Launch the lab environment by opening https://gitpod.io/#https://github.com/riferrei/otel-with-java. On launch, all services are built and started with docker-compose.

Inspect the source code of the HelloApp Java application. Specifically, look at src/main/java/riferrei/otel/java/HelloAppController.java. This is where OpenTelemetry tracing and custom metrics are implemented.

Send GET requests to the Hello app.

[source,bash]

----

curl http://localhost:8888/hello

----

Repeat the previous curl command several times to the /hello endpoint as well as others (other endpoints are expected to result in error responses).

Execute the following echo command and Ctrl+Click the resulting URL to open the traces for the hello-app in the Kibana UI.

[source,bash]

----

echo https://5601-${GITPOD_WORKSPACE_URL#https://}/app/apm/services/hello-app/transactions

----

NOTE: The URL will look something like https://5601-aquamarine-python-rsq28cwb.ws-us17.gitpod.io/app/apm/services/hello-app/transactions

Scroll down to the bottom of the page and select the /hello endpoint from the Transactions section.

Scroll down again to see the trace sample, which shows latency measurements at various stages of execution.

TIP: These trace samples are a great tool for understanding what is happening in a transaction. In more complex applications, the trace sample would show the flow of execution across many microservices, helping you to identify bugs and performance bottlenecks much more quickly.

Execute the following echo command and Ctrl+Click the resulting URL to open the "discover" area of the Kibana UI, where OpenTelemetry metrics will be automatically discovered.

[source,bash]

----

echo https://5601-${GITPOD_WORKSPACE_URL#https://}/app/discover

----

NOTE: The URL will look something like https://5601-aquamarine-python-rsq28cwb.ws-us17.gitpod.io/app/discover. Ignore warnings.

Investigate the custom.metric.heap.memory and custom.metric.number.of.exec metrics, which are the custom metrics defined in Constants.java and HelloAppController.java.

NOTE: This lab comes from https://github.com/riferrei/otel-with-java, created by Ricardo Ferreira. There is a sibling repository at https://github.com/riferrei/otel-with-golang. The main difference between the Java and Go implementations is that the OpenTelemetry Java agent creates trace spans automatically, while Go requires more manual instrumentation. There is an associated in-depth video walkthrough from SREcon 2021.

In this lab, you explored how to instrument a Java application using OpenTelemetry and send those app metrics and request traces to an observability backend for analysis.

From the original creators of Apache Kafka, learn why Confluent’s data streaming technologies are used by 70% of the Fortune 100. Build real-time data pipelines, unlock real-time data governance, and stream data from any number of sources for seamless data observability, monitoring, and metrics on any cloud.


More Resources

Repo: OpenTelemetry with Java by Ricardo Ferreira
- Associated video walkthrough
Repo: Kafka Observability by Chuck Larrieu Casias and Nacho Muñoz Gómez
- Explore observability hands-on in the context of Apache Kafka clients
Documentation: What is OpenTelemetry?
Documentation: Confluent Stream Lineage