
Revolutionizing Failure Management in Apache Flink: Meet FLIP-304's Pluggable Failure Enrichers

Written By
  • Adi Polak, Director of Advocacy and Developer Experience Engineering

TL;DR

FLIP-304 gives you the power to transform how failures are handled in Apache Flink®. With custom failure enrichers, you can tag and classify errors, integrate enriched data with monitoring tools, and quickly identify root causes.

Failures in distributed systems are inevitable, but managing them effectively makes all the difference. Enter FLIP-304: Pluggable Failure Enrichers, an upgrade that helps turn confusion into actionable insights.

Why should you care?

You’re deploying a distributed application on Apache Flink. It’s running smoothly, handling data streams like a champ. Then—a failure strikes. The only clue? A vague error message saying, “Something broke.” Not very helpful.

Traditional failure reporting in distributed systems often leaves critical questions unanswered:

  • Is it a code issue or an infrastructure problem?

  • What triggered the failure?

  • How can I monitor and act on similar failures in the future?

Without detailed insights, debugging becomes a time-consuming process.

Example of FLIP-304 in action

Imagine running a real-time data streaming pipeline that processes user behavior data. Suddenly, a task fails in one of the Flink jobs. The standard error message says:

Task failed: Slot 3, Operator: FilterTransformation Cause: NullPointerException

Not very helpful. However, with a custom failure enricher, the enriched metadata reveals:

  • failure.type: UserError

  • priority: High

  • operator: FilterTransformation

  • slot: 3

  • strategy: SKIP

Now you can quickly identify that the issue stems from a user-defined function in the FilterTransformation operator and prioritize the fix accordingly.

Curious how you can leverage it? Continue reading.

Enter FLIP-304: Turning failures into actionable insights

A developer encounters a crash in their Flink application and wonders, “Is this my code or an infrastructure issue?” FLIP-304 provides the answer. With pluggable failure enrichers, Flink not only lets you pinpoint failure triggers—whether it’s a coding mistake or an operational hiccup—but also supports proactive alerting. For instance, you can configure notifications to alert a specific on-call team via tools like PagerDuty if specific errors or error patterns occur, streamlining both debugging and incident response.
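To make the alerting idea concrete, here is a minimal sketch of routing a failure to an on-call team based on its labels. It assumes the label key and values from the example above (failure.type, UserError); the AlertRouter class, the SystemError value, and the team names are hypothetical and are not part of Flink or PagerDuty. The actual notification call is left out on purpose.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical router, not a Flink or PagerDuty API: maps failure labels to an on-call team. */
final class AlertRouter {
    // Assumed label values and team names; use whatever your enricher actually emits.
    private static final Map<String, String> TEAM_BY_TYPE = Map.of(
            "UserError", "app-team-oncall",
            "SystemError", "platform-oncall");

    /** Returns the team to notify, or empty if the labels carry no known failure type. */
    static Optional<String> route(Map<String, String> labels) {
        // Optional.map yields empty when the lookup returns null (unknown type).
        return Optional.ofNullable(labels.get("failure.type")).map(TEAM_BY_TYPE::get);
    }
}
```

Returning an empty Optional for unknown types keeps the decision about a default escalation path with the caller rather than hiding it inside the router.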

Step by step

Step 1: Set up the configuration

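To enable FLIP-304, list your enrichers in Flink’s configuration. Enrichers are loaded through Flink’s plugin mechanism, so the class must also be packaged as a plugin alongside a FailureEnricherFactory:

jobmanager.failure-enrichers: com.example.CustomFailureEnricher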

Step 2: Write your custom enricher

Next, write your custom FailureEnricher implementation. Here’s a quick example:

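import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

import org.apache.flink.core.failure.FailureEnricher;
import org.apache.flink.util.ExceptionUtils;

public class CustomFailureEnricher implements FailureEnricher {
    // Label key this enricher emits; every emitted key must be declared in getOutputKeys().
    private static final String TYPE_KEY = "TYPE";

    @Override
    public Set<String> getOutputKeys() {
        return Collections.singleton(TYPE_KEY);
    }

    @Override
    public CompletableFuture<Map<String, String>> processFailure(Throwable cause, Context ctx) {
        final Map<String, String> labels = new HashMap<>();
        // Treat a ClassCastException anywhere in the cause chain as a user-code error.
        if (ExceptionUtils.findThrowable(cause, ClassCastException.class).isPresent()) {
            labels.put(TYPE_KEY, "USER");
        } else {
            labels.put(TYPE_KEY, "SYSTEM");
        }
        return CompletableFuture.completedFuture(labels);
    }
}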

This small snippet labels a failure as USER when a ClassCastException appears anywhere in its cause chain, and as SYSTEM otherwise. The labels map is where you tag failures with key-value metadata.
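The classification rule itself does not depend on Flink, so it can be pulled out and unit-tested on its own. The following is a minimal sketch under that assumption: FailureClassifier is a hypothetical helper, and its hasCause method only approximates what ExceptionUtils.findThrowable does by walking the cause chain with plain JDK calls.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of Flink: classifies a failure by scanning its cause chain. */
final class FailureClassifier {
    static final String TYPE_KEY = "TYPE";
    // Upper bound on the walk so a cyclic cause chain cannot loop forever.
    private static final int MAX_DEPTH = 64;

    /** Returns true if the failure or any of its causes is an instance of the given type. */
    static boolean hasCause(Throwable failure, Class<? extends Throwable> type) {
        int depth = 0;
        for (Throwable t = failure; t != null && depth < MAX_DEPTH; t = t.getCause(), depth++) {
            if (type.isInstance(t)) {
                return true;
            }
        }
        return false;
    }

    /** Produces the same USER/SYSTEM label as the enricher above. */
    static Map<String, String> classify(Throwable failure) {
        Map<String, String> labels = new HashMap<>();
        labels.put(TYPE_KEY, hasCause(failure, ClassCastException.class) ? "USER" : "SYSTEM");
        return labels;
    }
}
```

With the rule isolated like this, the enricher's method body could shrink to a single call that wraps classify(cause) in a completed future.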

Step 3: Query enriched data

To see enriched exceptions, query the JobManager’s REST API with this command:

curl http://<jobmanager-address>:8081/jobs/<job-id>/exceptions  

Response example:

{  
  "failures": [  
    {  
      "timestamp": 1693048514000,  
      "message": "Task failure in slot 3",  
      "labels": {  
        "failure.type": "UserError",  
        "priority": "High"  
      }  
    }  
  ]  
} 

This enriched data integrates with monitoring tools like Grafana and Prometheus, and with the Flink UI itself, giving you visibility into your system's failures.
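One way to get labels onto a dashboard is to count failures per label value and export those counts as metrics. The sketch below assumes you already have each failure's labels as a map, for example parsed from the response above. FailureLabelCounter is a hypothetical class, not a Flink or Prometheus API, and it leaves the export step to whichever metrics library you use.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical aggregator: counts failures per label key/value pair. */
final class FailureLabelCounter {
    // Keyed by "key=value"; assumes label keys do not themselves contain '='.
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    /** Records one failure by incrementing the counter of every label it carries. */
    void record(Map<String, String> labels) {
        labels.forEach((key, value) ->
                counts.computeIfAbsent(key + "=" + value, k -> new LongAdder()).increment());
    }

    /** Returns how many recorded failures carried the given label, or 0 if none did. */
    long count(String key, String value) {
        LongAdder adder = counts.get(key + "=" + value);
        return adder == null ? 0L : adder.sum();
    }
}
```

ConcurrentHashMap with LongAdder keeps the counter safe to update from several threads without explicit locking, which matters if failures are recorded from a polling thread while another thread reads the totals.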


Why FLIP-304 matters

FLIP-304 isn’t just about making failures easier to understand—it’s about empowering you to:

  • Identify whether issues stem from code, infrastructure, or operations.

  • Use enriched data to build targeted dashboards and alerts.

  • Reduce debugging time and improve overall system reliability.

FLIP-304’s pluggable design also lets you build more sophisticated solutions. For example, you can extend the enrichment logic with custom code to implement your own error and exception handling strategies. An enricher can pull in additional dependencies, recognize new error types, or integrate with external systems to create a dedicated response for specific scenarios, all of which helps improve system reliability.

Ready to start? Want to learn more?

Failures can tell you a lot, if you’re ready to listen. FLIP-304 gives you more than a late-night page about an exception: it gives you the insights and metadata to understand what to fix. New to Flink? Start with the basics, then explore Confluent Cloud to level up your Flink application failure management.

Apache®, Apache Flink®, and Flink® are registered trademarks of the Apache Software Foundation.

  • Adi is a Director of Advocacy and DevX engineering at Confluent. She specializes in data streaming, analytics, and the new generation of AI systems. You can find her on various social media platforms @adipolak.
