The term “event” shows up in a lot of different Apache Kafka® arenas. There’s “event-driven design,” “event sourcing,” “designing events,” and “event streaming.” What is an event, and how does the role it plays differ across these contexts?
We can speak broadly, maybe even a little philosophically, about what events are. Events are “things that happen,” or sometimes, they are otherwise defined as representations of facts. All data is, in a way, a result of humans trying to grok events. At the same time, I honestly don’t find the definition helpful if we leave it at this level. Do we ever design apps around things that don’t happen? (Don’t think about that too hard.)
Let’s get concrete. Events that might affect real-time data pipelines and applications include things like Pinterest saves, USPS address changes, ship coordinate updates, and credit card transactions.
Now, in the Kafka ecosystem, an event is represented by something that looks like this:
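The original illustration is not reproduced here. As a rough stand-in, the following Python sketch models an event as a key, a value, and a timestamp. The `Event` class, its field names, and the sample data are illustrative assumptions, not part of any Kafka client API.

```python
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """Hypothetical in-memory model of an event record."""
    key: str                 # used to pick a partition
    value: dict              # the payload describing what happened
    timestamp_ms: int        # when the event occurred, in epoch milliseconds
    headers: dict = field(default_factory=dict)  # optional metadata


event = Event(
    key="customer-42",
    value={"type": "address_changed", "new_address": "221B Baker Street"},
    timestamp_ms=1_700_000_000_000,
)

# Events travel as bytes, so the value is serialized (JSON here, as an assumption).
serialized = json.dumps(event.value).encode("utf-8")
```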
These events usually have partition keys and values and a timestamp, and the whole thing looks suspiciously like an object. So what makes an event different from an object?
The answer lies in how we treat events. The values we choose to store in an event matter. The way we disseminate (stream) events across a system matters. Both affect the design of the overall system or pipeline. Instead of making a request to see if an object has been updated, our design reacts in real time to generated events. There are three major approaches to using events in this way:
Designing events: Designing events means carefully choosing the conceptual model for your event representation based on the role the event has to play.
Event streaming: You can stream events in real time, as well as aggregate, filter, and join multiple streams. This process is called event streaming.
Event-driven design: Event-driven design means designing your architecture in a manner that’s informed by the reactive nature of events.
The following paragraphs go a little deeper into each of these terms.
So, what values go in an event?
That’s the key question in event design.
The answer is informed by the overall structure of your project. We’ll look at some of the details in a second, but the most important aspect is that the values in your event reflect the reactive nature of an event-driven app and take the consumer’s perspective into account. The rest of your architecture listens for new events, so each new event can make the change it represents explicit in its key, for example an event with a key of updated_address.
The event structure can change depending on whether the event is meant to be read internally or externally. For example, it might be OK to include some key changes that result in a tight coupling to the data source internally, but that might get problematic for another team developing a different service.
Relationships among events can also affect the design of your event. If you’ve normalized your event source database, consumers will need to resolve foreign keys and shuffle data, resulting in a tight coupling to your internal data system. To decouple your consumers, you can apply well-established implementation patterns, such as denormalizing the data at the source or introducing an abstraction layer.
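As a minimal sketch of what denormalizing means for a consumer, the snippet below resolves a foreign key before the event is published, so the consumer never has to look it up. The `customers` and `orders` data and the `denormalize` helper are hypothetical and stand in for whatever your source system holds.

```python
# Hypothetical normalized source data: orders reference customers by ID.
customers = {7: {"name": "Ada", "city": "London"}}
orders = [{"order_id": 1001, "customer_id": 7, "total": 25.0}]


def denormalize(order, customers):
    """Build a self-contained event by resolving the customer foreign key."""
    customer = customers[order["customer_id"]]
    return {
        "order_id": order["order_id"],
        "total": order["total"],
        "customer_name": customer["name"],
        "customer_city": customer["city"],
    }


# Consumers of these events no longer depend on the internal customer table.
events = [denormalize(order, customers) for order in orders]
```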
You’ll also want to consider whether each topic holds a single type of event, or whether you store multiple types of events in a single topic so that their relative order can be preserved (Kafka guarantees ordering only within a partition). Note that using one topic can result in strong coupling.
The relationship between events and consumers shapes event design through the flow of data. Discrete flows determine state changes within the consumer app’s state machine (stopping if an order is canceled, for example). Continuous flows, on the other hand, are not ways to manage the application state but rather are a series of independent events (think of application logs or temperature reads).
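The difference between the two flows can be sketched as follows. The order states, event type names, and transition table are assumptions made up for illustration. The discrete flow drives a small state machine that stops changing once the order is canceled, while the continuous flow treats each temperature reading as independent.

```python
# Discrete flow: each event may move the order through a state machine.
TRANSITIONS = {
    ("created", "order_paid"): "paid",
    ("paid", "order_shipped"): "shipped",
    ("created", "order_canceled"): "canceled",
    ("paid", "order_canceled"): "canceled",
}


def apply_discrete(state, event_type):
    """Return the next state; a canceled order ignores all later events."""
    if state == "canceled":
        return state
    return TRANSITIONS.get((state, event_type), state)


state = "created"
for event_type in ["order_paid", "order_canceled", "order_shipped"]:
    state = apply_discrete(state, event_type)

# Continuous flow: readings are independent, so there is no state machine,
# only computations over the series (an average here).
readings = [21.5, 21.7, 22.1]
average = sum(readings) / len(readings)
```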
Finally, there are some best practices to keep in mind:
Use schemas to ensure uniformity of events across your data project
Use headers and timestamps so you know where and how an event was generated
Use event IDs for consumer use cases so that stream audits can reveal missing or duplicate events
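To illustrate the last practice, here is a small audit sketch. It assumes each event carries a sequential integer ID, which is only one possible ID scheme, and the `audit` function is a hypothetical helper rather than a Kafka feature.

```python
from collections import Counter


def audit(event_ids):
    """Return (missing, duplicates) for a list of sequential integer event IDs."""
    if not event_ids:
        return [], []
    counts = Counter(event_ids)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    expected = range(min(counts), max(counts) + 1)
    missing = [i for i in expected if i not in counts]
    return missing, duplicates


# ID 3 never arrived and ID 2 arrived twice.
missing, duplicates = audit([1, 2, 2, 4, 5])
```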
When there’s a repeating phenomenon you want to track, you can store the record at rest inside something like a relational database or you can record a stream of events reflecting that phenomenon.
Relating multiple streams of events and creating new ones through operations like filtering and aggregation is what’s called stream processing. In Kafka, event streams live in topics, which store events in a continuous log. Operations across streams are often performed by a client library like the Kafka Streams Java API, or by a stream processing engine like ksqlDB or Flink.
Consider this example of an event stream: Say you’re building a mobile application that lets users review open source GitHub PRs from their phones. You want them to be alerted as soon as a PR is made to any repo they’ve opted into. You also want the user to be aware of any links to that particular PR made in the Discussions or Issues sections on GitHub. You can create streams composed of events from Discussions and Issues and join them to the PR stream. The result is an Updates stream that surfaces the relevant data in the PR review interface.
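The join in this example can be sketched in plain Python. This is a simplification: it joins two in-memory lists once, whereas a stream processor would join continuously (and typically within a time window) as events arrive. The field names, sample records, and `join_updates` function are hypothetical.

```python
from collections import defaultdict

# Hypothetical events from the PR stream and from Issues/Discussions.
pull_requests = [
    {"pr": 101, "repo": "example/repo", "title": "Fix typo"},
    {"pr": 102, "repo": "example/repo", "title": "Add tests"},
]
mentions = [
    {"pr": 101, "source": "issues", "url": "https://example.invalid/issues/7"},
    {"pr": 101, "source": "discussions", "url": "https://example.invalid/discussions/3"},
]


def join_updates(pull_requests, mentions):
    """Attach every mention to the PR it links to, keyed by PR number."""
    by_pr = defaultdict(list)
    for mention in mentions:
        by_pr[mention["pr"]].append(mention)
    for pr in pull_requests:
        yield {**pr, "mentions": by_pr.get(pr["pr"], [])}


updates = list(join_updates(pull_requests, mentions))
```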
Event-driven design refers to the use of one or more patterns to build architecture in a way that’s “aware” of events. Below are some of the most common patterns:
The event sourcing approach uses events as a data storage model. This means that the events are restricted to a single app with a single data source. Add data streaming, and you make those metadata events available across your entire system. As Martin Fowler says: “Event sourcing works on your data the same way version control works on your code.”
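A minimal sketch of the idea, using a made-up account ledger: the event list is the source of truth, and the current state is derived by replaying it. Replaying only a prefix recovers an earlier state, much like checking out an older commit. The event types and the `replay` helper are assumptions for illustration.

```python
# The append-only list of events is the stored data; balance is derived.
events = [
    {"type": "deposited", "amount": 100},
    {"type": "withdrawn", "amount": 30},
    {"type": "deposited", "amount": 5},
]


def apply(balance, event):
    """Fold a single event into the current balance."""
    if event["type"] == "deposited":
        return balance + event["amount"]
    if event["type"] == "withdrawn":
        return balance - event["amount"]
    return balance  # unknown event types leave the state unchanged


def replay(events, upto=None):
    """Rebuild the balance from the first `upto` events (all if None)."""
    balance = 0
    for event in events[:upto]:
        balance = apply(balance, event)
    return balance


current_balance = replay(events)
balance_after_two = replay(events, upto=2)
```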
Event notification describes a pattern where you’ve got part of your architecture “listening” for an event and executing logic in reaction to it. For example, instead of sending a request to a database every few seconds to see if a user object has new Pinterest saves, your client listens for new “save” events and executes logic whenever one arrives.
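The listening side of this pattern can be sketched with a tiny in-process publish/subscribe bus. The `EventBus` class is a hypothetical stand-in for a real broker and client; it only shows that the handler runs when an event is published, with no polling involved.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical in-process stand-in for a broker plus consumer client."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Every handler registered for this event type reacts immediately.
        for handler in self._handlers[event_type]:
            handler(payload)


notifications = []
bus = EventBus()
bus.subscribe("save", lambda p: notifications.append(f"{p['user']} saved {p['pin']}"))

bus.publish("save", {"user": "ada", "pin": "pin-1"})
bus.publish("unfollow", {"user": "ada"})  # no listener, so nothing happens
```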
Event-carried state transfer refers to a situation in which the state of the downstream systems is updated by incoming events. This is different from an event that carries a notification. A notification event for an e-commerce data pipeline might be something like customerEmailUpdate rather than something like currentCustomerEmail, which would carry an update with complete information on the customer object state.
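The contrast can be sketched from the consumer’s side. With a notification, the consumer has to go back to the source to learn the new value; with event-carried state transfer, the event itself is enough. The payload shapes, `SOURCE_DB`, and `lookup_email` are hypothetical.

```python
# Hypothetical source of truth that a notification forces the consumer to query.
SOURCE_DB = {7: "ada@example.com"}
lookup_calls = []


def lookup_email(customer_id):
    """Simulated call back to the source system."""
    lookup_calls.append(customer_id)
    return SOURCE_DB[customer_id]


notification = {"type": "customerEmailUpdate", "customer_id": 7}
state_transfer = {
    "type": "currentCustomerEmail",
    "customer_id": 7,
    "email": "ada@example.com",
}

local_cache = {}


def handle(event):
    """Update the consumer's local copy of the customer email."""
    if "email" in event:
        # State transfer: everything needed is in the event.
        local_cache[event["customer_id"]] = event["email"]
    else:
        # Notification: the consumer must fetch the state itself.
        local_cache[event["customer_id"]] = lookup_email(event["customer_id"])


handle(state_transfer)
calls_after_state_transfer = len(lookup_calls)
handle(notification)
calls_after_notification = len(lookup_calls)
```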
Command query responsibility segregation (CQRS) refers to separating the software processes that write data from those that read it: one client writes events, the results of its computations, to a common data structure, and a separate client reads them. CQRS can have performance benefits since commands can be handled on different nodes and read queries can be optimized independently. Using Kafka is a solid way to implement CQRS because Kafka’s pub/sub-like design effectively decouples data sources and sinks.
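A minimal sketch of that separation, with a plain Python list standing in for the shared log (in practice, a Kafka topic). The class names `OrderCommands` and `RevenueView` and the event shape are assumptions for illustration.

```python
class OrderCommands:
    """Write side: validates commands and appends events to the shared log."""

    def __init__(self, log):
        self._log = log

    def place_order(self, order_id, total):
        self._log.append({"type": "order_placed", "order_id": order_id, "total": total})


class RevenueView:
    """Read side: keeps its own query-optimized view, built from the log."""

    def __init__(self):
        self.total_revenue = 0.0
        self._offset = 0  # position of the next unread event

    def catch_up(self, log):
        for event in log[self._offset:]:
            if event["type"] == "order_placed":
                self.total_revenue += event["total"]
        self._offset = len(log)


log = []
commands = OrderCommands(log)
view = RevenueView()

commands.place_order(1, 10.0)
commands.place_order(2, 25.0)
view.catch_up(log)
```

Because the view tracks its own offset, calling `catch_up` again without new events leaves the total unchanged, and the read side can be rebuilt or scaled independently of the write side.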
This article provides an entryway into the world of events and event-driven architecture. Now, if you hear a new term with the word “event” in it, you can be almost certain it relates to a real-time, reactive data structure. If you’d like to venture further across the Kafka threshold, here are some resources related to this blog post:
Designing events and event streams – a comprehensive course that dives deep with exercises to give you confidence in your event design skills
Real-time GitHub commits – if you’d like to start building out the example mentioned earlier, this is a good tutorial to start
Learn how to model events in Java with this data streaming pattern
Apache Kafka 101 – get comfortable with the basics in this introductory course
Event streaming vs. related trends – learn about the other options available to you when it comes to designing your architecture and why you’d pick event streaming