Confluent’s own Bill Bejeck has recently completed Kafka Streams in Action, a book about building real-time applications and microservices with the Kafka Streams API. The book is scheduled to be available next month, but Manning Publications has kindly agreed to let us share my foreword to the book here on the Confluent blog. If you’re interested in learning more about Kafka Streams (and you should be!), this book is an excellent way to get started.
We are witnessing the emergence of a new paradigm for building modern data-centric applications—event-driven architectures powered by Apache Kafka®. The traditional approach to building tightly coupled request/response applications around data transactions is being rapidly augmented by event-driven applications that are instead centered around business events, leading to a far more responsive, flexible, decoupled application architecture. We can now see event-driven applications at the heart of next-generation architectures in every industry imaginable. Big retailers are re-working their fundamental business processes around continuous event streams; car companies are collecting and processing real-time event streams from internet-connected cars; and banks are rethinking their fundamental processes and systems around Kafka as well.
What’s powering this paradigm shift? If the world has lived with request/response applications and batch computing for so long, why is this happening now?
First, businesses are far more digital than they used to be. And this digitization is quickly going beyond the domain of traditional enterprise software systems. The “Internet of Things” refers to many changes, but more than anything, it refers to the process of automatically instrumenting things happening in the physical world—from manufacturing processes to automobiles to retail inventory. Aspects of businesses that were once off limits to computers are becoming sources of continuous event streams.
Second, there is also a transformation in how data is used inside organizations. In a world where action was taken only by humans, periodic batch-computed reporting might be plenty fast; after all, no human will check a report more than once a day. But in a company that has become significantly digital, where actions are often taken not only by humans but automatically by software, this kind of delay makes no sense. This shift towards modeling processing around event streams actually makes things simpler. The core data of most modern businesses is most naturally thought of as a continuous event stream (of sales, customer experiences, shipments, etc.). It is natural that the processing and analysis of this data—via event-driven applications—would be continuous and real-time as well.
The final driver of the change is that the technology for supporting stream processing has gotten much better. Early systems, such as enterprise messaging systems and complex event processing engines, existed in this space but were quite limited in the types of problems they could handle. As part of building Apache Kafka, we have learned a great deal about how to create stream processing systems so that they can not only scale horizontally to company-wide usage, but also so that their capabilities are truly a superset rather than a subset of what was possible with batch processing. This area is still evolving rapidly, but forward-thinking companies are already implementing production stream processing applications in critical domains. The rise of Apache Kafka as an Event Streaming Platform that powers stream processing is evidence of this trend of stream processing going mainstream.
So what is stream processing? And what role does it play in enabling an event-driven architecture?
Stream processing is a data processing paradigm that views data less as something kept in static stores and more as something that flows and is processed continuously. Stream processing manifests itself in event-driven applications that power the business, as well as in real-time analytics that report on the business.
Stream processing requires a fundamental shift in our thinking, away from command thinking and towards event thinking: a cultural change that enables responsive, event-driven, extensible, flexible, and real-time applications. In business, event thinking opens organizations to real-time, context-sensitive decision making and operations. In technology, event thinking can produce more autonomous and decoupled software applications and, as a consequence, elastically scalable and extensible systems. In both cases, the ultimate benefit is greater agility—of the business and of the business-facilitating technology. Applying event thinking to an entire organization is the foundation of the event-driven architecture. And stream processing is the technology that enables this transformation.
So what is Kafka Streams? Why did we build it? And how is it different from all the other Big Data processing frameworks?
Kafka Streams is Apache Kafka’s native stream processing library for building event-driven applications in Java that process data in Kafka topics. Applications that use this library can perform sophisticated transformations on data streams that are automatically made fault-tolerant and are transparently and elastically distributed over the instances of the application. Since its initial release in the 0.10 version of Apache Kafka back in 2016, many companies have put Kafka Streams into production, including Pinterest, The New York Times, Rabobank, LINE, and many more.
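To give a feel for what such an application looks like, here is a minimal sketch of a Kafka Streams program. It simply uppercases the value of every record flowing through a topic; the topic names, broker address, and the transformation itself are illustrative placeholders, not anything from the book:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class UppercaseApp {
    public static void main(String[] args) {
        // Minimal configuration: an application id and the Kafka brokers to talk to.
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");     // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Define the topology: read a topic, transform each value, write to another topic.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");
        input.mapValues(value -> value.toUpperCase())
             .to("output-topic");

        // Start processing; scaling out is just running more instances of this same program.
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Note that this is an ordinary Java program: there is no cluster to submit a job to, and running additional copies of the same application is how the work gets distributed and made fault-tolerant.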
There are a wide variety of technologies, frameworks, and libraries for building applications that process streams of data. Big Data processing frameworks such as Storm, Flink, and Spark all have their pros and cons. Kafka’s approach to stream processing—be it Kafka Streams or KSQL—is pretty different from these big data frameworks. Our goal with Kafka Streams and KSQL is to make stream processing simple enough that it can be a natural way of building event-driven applications that respond to events, not just a heavy-weight “big data” thing.
Why this approach? Because we tried building a stream processing framework by modeling it as a faster MapReduce layer, and failed. Apache Samza was our initial take on stream processing with Apache Kafka. Well, technically our second take, as the first one was a prototype I called KafkaMR. But adoption of Apache Samza, even inside the company where it was introduced, didn’t quite pan out as expected. The reason is that the big data approach frames stream processing as a problem of multiplexing a set of jobs over a cluster of machines, whereas stream processing is most naturally expressed as event-driven applications whose business logic embeds the ability to process event streams.
Hence, rather than model stream processing applications as jobs running on a big data framework, Kafka models them after REST applications. Rather than conceive of stream processing as a kind of Java big data processing framework that happens to transmit streams of data, we invert this: in our model, the primary entity isn’t the processing code at all, it’s the streams of data in Kafka.
Kafka provides a versioned, widely adopted protocol for correct, distributed, fault-tolerant, stateful stream processing. This protocol can be used from any language. One example is the Kafka Streams API in Apache Kafka; another is KSQL; and there are several more in Go, .NET, Python, Node, and other languages. Each of these isn’t all that much code, and the code is, in any case, essentially an implementation detail. As with a REST service, the contract is the input and output of the service, not the technology you used in the internals of your app. And that is why Kafka’s approach is such a powerful enabler of modern stream processing.
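To make the “contract is the input and output” point concrete: the output topic written by the Streams sketch above can be read by any ordinary Kafka client, in any language, with no knowledge of the processing code. As a rough illustration (the topic name, group id, and broker address are the same placeholders as before), a plain Java consumer might look like this:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class OutputTopicReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "output-reader");           // placeholder group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // The contract is the topic itself, not the code that produced it.
            consumer.subscribe(Collections.singletonList("output-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.printf("%s -> %s%n", r.key(), r.value()));
            }
        }
    }
}
```

The consumer never links against or depends on the Streams application; the two share only the topic and its data format, much as two services share only a REST endpoint.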
I believe that the event-driven architecture, centered around real-time event streams and stream processing, will become ubiquitous in the years ahead. Technically sophisticated companies like Netflix, Uber, Goldman Sachs, Bloomberg, and others have already built out this type of event streaming platform, many operating at massive scale. It’s a bold claim, but I think the emergence of stream processing and the event-driven architecture will have as big an impact on reworking how companies make use of data as relational databases did.
Still, learning event thinking and building event-driven applications oriented around stream processing is quite a mindshift if you are coming from the world of request/response style applications and relational databases. This book is a great way to learn about Kafka Streams, and learn how it is a key enabler of event-driven applications. I hope you enjoy reading it as much as I have!