[Webinar] Bringing Flink to On-Prem and Private Clouds | Register Now

A Guide to Stream Processing and ksqlDB Fundamentals

Written By

Event streaming applications are a powerful way to react to events as they happen and to take advantage of data while it is fresh. However, they can be a challenge to develop, deploy, and maintain. ksqlDB, the database purpose built for data in motion, allows us to build and deploy these applications using familiar SQL syntax. An easy way to get up to speed on ksqlDB is with the ksqlDB 101 course on Confluent Developer. This course offers a mix of lectures and hands-on exercises that are both fun and informative.

Introduction to ksqlDB

We’ll start with an overview of what ksqlDB is and what we can build with it. We’ll then see how ksqlDB splits up its distributed compute layer and its distributed storage layer and how that enables us to build rich streaming applications backed by Apache Kafka®.

Interacting with ksqlDB

When using ksqlDB in Confluent Cloud, we can use the Confluent Cloud Console to build and manage our event streaming applications. The graphical editor makes it easy to iterate on our queries as well as to create new streams and tables. There are also graphical tools for monitoring the flow of our applications while they are running. We can also use the ksqlDB command line interface (CLI), which works with both Confluent Cloud and self-managed ksqlDB clusters. Anything that we can do with the UI can be done with the CLI—it just might not be as pretty. Another option for interacting with ksqlDB is the native Java client, which we can include in our custom Java applications. In addition, ksqlDB offers a REST API that can be called from any number of tools or from applications written in other programming languages.

Try It Out for Yourself
To see all of this in action, check out the exercise Interacting with ksqlDB. You can sign up for Confluent Cloud using the promo code 101KSQLDB for $101 of free usage.

Creating, importing, and exporting data

To get data into our applications, we can create new streams and tables and tell ksqlDB to create the backing Kafka topics, or we can create streams or tables based on existing topics. Then we can insert data using SQL INSERT statements, or we can operate on data that is being produced to the underlying topics by other applications. Another option is to use ksqlDB’s Kafka Connect integration to pull data from or send data to any external system for which there is a connector.

Filtering events with ksqlDB

Filtering in ksqlDB is as simple as using a SQL WHERE clause, as in SELECT * FROM stream WHERE field = 'value';. We can go further though and use this filter action to create a new stream with its own backing Kafka topic, like so: CREATE STREAM filtered AS SELECT * FROM unfiltered WHERE field = 'value';. Now every event in the unfiltered stream that matches our predicate will become part of our filtered stream.

Try It Out for Yourself
To learn more, follow the hands-on exercise Filtering with ksqlDB.

Lookups and joins with ksqlDB

In most cases, the data that we need to work with comes from more than one source. For example, it may be that we have a stream of events that contain some of the information we need but that other information resides in a lookup table. This table can be loaded into a ksqlDB table and used to enrich that stream of events using familiar SQL JOIN syntax. ksqlDB allows us to join multiple streams and tables to derive the exact data that we need for our stream processing.

Try It Out for Yourself
If you’d like to get started with lookups and joins using ksqlDB, be sure to check out the exercise on Confluent Developer.

Transforming data with ksqlDB

Using ksqlDB, we can derive new streams of data from existing streams and tables, and we can also transform the data that is going into our new streams. We can drop fields, combine fields with CONCAT, rename them with AS, or change their data type with CAST. There are a host of other built-in functions that we can use to get exactly the data we need in the shape that works best for our applications.

Try It Out for Yourself
Try some streaming SQL transformations for yourself with this exercise.

Flatten nested records

Sometimes our data can get pretty complicated, but that’s not a problem for ksqlDB, which supports ARRAY, MAP, and STRUCT data types. If we need to transform a nested structure into a simpler form, we can use the arrow -> operator combined with AS to turn nested fields into top-level fields. Our new stream will contain the simplified form while the original stream remains unchanged. This flexibility allows us to meet new requirements without breaking existing downstream applications.

Try It Out for Yourself
It’s one thing to read about how easy it is to flatten nested records with ksqlDB, but you can also try it yourself with this hands-on exercise.

Converting data formats with ksqlDB

With ksqlDB, derived streams are not limited to the data format of their source streams. We may have a stream of JSON events, but what if a downstream application needs them to be in AVRO? No problem. Using the VALUE_FORMAT property, we can specify the format for any new stream or table. We can also define or modify schemas along the way.

Try It Out for Yourself
ksqlDB even enables us to convert a stream into CSV! Check it out in this exercise.

Merging streams with ksqlDB

We all know we’re not supposed to cross the streams, but sometimes we do need to merge them. If we have multiple streams of similar data, we can combine them into one stream for more efficient processing. Just as we can use CREATE STREAM ... AS SELECT to create a new stream from an existing one, we can also add to that stream using INSERT INTO ... SELECT. We can do this with each stream that we need to merge until we have the dataset that we need.

Splitting streams with ksqlDB

What SQL has joined let SQL put asunder. If we have a single stream that represents data that we need to differentiate, we can derive multiple streams based on the differentiator. For example, if we have a stream of sales data that we need to split into multiple streams based on region, we can do this with a series of queries using CREATE STREAM ... AS SELECT ... WHERE.

Try It Out for Yourself
The following exercises will give you hands-on experience merging and splitting streams.

Streams and tables

While its underlying storage is Kafka topics, ksqlDB represents the data in those topics as streams or tables. Streams are an unbounded series of events, mapping closely to the underlying topic. The events in a stream may or may not have a key. Tables, on the other hand, require a key and represent the most recent value for each key. As new events land in a table’s underlying topic, the value for its key is updated or a new entry is added if the key had not previously existed in the table. Some datasets are easier to work with as a table, such as products, while others, such as orders, make more sense as a stream. Both are useful abstractions that help us build powerful applications.

Stateful aggregations

One way to derive a table in ksqlDB is to use a SQL aggregation, such as COUNT or SUM. There are several aggregation functions available and they will all result in a table. We can use CREATE TABLE AS to turn an aggregation query into a persistent table that will always contain the most recent results of the aggregation. The key to this table will be the field or fields in the GROUP BY clause.

Try It Out for Yourself
The Stateful Aggregations exercise will give you an opportunity to build your own materialized view.

Push and pull queries

Once we have a table that has been materialized from an aggregation, we can query it to get the current value for a given key, just like we would with a relational database. In ksqlDB, this is called a pull query, and it works like you’d expect. But since ksqlDB is an event streaming database, it also provides a query that returns a continuous stream of results. This is called a push query, and it is generated by adding the EMIT CHANGES clause, like so: SELECT * FROM stream EMIT CHANGES. This query will return new results as they arrive in the underlying stream.

Try It Out for Yourself
To really appreciate the value of push queries, it’s best to see them in action. You can do that with this exercise.

Under the covers

ksqlDB might look a bit like magic, but there is some tried and true technology under the covers. We’ll explore that in this module, but for a deeper look, check out the course Inside ksqlDB, available on Confluent Developer.

ksqlDB code lifecycle

Once we have our event streaming application up and running, we will need to be able to maintain it. For this, we can take advantage of ksql-migrations, a scriptable command line tool for managing ksqlDB streams and tables.

Apply lambda functions to arrays and maps

We’ve already seen how ksqlDB enables us to perform filtering, transformations, and aggregations on the events in a stream, but what about when our events contain arrays or maps and we need to do these same actions on each of their elements? For that, we have lambda functions. With lambda functions, we can filter the contents of an array or map field, we can transform each element, or we can run a reduce against all of the elements.

Try It Out for Yourself
Lambdas in ksqlDB are really powerful, and also really fun to work with.  See for yourself with this hands-on exercise.

Next steps

To learn more about what you can do with ksqlDB and to get hands-on practice with it:

Get Started

  • After 30 years as a developer, architect, project manager (since recovered), author, trainer, conference organizer, and homeschooling dad, Dave Klein landed his dream job as a developer advocate with Confluent. After two years with Confluent, Dave joined Tabular, where he helps developers use Apache Iceberg to get more value from their data and have more fun doing it.

Did you like this blog post? Share it now