Market data analytics has always been a classic use case for Apache Kafka®. However, new technologies have been developed since Kafka was born.
Apache Flink® has grown in popularity for stateful processing with low-latency output. Streamlit, a popular open-source component library and deployment platform, has emerged, providing a familiar Python framework for crafting powerful and interactive data visualizations. Acquired by Snowflake in 2022, Streamlit remains agnostic with respect to data sources.
We can take advantage of the growth in the data landscape and use all three of these technologies to create a performant market data application. This article walks through how to use Streamlit, Kafka, and Flink to create a live data-driven user interface.
In part 1 of this series, we’ll make an app, hosted on Streamlit, that allows a user to select a stock, in this case SPY, or the SPDR S&P 500 ETF Trust. Upon selection, a live chart of the stock’s bid prices, calculated every five seconds, will appear.
What are the pieces that go into making this work? The source of the data is the Alpaca Market Data API. We’ll hook up a Kafka producer to the websocket stream and send data to a Kafka topic in Confluent Cloud. Then we’ll use Flink SQL within Confluent Cloud’s Flink SQL workspace to compute an average bid price over five-second tumbling windows. Finally, we’ll use a Kafka consumer to receive that data and feed it to a Streamlit component in real time. This frontend component will be deployed on Streamlit as well.
We’ll use the market data websocket endpoint. There are ways to use REST APIs with Kafka—if you’re interested in that, give this demo a whirl. But we’d like our data transfer to be as instantaneous as possible, with the sub-second latency we’re used to with Kafka, so we don’t have time for REST API request and response cycles.
To see the data coming in from the websocket yourself, use websocat:
To subscribe to that endpoint, we call a subscribe function from the Alpaca API. This function takes a callback, which we specify as a partial function, fn, because we need to pass the stockname to the handler:
This in turn specifies the quote_data_handler function. This is where the data from the websocket will flow.
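To make the partial-function idea concrete, here’s a minimal sketch using only the standard library. The handler body, the field names on the incoming quote (bid_price, timestamp), and the sample values are assumptions for illustration, not the app’s actual code. In the real app, fn would be handed to the Alpaca client’s subscribe call rather than invoked directly.

```python
import asyncio
from functools import partial


async def quote_data_handler(stockname: str, data: dict):
    """Handle one quote; `stockname` is pre-bound with functools.partial."""
    # Keep only the three fields that are later sent to Kafka.
    record = {
        "bid_timestamp": data["timestamp"],
        "price": data["bid_price"],
        "symbol": data["symbol"],
    }
    # The stock name doubles as the Kafka topic name.
    return stockname, record


# The stream passes only the quote to its callback, so bind the stock name first.
fn = partial(quote_data_handler, "SPY")

# Stand-in for the websocket delivering a single quote (made-up values).
sample_quote = {"symbol": "SPY", "bid_price": 500.25, "timestamp": "2024-01-02T14:30:00Z"}
topic, record = asyncio.run(fn(sample_quote))
print(topic, record)
```

Binding the argument with partial keeps the handler’s signature compatible with a callback that receives a single argument, which is the reason the article gives for using it.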
OK, we’ve got the stock market records coming in from the websocket endpoint. Now, we have to produce them to a Kafka topic (which we’ve already set up in Confluent Cloud). We’ll instantiate the producer and set up a JSON serializer (using the Apache Kafka Python client), then feed it the topic name, which we’ll set to be the same as the stockname.
Now, when we check the Kafka topic in Confluent Cloud, we can see the messages coming in. They’re composed of three parts: bid_timestamp, price, and symbol.
Now that we have our data flowing into a Kafka topic, we need to process it. We want five-second tumbling windows. To achieve this, we’ll crack our knuckles and open up a Flink SQL workspace on Confluent Cloud.
Wait, what’s a Flink SQL workspace? Further, what’s Flink SQL? Well, let’s take a step back and look at what Flink is. It’s a stream processing framework specifically designed for handling complex, stateful streaming workloads. On a high level, Flink uses checkpointing to create snapshots of state and stores those instead of the whole state history, which makes it highly efficient.
There are three APIs of note here, each at a different level of abstraction, for interacting with Flink. As with most API groups, the higher-level APIs offer a faster onboarding experience at the expense of control. On the other hand, the lower-level APIs take more learning to use properly but offer more granular access to the underlying technology. At the lowest level of abstraction is the DataStream API, which offers developers an expressive way to use the elements of data streaming like windows and joins. One level up is the Table API, which centers around Flink tables and involves writing less code than the DataStream API. Flink SQL is at the highest level of abstraction. It allows you to use SQL as a declarative approach for implementing unified batch and stream workload processing.
For this project, we’ll use Flink SQL with Confluent Cloud. We’ll use Flink by provisioning a compute pool representing the resources used to run our SQL statements. We can create these statements in the workspace provided in Confluent Cloud’s user interface.
Now here’s a key thing to understand about Flink tables: they are not where data is stored. The data we’re processing is stored in a Kafka topic. That means the data we produce to Kafka topics needs a schema before Flink can process it.
Here’s what a JSON schema could look like for our topic with records including a price, a bid_timestamp, and a symbol.
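As a rough illustration, a schema for these three fields might be built in Python like this. The field names come from the article; the title, the draft version, and the field types (string timestamp, numeric price) are assumptions.

```python
import json

# Hypothetical JSON schema for the stock records; types and title are assumed.
schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "StockRecord",
    "type": "object",
    "properties": {
        "bid_timestamp": {"type": "string"},
        "price": {"type": "number"},
        "symbol": {"type": "string"},
    },
    "required": ["bid_timestamp", "price", "symbol"],
}

# Serializers generally expect the schema as a JSON string.
schema_str = json.dumps(schema)
print(schema_str)
```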
And here, as highlighted above when we were talking about the producer, you can see how it’s added to the producer: we register the schema, pass it to the JSON serializer, and finally use the serializer to serialize the produced message.
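Those steps can be sketched with the confluent-kafka Python package roughly as follows. This is a sketch under assumptions, not the app’s actual producer: the schema types and title, the helper names build_record and make_sender, and the two configuration dictionaries (which would hold your own Confluent Cloud and Schema Registry credentials) are all hypothetical.

```python
import json

# Assumed schema: field names come from the article, types and title are guesses.
SCHEMA_STR = json.dumps({
    "title": "StockRecord",
    "type": "object",
    "properties": {
        "bid_timestamp": {"type": "string"},
        "price": {"type": "number"},
        "symbol": {"type": "string"},
    },
    "required": ["bid_timestamp", "price", "symbol"],
})


def build_record(symbol: str, price: float, bid_timestamp: str) -> dict:
    """Shape one quote into the three-field record the schema describes."""
    return {"bid_timestamp": bid_timestamp, "price": price, "symbol": symbol}


def make_sender(kafka_conf: dict, registry_conf: dict):
    """Return (send, flush) functions backed by a Kafka producer."""
    # Imported here so build_record can be tried without the client installed.
    from confluent_kafka import Producer
    from confluent_kafka.schema_registry import SchemaRegistryClient
    from confluent_kafka.schema_registry.json_schema import JSONSerializer
    from confluent_kafka.serialization import MessageField, SerializationContext

    registry = SchemaRegistryClient(registry_conf)
    # With default settings the serializer registers the schema on first use.
    serializer = JSONSerializer(SCHEMA_STR, registry)
    producer = Producer(kafka_conf)

    def send(topic: str, record: dict) -> None:
        ctx = SerializationContext(topic, MessageField.VALUE)
        producer.produce(topic=topic, key=record["symbol"], value=serializer(record, ctx))
        producer.poll(0)  # serve delivery callbacks without blocking

    return send, producer.flush


record = build_record("SPY", 500.25, "2024-01-02T14:30:00Z")
print(record)
```

Calling make_sender with real credentials returns a send function you could invoke from the quote handler, using the stockname as the topic, plus a flush function to call on shutdown so buffered messages are delivered.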
Once that’s done, we can create a table and then process the data in the Kafka topic using windowing. Here’s the syntax. Let’s go through it line by line.
[1] Here, we’re inserting the result into the destination table.
[2] Here, we select four values from the source table. symbol identifies the stock name. window_start is the start of the window, formatted as a date string (note that this will be in event time as gleaned from the app), and window_end is the end of that window, formatted the same way. The fourth value is the average price over the window. We’re formatting the dates here because it makes them easier to display in the front end without having to massage the message as much.
[3] This specifies the source table, the window interval, and, via DESCRIPTOR, the time attribute that drives the watermarking strategy. $rowtime is the value of the Kafka record timestamp, provided by the technology behind Confluent Cloud.
[4] We group the results by the symbol, window_start, and window_end columns.
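To build intuition for what the statement computes, here’s a pure-Python sketch of a five-second tumbling average grouped by symbol and window. It is not Flink SQL and it works on a finished list rather than an unbounded stream (Flink does this incrementally, using watermarks to decide when a window is complete). The function names and sample quotes are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(seconds=5)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its five-second tumbling window."""
    return ts - ((ts - EPOCH) % WINDOW)


def tumble_average(rows):
    """Average price per (symbol, window) for rows of (symbol, event_time, price)."""
    groups = defaultdict(list)
    for symbol, ts, price in rows:
        # Each row lands in exactly one window: tumbling windows never overlap.
        groups[(symbol, window_start(ts))].append(price)
    return [
        {
            "symbol": symbol,
            "window_start": start,
            "window_end": start + WINDOW,
            "price": sum(prices) / len(prices),
        }
        for (symbol, start), prices in sorted(groups.items())
    ]


t0 = datetime(2024, 1, 2, 14, 30, 0, tzinfo=timezone.utc)
rows = [
    ("SPY", t0 + timedelta(seconds=1), 500.0),
    ("SPY", t0 + timedelta(seconds=4), 501.0),
    ("SPY", t0 + timedelta(seconds=6), 503.0),
]
result = tumble_average(rows)
for row in result:
    print(row)
```

The first two quotes fall in the window starting at 14:30:00 and the third in the window starting at 14:30:05, so the output has one averaged row per window, mirroring the grouping in step [4].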
Flink tables are a description of how to view the data stored in Kafka. So really, we don’t have to ‘get the data back into’ a Kafka topic, as the topic is created to store the data once we make the table. The data processed by Flink SQL is not stored in a Flink SQL table.
And the data, stored in a tumble_interval_SPY topic, consists of records that look like this after Flink processing:
That’s the information we need for our live chart! The price will be represented by the y-axis, and the window start and end timestamps provide the values for the x-axis.
That means that we can consume data from our final destination, the Streamlit app, right away … or can we?
The producer and consumer run on two different threads, and without the asyncio library in use, we weren’t able to run them at the same time from the same Streamlit application.
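As a general illustration of running a producer and a consumer side by side with asyncio, here’s a minimal standard-library sketch. It is not the solution this series arrives at: an in-memory asyncio.Queue stands in for the Kafka topic, and the coroutine names and prices are made up.

```python
import asyncio


async def produce(queue: asyncio.Queue, prices) -> None:
    """Stand-in for the Kafka producer: push each price, then a stop marker."""
    for price in prices:
        await queue.put(price)
        await asyncio.sleep(0)  # yield so the consumer gets a turn
    await queue.put(None)  # sentinel: nothing more to send


async def consume(queue: asyncio.Queue, chart_points: list) -> None:
    """Stand-in for the Kafka consumer: collect points until the stop marker."""
    while (price := await queue.get()) is not None:
        chart_points.append(price)


async def main() -> list:
    queue = asyncio.Queue()
    chart_points = []
    # gather runs both coroutines concurrently on a single event loop.
    await asyncio.gather(
        produce(queue, [500.0, 500.5, 501.0]),
        consume(queue, chart_points),
    )
    return chart_points


points = asyncio.run(main())
print(points)
```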
Now Streamlit itself is multithreaded, and in fact, this behavior caused us to run into a difficulty with the Alpaca rate limits. We’ll talk about that, as well as the solution to running the Kafka producer and consumer in the Streamlit app, in our next installment on this topic. We’ll also tell you what we learned about handling multithreading from our colleague, Gilles Philippart.
Beyond that, in part 2 we’ll complete our journey through the project by examining how we surface the data to Streamlit using a bit of data visualization.