We’re excited to talk about our vision for Tableflow, which makes it push-button simple to take Apache Kafka® data and feed it directly into your data lake, warehouse, or analytics engine as Apache Iceberg® tables. Making operational data accessible to the analytical world is traditionally a complex, expensive, and brittle process and we believe we can do better to unify the operational and analytical estates.
Tableflow removes this error-prone, duplicative work and converts Kafka topics and their associated schemas to Iceberg tables in one click. This is central to Confluent’s vision of building the world’s leading data streaming platform, one that fuels any operational and analytical workload with real-time data products.
Our partners are just as excited about Tableflow. At Kafka Summit London, we announced our Tableflow vision with a set of launch partners such as Snowflake, Amazon Athena, Dremio, Imply, Starburst, OneHouse, Tabular, and more. These partners have already built heavily around Iceberg, and Tableflow represents an incredibly convenient way of getting data into their platforms.
Tableflow is currently in early access, but we’re excited to share more about why we're investing here, what we’re building, and where we’re going.
We wanted to set some context since it might seem a little odd for the data streaming company to dive into the open-table space. Data in most organizations are typically split into two estates:
Operational estate: The SaaS apps, ERPs, custom applications, etc. that serve the needs of applications to transact with customers in real time
Analytical estate: The data warehouses, lakehouses, AI/ML platforms, and other custom batch workloads that support business analysis and reporting
The operational estate is where streaming has really grown up, built up by custom applications, SaaS tools, and other business-critical use cases. The analytical estate is a little different. It’s also often fed by streams but is a collection of different warehouses, query engines, reporting layers, and AI/ML platforms built on historical tables. In the past, there has been a tight coupling of tables (storage) and query engines. In recent years, we’ve witnessed the rise of “headless” data infrastructure where companies are building a more open lakehouse in cloud object storage that is accessible by any number of tools.
Just like the Apache Kafka API has evolved to be the de facto open standard for data streaming, we’re seeing Apache Iceberg evolve into the de facto open-table standard for large-scale datasets stored in lakehouses. We’ve seen its ecosystem grow with robust tooling and support from compute engines such as Apache Spark®, Snowflake, Amazon Athena, Dremio, Trino, Apache Druid®, and many others. We’ve also seen a large uptick in its adoption within our own user base, but we are open to supporting other formats if there is significant demand. For those who are currently interested in other table formats, Apache XTable (incubating) is a solution that may work for cross-generating metadata. We believe the rise of open-table formats and the “headless” data infrastructure is being driven by the needs of data engineers evolving beyond the tight coupling of table to computing platform. These factors made Apache Iceberg support a natural first choice for us.
The one thing that naturally ties the operational and analytical estates together is that the same data typically needs to be accessible to both. The challenge is that each estate usually wants to access that data in a different way. Organizations have been doomed to do the same work over and over again, joining apps together and ETLing data into lakehouses in a never-ending war against complexity.
Streaming operational data from Kafka to a lakehouse is often done with a collection of complex and brittle jobs.
This is a lot of engineering time, compute, and cost just to get one stream into a raw but usable state in your lakehouse. In addition to these factors, there is still business-specific logic, clean-up, and loading of the data into downstream tables to be served to end users. We believe we can do much better here.
Confluent data is already directly represented in object storage. This means there is traditionally a fair amount of data duplication between the stream and the lakehouse. So rather than forcing it to be manually read back and copied into another S3 bucket in Iceberg format, what if we just unified these two things?
Tableflow allows Confluent users to easily materialize their Kafka topics, and associated schemas, into Apache Iceberg tables. Our goal with Tableflow is to make it as easy as possible to feed your lakehouse.
Tableflow uses innovations in our Kora Storage Layer that give us the flexibility to take Kafka segments and write them out to other formats, in this case, Parquet files. Tableflow also utilizes a new metadata materializer behind the scenes that taps into Confluent’s Schema Registry to generate Apache Iceberg metadata while handling schema mapping, schema evolution, and type conversions. There is no more need for manual mappings that break every time the upstream app sends something new. Data quality rules are enforced upstream as part of the contract of the stream itself: incompatible data is forbidden at the source and easily detected in development. Your data products flow through directly into your lakehouse and are accessible as both a stream and a table.
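To make the idea of schema mapping and type conversion concrete, here is a minimal sketch of turning an Avro record schema (one of the formats Schema Registry can hold) into Iceberg-style field definitions. This is not Tableflow’s materializer, whose internals are not described here. The function name `to_iceberg_fields`, the sequential field IDs, and the restriction to primitive types are assumptions made for illustration.

```python
# Illustrative only: a tiny Avro -> Iceberg primitive type mapping.
# The real materializer also handles logical types, nested records, and evolution.
AVRO_TO_ICEBERG = {
    "boolean": "boolean",
    "int": "int",
    "long": "long",
    "float": "float",
    "double": "double",
    "string": "string",
    "bytes": "binary",
}


def to_iceberg_fields(avro_record: dict) -> list:
    """Map the fields of an Avro record schema to Iceberg-style field dicts."""
    fields = []
    for field_id, field in enumerate(avro_record["fields"], start=1):
        avro_type = field["type"]
        required = True
        # An Avro union such as ["null", "string"] becomes an optional field.
        if isinstance(avro_type, list):
            non_null = [t for t in avro_type if t != "null"]
            if len(non_null) != 1:
                raise ValueError(f"unsupported union for field {field['name']!r}")
            required = "null" not in avro_type
            avro_type = non_null[0]
        if not isinstance(avro_type, str) or avro_type not in AVRO_TO_ICEBERG:
            raise ValueError(f"unsupported type for field {field['name']!r}")
        fields.append(
            {
                "id": field_id,
                "name": field["name"],
                "required": required,
                "type": AVRO_TO_ICEBERG[avro_type],
            }
        )
    return fields


if __name__ == "__main__":
    schema = {
        "type": "record",
        "name": "Customer",
        "fields": [
            {"name": "customer_id", "type": "int"},
            {"name": "email", "type": ["null", "string"]},
        ],
    }
    for f in to_iceberg_fields(schema):
        print(f)
```

Failing loudly on an unsupported type, rather than guessing, mirrors the point above: incompatible data should be caught at the source, not discovered in the lakehouse.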
In addition to schema management, Tableflow also continuously compacts the small Parquet files generated by constantly streaming data into larger files to help maintain good read performance.
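As a rough illustration of what compaction planning involves, the sketch below groups small files into batches that each fit under a target size, so every batch can be rewritten as one larger file. The greedy strategy, the 128 MiB target, and the name `plan_compaction` are assumptions for the example, not a description of how Tableflow compacts.

```python
from typing import List

# Assumed target size for a compacted file; real systems make this configurable.
TARGET_BYTES = 128 * 1024 * 1024


def plan_compaction(file_sizes: List[int], target: int = TARGET_BYTES) -> List[List[int]]:
    """Greedily group file sizes so that each group fits within `target` bytes.

    A single file larger than `target` ends up in a group of its own.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in sorted(file_sizes, reverse=True):
        # Close the current group when adding this file would exceed the target.
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    small_files = [8 * 1024 * 1024] * 40  # forty 8 MiB files from a busy stream
    plan = plan_compaction(small_files)
    print(f"{len(small_files)} small files -> {len(plan)} compacted files")
```

Fewer, larger files mean fewer object-store requests and less per-file metadata for a query engine to read, which is where the read-performance benefit comes from.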
Currently, users can access their Apache Iceberg tables via the Iceberg REST catalog. All you need to do is copy the Apache Iceberg REST catalog endpoint and use a Confluent Cloud API key and secret as credentials and pass them to an Iceberg-compatible compute engine. We’re also working on supporting integrations with popular catalog services such as AWS Glue and Polaris Catalog.
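For a Python-based engine, the steps above might look like the following sketch, which uses the open source PyIceberg client. The endpoint URL, the catalog name `tableflow`, and the choice to pass the API key and secret through PyIceberg’s `credential` property (as `key:secret`) are assumptions. Check the early access documentation for the exact properties your engine expects.

```python
def rest_catalog_properties(endpoint: str, api_key: str, api_secret: str) -> dict:
    """Build PyIceberg properties for an Iceberg REST catalog.

    Assumption: the key and secret are supplied via the `credential` property.
    """
    return {
        "type": "rest",
        "uri": endpoint,
        "credential": f"{api_key}:{api_secret}",
    }


def open_catalog(properties: dict):
    """Open the catalog. Requires `pip install pyiceberg`; imported lazily."""
    from pyiceberg.catalog import load_catalog

    # "tableflow" is just a local label for this catalog configuration.
    return load_catalog("tableflow", **properties)


if __name__ == "__main__":
    props = rest_catalog_properties(
        endpoint="https://example.invalid/iceberg/catalog",  # placeholder endpoint
        api_key="MY_API_KEY",
        api_secret="MY_API_SECRET",
    )
    # Print the configuration without leaking the secret.
    print({k: ("***" if k == "credential" else v) for k, v in props.items()})
    # catalog = open_catalog(props)        # uncomment with real credentials
    # print(catalog.list_namespaces())
```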
The first version of Tableflow stores the data in Confluent Cloud. However, we know that being able to store long-term data in customer-owned storage is important. In the near future, Tableflow will have the option to store the Apache Iceberg metadata and Parquet files directly in your own object storage.
Unification of batch and stream processing unlocks significant value—organizations can simplify their data infrastructure, reducing the complexity and cost associated with maintaining separate systems.
Apache Flink has done an excellent job of unifying primitives between batch and streaming engines, and Confluent Cloud takes full advantage of this. Flink SQL lets users create stream processing jobs in a declarative way. Common workloads include joining multiple streams, data masking, deduplication, and many others. Let’s take a look at how this works with Tableflow.
Let’s create a simple table that contains some basic customer information such as customerID, first and last name, email, and credit card number and insert some records into it.
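The statements for this step are not reproduced here, so the following is a minimal sketch of what they could look like. The Flink SQL is held in Python strings so it can be pasted into a Flink SQL workspace or submitted through whichever client you use. The table name `customers`, the column names and types, and the sample rows are all assumptions.

```python
# Flink SQL for the example table; names and types are illustrative assumptions.
CREATE_CUSTOMERS = """
CREATE TABLE customers (
  customer_id INT,
  first_name STRING,
  last_name STRING,
  email STRING,
  credit_card_number STRING
);
""".strip()

# Made-up sample rows; the card numbers are placeholders, not real accounts.
INSERT_CUSTOMERS = """
INSERT INTO customers VALUES
  (1, 'Ada', 'Lovelace', 'ada@example.com', '4111-1111-1111-1111'),
  (2, 'Alan', 'Turing', 'alan@example.com', '5500-0000-0000-0004');
""".strip()

if __name__ == "__main__":
    # Print the statements in the order they would be run.
    for statement in (CREATE_CUSTOMERS, INSERT_CUSTOMERS):
        print(statement, end="\n\n")
```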
Creating a table conveniently creates a Kafka topic along with the associated schema in Schema Registry. Tableflow automatically takes the schema generated and applies it to the associated Apache Iceberg table.
To make this data accessible in your lakehouse and also mask the credit card field before making it accessible to end users, you can use one of our new Flink Actions to easily create the workflow and land it in a topic of your choice.
As the data is flowing into your source table, Flink Actions automatically masks that credit card field and inserts it into the destination table Tableflow is enabled on, so it’s immediately accessible by Apache Iceberg compatible tooling.
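To show what the masking step does to each record, here is a small sketch of the transformation in plain Python. How Flink Actions actually masks the field is not specified here. Keeping the last four digits visible, the function name `mask_card`, and the record layout are assumptions for illustration.

```python
def mask_card(number: str, visible: int = 4) -> str:
    """Replace all but the last `visible` digits with '*', keeping separators."""
    digits_total = sum(ch.isdigit() for ch in number)
    to_mask = max(digits_total - visible, 0)
    out = []
    for ch in number:
        if ch.isdigit() and to_mask > 0:
            out.append("*")
            to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)


def mask_record(record: dict) -> dict:
    """Return a copy of the record with the credit card field masked."""
    return {**record, "credit_card_number": mask_card(record["credit_card_number"])}


if __name__ == "__main__":
    source = {
        "customer_id": 1,
        "first_name": "Ada",
        "last_name": "Lovelace",
        "email": "ada@example.com",
        "credit_card_number": "4111-1111-1111-1111",
    }
    # The masked copy is what would land in the destination table.
    print(mask_record(source))
```

Masking before the data reaches the destination table means the raw card number never appears in the Iceberg table that analytical tools read.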
In addition to making stream processing easy, our Flink service knows exactly where the data is so any query can transparently run across both paradigms, whether it’s the freshest data in stream or historical data sitting in Apache Iceberg tables. Not only do you not have to think about the distinction between real-time and historical data, but this also makes once-tedious tasks like reprocessing historical data much easier.
We’re excited about what the launch of Tableflow and the unification of batch and streaming data mean for both your data streaming platform (DSP) and your lakehouse:
For the lakehouse, it means that data is always fresh: it arrives, is processed, and is populated in real time, ready for immediate queries.
For the DSP, your stream processing jobs have access to the full historical dataset, so reprocessing old data or running joins becomes much easier.
Furthermore, we’re simplifying the data ingest pipeline and making it less fragile and cumbersome so that any data product you define will show up in your lake with no manual translation needed.
As mentioned before, Tableflow is a massive step towards bridging the operational and analytical divide in organizations. We’re making it easier than ever to feed data into your data lake, warehouse, or analytics engine, and our partners have taken notice.
It’s early days for Tableflow but we have ambitious goals and can’t wait for you to try it out. Tableflow is currently in private early access. If you’re interested in learning more or trying this out, please apply here.