
Introducing Tableflow


We’re excited to talk about our vision for Tableflow, which makes it push-button simple to take Apache Kafka® data and feed it directly into your data lake, warehouse, or analytics engine as Apache Iceberg® tables. Making operational data accessible to the analytical world is traditionally a complex, expensive, and brittle process, and we believe there is a better way to unify the operational and analytical estates.

Tableflow removes all this error-prone, duplicative work and helps convert Kafka topics and their associated schemas to Iceberg tables in one click. This is central to Confluent’s vision of building the world’s leading data streaming platform, one that fuels any operational or analytical workload with real-time data products.

Our partners are just as excited about Tableflow. At Kafka Summit London, we announced our Tableflow vision with a set of launch partners such as Snowflake, Amazon Athena, Dremio, Imply, Starburst, OneHouse, Tabular, and more. These partners have already built heavily around Iceberg, and Tableflow represents an incredibly convenient way of getting data into their platforms.

Tableflow is currently in early access, but we’re excited to share more about why we're investing here, what we’re building, and where we’re going. 

The rise of Apache Iceberg in the analytical estate

We wanted to set some context since it might seem a little odd for the data streaming company to dive into the open-table space. Data in most organizations are typically split into two estates:

  • Operational estate: The SaaS apps, ERPs, custom applications, etc. that serve the needs of applications to transact with customers in real time

  • Analytical estate: The data warehouses, lakehouses, AI/ML platforms, and other custom batch workloads that support business analysis and reporting

The operational estate is where streaming has really grown up, built up by custom applications, SaaS tools, and other business-critical use cases. The analytical estate is a little different. It’s also often fed by streams but is a collection of different warehouses, query engines, reporting layers, and AI/ML platforms built on historical tables. In the past, there has been a tight coupling of tables (storage) and query engines. In recent years, we’ve witnessed the rise of “headless” data infrastructure where companies are building a more open lakehouse in cloud object storage that is accessible by any number of tools.

Just like the Apache Kafka API has evolved to be the de facto open standard for data streaming, we’re seeing Apache Iceberg evolve into the de facto open-table standard for large-scale datasets stored in lakehouses. We’ve seen its ecosystem grow with robust tooling and support from compute engines such as Apache Spark®, Snowflake, Amazon Athena, Dremio, Trino, Apache Druid®, and many others. We’ve also seen a large uptick in its adoption within our own user base, but we are open to supporting other formats if there is significant demand. For those who are currently interested in other table formats, Apache XTable (incubating) is a solution that may work for cross-generating metadata. We believe the rise of open-table formats and the “headless” data infrastructure is being driven by the needs of data engineers evolving beyond the tight coupling of table to computing platform. These factors made Apache Iceberg support a natural first choice for us. 

Today’s operational and analytical divide

The one thing that naturally ties the operational and analytical estates together is that typically the same data needs to be accessible between them. The challenge here is that they usually want to access the same data in different ways. Organizations have been doomed to do the same work over and over again, joining apps together and ETLing data into lakehouses in a never-ending war against complexity. 

Streaming operational data with Kafka to a lakehouse is often constructed with a bunch of complex and brittle jobs that require: 

  1. Setting up the infrastructure to consume the data from Apache Kafka, including:

    1. Configuring consumer groups or connectors

    2. Ensuring the consumer groups are balanced and properly sized for the throughput and number of partitions in your topic

  2. Feeding the data through a series of jobs that:

    1. Convert data into a universally accessible format like Apache Parquet

    2. Hook into Schema Registry or governance tooling that understands the expected schema, evolves the schema if necessary, and handles type conversions

  3. Constantly compacting and cleaning up the small files that are generated from continuous streaming data as they land in object storage to maintain acceptable read performance

  4. If the data is a change log, materializing and applying the changes so the data is more useful to downstream users
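To make the last step concrete, here is a minimal sketch of what "materializing and applying the changes" means: folding an ordered change log into the latest row per key. The event shape and the operation names ("upsert", "delete") are hypothetical and chosen only for illustration. They are not a Confluent or connector wire format, and this is not how Tableflow does it internally.

```python
from typing import Any, Dict, Iterable, Optional, Tuple

# Hypothetical change event: (operation, key, row).
# The operation names are illustrative, not a real wire format.
ChangeEvent = Tuple[str, str, Optional[Dict[str, Any]]]


def materialize(changes: Iterable[ChangeEvent]) -> Dict[str, Dict[str, Any]]:
    """Fold an ordered change log into the latest row for each key."""
    table: Dict[str, Dict[str, Any]] = {}
    for op, key, row in changes:
        if op == "upsert":
            if row is None:
                raise ValueError(f"upsert for key {key!r} has no row")
            table[key] = row  # insert a new row or overwrite the old one
        elif op == "delete":
            table.pop(key, None)  # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return table


changelog = [
    ("upsert", "c1", {"email": "a@example.com"}),
    ("upsert", "c2", {"email": "b@example.com"}),
    ("upsert", "c1", {"email": "a.new@example.com"}),
    ("delete", "c2", None),
]
current = materialize(changelog)
```

A real pipeline has to do this continuously, at scale, and against files in object storage rather than an in-memory dictionary, which is where much of the engineering cost comes from.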

This is a lot of engineering time, compute, and cost just to get one stream into a raw but usable state in your lakehouse. In addition to these factors, there is still business-specific logic, clean-up, and loading of the data into downstream tables to be served to end users. We believe we can do much better here. 

Confluent data is already directly represented in object storage. This means there is traditionally a fair amount of data duplication between the stream and the lakehouse. So rather than forcing it to be manually read back and copied into another S3 bucket in Iceberg format, what if we just unified these two things? 

Tableflow: A massive step towards unifying the operational and analytical divide

Tableflow allows Confluent users to easily materialize their Kafka topics, and associated schemas, into Apache Iceberg tables. Our goal with Tableflow is to make it as easy as possible to feed your lakehouse.

Tableflow uses innovations in our Kora Storage Layer that give us the flexibility to take Kafka segments and write them out to other formats, in this case, Parquet files. Tableflow also uses a new metadata materializer behind the scenes that taps into Confluent’s Schema Registry to generate Apache Iceberg metadata while handling schema mapping, schema evolution, and type conversions. There is no more need for manual mappings that break every time the upstream app sends something new. Data quality rules are enforced upstream as part of the contract of the stream itself: incompatible data is forbidden at the source and easily detected in development. Your data products flow through directly into your lakehouse and are accessible as both a stream and a table.
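As a rough illustration of the kind of schema mapping described above, the sketch below converts a flat Avro record schema into a list of Iceberg-style fields. It is an assumption-laden toy, not the Tableflow materializer: it handles only Avro primitive types and optional fields expressed as a union with "null", and the function name and sequential field IDs are invented for this example.

```python
from typing import Any, Dict, List

# Avro primitive type name -> Iceberg primitive type name.
AVRO_TO_ICEBERG = {
    "boolean": "boolean",
    "int": "int",
    "long": "long",
    "float": "float",
    "double": "double",
    "string": "string",
    "bytes": "binary",
}


def avro_record_to_iceberg_fields(record: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Map a flat Avro record schema to Iceberg-style field descriptions."""
    fields = []
    for field_id, field in enumerate(record["fields"], start=1):
        avro_type = field["type"]
        required = True
        # Avro models an optional field as a union that includes "null".
        if isinstance(avro_type, list):
            non_null = [t for t in avro_type if t != "null"]
            if len(non_null) != 1:
                raise ValueError(f"unsupported union in field {field['name']!r}")
            required = "null" not in avro_type
            avro_type = non_null[0]
        if not isinstance(avro_type, str) or avro_type not in AVRO_TO_ICEBERG:
            raise ValueError(f"unsupported type in field {field['name']!r}")
        fields.append({
            "id": field_id,  # illustrative: IDs assigned in declaration order
            "name": field["name"],
            "type": AVRO_TO_ICEBERG[avro_type],
            "required": required,
        })
    return fields


customer_schema = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "customer_id", "type": "long"},
        {"name": "email", "type": ["null", "string"]},
        {"name": "avatar", "type": "bytes"},
    ],
}
iceberg_fields = avro_record_to_iceberg_fields(customer_schema)
```

Nested records, logical types such as decimals and timestamps, and schema evolution are exactly the cases that make hand-written mappings brittle, and they are left out here on purpose.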

In addition to schema management, Tableflow also continuously compacts the small Parquet files generated by constantly streaming data into larger files to help maintain good read performance.
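To show why compaction is a planning problem and not just "merge everything", here is a small sketch that groups undersized files into rewrite batches close to a target size, using a first-fit-decreasing heuristic. The heuristic, the 128 MiB target, and the file names are assumptions made for illustration. The source does not describe the strategy Tableflow actually uses.

```python
from typing import List, Tuple

DataFile = Tuple[str, int]  # (path, size in bytes)


def plan_compaction(files: List[DataFile], target_bytes: int) -> List[List[str]]:
    """Group files smaller than target_bytes into batches of at most target_bytes."""
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    # Largest first, so big files claim a batch and small ones fill the gaps.
    small = sorted(
        (f for f in files if f[1] < target_bytes),
        key=lambda f: f[1],
        reverse=True,
    )
    groups: List[List[str]] = []
    sizes: List[int] = []
    for path, size in small:
        for i, used in enumerate(sizes):
            if used + size <= target_bytes:
                groups[i].append(path)
                sizes[i] += size
                break
        else:
            # No existing batch has room, so start a new one.
            groups.append([path])
            sizes.append(size)
    # Rewriting a batch that holds a single file gains nothing.
    return [g for g in groups if len(g) > 1]


MIB = 1024 * 1024
files = [
    ("a.parquet", 5 * MIB),
    ("b.parquet", 120 * MIB),
    ("c.parquet", 3 * MIB),
    ("d.parquet", 600 * MIB),  # already large, left alone
    ("e.parquet", 8 * MIB),
]
plan = plan_compaction(files, target_bytes=128 * MIB)
```

With these inputs the plan contains two batches: one pairing b.parquet with e.parquet, and one pairing a.parquet with c.parquet.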

Currently, users can access their Apache Iceberg tables via the Iceberg REST catalog. All you need to do is copy the Iceberg REST catalog endpoint and pass it, along with a Confluent Cloud API key and secret as credentials, to an Iceberg-compatible compute engine. We’re also working on supporting integrations with popular catalog services such as AWS Glue and Polaris Catalog.
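As one possible client-side sketch, the snippet below builds REST catalog properties in the form that PyIceberg's load_catalog accepts ("type", "uri", "credential"). Several things here are assumptions: that PyIceberg is the client, that the API key and secret are passed as a "key:secret" credential, and the placeholder endpoint, which is not a real Tableflow URL. Check the product documentation for the exact settings your engine needs.

```python
from typing import Dict


def rest_catalog_properties(endpoint: str, api_key: str, api_secret: str) -> Dict[str, str]:
    """Build PyIceberg-style properties for an Iceberg REST catalog."""
    if not endpoint.startswith("https://"):
        raise ValueError("expected an https:// REST catalog endpoint")
    return {
        "type": "rest",
        "uri": endpoint,
        # Assumption: key and secret are supplied as "key:secret".
        "credential": f"{api_key}:{api_secret}",
    }


def connect(name: str, properties: Dict[str, str]):
    """Open the catalog. Requires the optional pyiceberg package."""
    from pyiceberg.catalog import load_catalog  # pip install pyiceberg

    return load_catalog(name, **properties)


# Placeholder values; substitute the endpoint and credentials you copied.
props = rest_catalog_properties(
    "https://catalog.example.invalid/iceberg",
    "MY_API_KEY",
    "MY_API_SECRET",
)
# catalog = connect("tableflow", props)  # needs network access and real credentials
```

Keeping property construction separate from the connection call makes it easy to reuse the same values for other engines that speak the Iceberg REST catalog protocol.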

The first version of Tableflow stores the data in Confluent Cloud. However, we know that being able to store long-term data in customer-owned storage is important. In the near future, Tableflow will have the option to store the Apache Iceberg metadata and Parquet files directly in your own object storage.

Unifying streaming and batch processing with our data streaming platform

Unification of batch and stream processing unlocks significant value—organizations can simplify their data infrastructure, reducing the complexity and cost associated with maintaining separate systems.

Apache Flink® has done an excellent job of unifying primitives between batch and streaming engines, and Confluent Cloud takes full advantage of this. Flink SQL lets users create stream processing jobs in a declarative way. Common workloads include joining multiple streams, data masking, deduplication, and many others. Let’s take a look at how this works with Tableflow.

Let’s create a simple table that contains some basic customer information such as customerID, first and last name, email, and credit card number and insert some records into it. 

Creating a table conveniently creates a Kafka topic along with the associated schema in Schema Registry. Tableflow automatically takes the schema generated and applies it to the associated Apache Iceberg table. 

To make this data accessible in your lakehouse and also mask the credit card field before making it accessible to end users, you can use one of our new Flink Actions to easily create the workflow and land it in a topic of your choice.

As the data flows into your source table, Flink Actions automatically masks the credit card field and inserts the result into the destination table on which Tableflow is enabled, so it’s immediately accessible to Iceberg-compatible tooling.
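For readers who want to see the transformation itself, here is a plain Python sketch of the masking step applied to one customer record. It is not the Flink Actions implementation, which runs as a managed Flink job. The field names (customer_id, credit_card_number, and so on) and the choice to keep the last four digits visible are assumptions made for this example.

```python
from typing import Any, Dict


def mask_credit_card(number: str, visible: int = 4, mask_char: str = "*") -> str:
    """Mask every digit except the last `visible` ones, keeping separators."""
    total_digits = sum(1 for c in number if c.isdigit())
    keep_from = max(total_digits - visible, 0)  # index of first visible digit
    seen = 0
    out = []
    for c in number:
        if c.isdigit():
            out.append(c if seen >= keep_from else mask_char)
            seen += 1
        else:
            out.append(c)  # dashes and spaces pass through unchanged
    return "".join(out)


def mask_record(record: Dict[str, Any], field: str = "credit_card_number") -> Dict[str, Any]:
    """Return a copy of the record with the given field masked."""
    masked = dict(record)
    masked[field] = mask_credit_card(str(record[field]))
    return masked


source_row = {
    "customer_id": 1,
    "first_name": "Ada",
    "last_name": "Lovelace",
    "email": "ada@example.com",
    "credit_card_number": "4111-1111-1111-1111",
}
destination_row = mask_record(source_row)
```

The source row is left untouched and only the copy written to the destination carries the masked value, which mirrors the source table and destination table split described above.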

In addition to making stream processing easy, our Flink service knows exactly where the data is, so any query can transparently run across both paradigms, whether it’s the freshest data in the stream or historical data sitting in Apache Iceberg tables. Not only do you not have to think about the distinction between real-time and historical data, but once-tedious tasks like reprocessing historical data also become much easier.

We’re excited about what the launch of Tableflow and the unification of batch and streaming data mean for both your data streaming platform (DSP) and your lakehouse:

  • For the lakehouse, data is always fresh: it arrives, is processed, and is populated in real time, ready for immediate queries.

  • For the DSP, your stream processing jobs have access to the full historical dataset, so reprocessing old data and performing joins become much easier.

Furthermore, we’re simplifying the data ingest pipeline and making it less fragile and cumbersome so that any data product you define will show up in your lake with no manual translation needed. 

Tableflow Partners and Next Steps

As mentioned before, Tableflow is a massive step towards unifying the operational and analytical divide in organizations. We’re making it easier than ever to feed data into your data lake, warehouse, or analytics engine—and our partners have taken notice. 

At Kafka Summit London, we announced our Tableflow vision with a set of launch partners such as Snowflake, Amazon Athena, Dremio, Imply, Starburst, OneHouse, and more. These partners have already built heavily around Iceberg, and Tableflow represents an incredibly convenient way of getting data into their platforms. 

It’s early days for Tableflow, but we have ambitious goals and can’t wait for you to try it out. Tableflow is currently in private early access. If you’re interested in learning more or trying this out, please apply here.

Start building with Confluent Cloud

Ready to get started? Sign up for a free trial of Confluent Cloud to explore all of our new features. New sign-ups receive $400 to spend within Confluent Cloud during their first 30 days. Use the code CL60BLOG for an additional $60 of free usage.*

The preceding outlines our general product direction and is not a commitment to deliver any material, code, or functionality. The development, release, timing, and pricing of any features or functionality described may change. Customers should make their purchase decisions based upon services, features, and functions that are currently available.
 
Confluent and associated marks are trademarks or registered trademarks of Confluent, Inc.
 
Apache®, Apache Kafka®, and Apache Flink® are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by the Apache Software Foundation is implied by the use of these marks. All other trademarks are the property of their respective owners.

  • Marc Selwan is the staff product manager for the Kora Storage team at Confluent. Prior to Confluent, Marc held product and customer engineering roles at DataStax, working on storage and indexing engines for Apache Cassandra.
