
Unify Streaming and Analytical Data with Apache Iceberg®, Confluent Tableflow, and Amazon SageMaker® Lakehouse


Earlier this year, we unveiled our vision for Tableflow to feed Apache Kafka® streaming data into data lakes, warehouses, or analytical engines with a single click. Since then, many customers have been exploring, experimenting with, and providing valuable feedback on Tableflow Early Access. Our teams have worked tirelessly to incorporate this feedback and are excited to bring Tableflow Open Preview to you in the near future.

Today, we are thrilled to offer a sneak peek into the significant progress we've made, particularly on the integration of Confluent Tableflow with Amazon SageMaker® Lakehouse, enabling seamless materialization and consumption of Iceberg tables within the AWS analytics ecosystem. This powerful integration lets you effortlessly bring streaming data from Kafka into your data lake in Apache Iceberg® format and make it readily available to AWS analytics engines and open source tools through Amazon SageMaker Lakehouse and the AWS Glue Data Catalog.

The data landscape: streaming vs. analytical data

Modern enterprises often divide their data management and utilization into two distinct estates: the operational estate and the analytical estate.

The operational estate encompasses SaaS apps, ERPs, and other custom applications that support day-to-day business operations. This is where data streaming has evolved to let businesses transact with customers in real time, and Apache Kafka has become the standard for organizing and storing operational data as data streams.

The analytical estate consists of data warehouses, lakehouses, AI/ML platforms, and other custom batch workloads that support business analysis and reporting. In recent years, we’ve witnessed the rise of “headless” data infrastructure where companies are adopting open lakehouses in cloud object storage, accessible by a number of tools. Similar to how Apache Kafka has become the standard for data streaming, Apache Iceberg is emerging as the leading open-table standard for managing large-scale datasets in lakehouses.

Apache Iceberg: enabling data democratization at scale

Apache Iceberg brings the simplicity, reliability, and scalability of SQL tables to large-scale datasets in data lakes, warehouses, and lakehouses. It also addresses the limitations of traditional data lakes by offering expressive SQL for data management, ACID guarantees for data consistency, schema evolution, data compaction, hidden partitioning, and time travel for access to historical data.
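To make these features concrete, here is a minimal PySpark sketch (independent of Tableflow) that exercises schema evolution, time travel, and snapshot metadata on a hypothetical Iceberg table. The catalog, database, and table names are placeholders, and the Iceberg Spark runtime jar is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

# A rough sketch: a local Spark session with an Iceberg catalog (requires the
# iceberg-spark-runtime jar). Catalog, database, and table names are placeholders.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Schema evolution: add a column without rewriting existing data files.
spark.sql("ALTER TABLE demo.sales.orders ADD COLUMN discount_pct DOUBLE")

# Time travel: query the table as it existed at an earlier point in time.
spark.sql(
    "SELECT order_id, amount FROM demo.sales.orders "
    "TIMESTAMP AS OF '2024-12-01 00:00:00'"
).show()

# Inspect snapshot history via Iceberg's metadata tables.
spark.sql("SELECT snapshot_id, committed_at FROM demo.sales.orders.snapshots").show()
```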

Tableflow: unifying the operational and analytical divide

To perform analytics and generate business insights, the operational data you collect needs to reach your analytical estate. Essentially, you want to streamline the process of feeding your operational data in Kafka into Apache Iceberg tables, so it's ready to power analytics in data lakes or warehouses.

Feeding raw operational data from Kafka into data lakes and warehouses in Apache Iceberg format is a complex, expensive, and error-prone process that requires building custom data pipelines. In these pipelines, you need to transfer data (using sink connectors), clean it, manage schemas, materialize change data capture (CDC) streams, transform and compact data, and store it as Parquet files in the Iceberg table format.

This intricate workflow demands significant effort and expertise to ensure data consistency and usability.

What if you could eliminate all the hassle and have your Kafka topics automatically materialized into analytics-ready Apache Iceberg tables in your data lake or warehouse? That’s precisely what Tableflow allows you to do.

Tableflow revolutionizes the way you bring Kafka data into data lakes and data warehouses by seamlessly materializing Kafka topics as Apache Iceberg tables. With just a push of a button, you can transform and feed your Kafka data in Confluent Cloud into Iceberg tables in your data lake or data warehouse.

Tableflow eliminates the complexity of data transfer, cleanup, and preparation by automating tasks like schematization, type conversion, schema evolution, CDC stream materialization, Iceberg metadata publishing to catalogs, and table maintenance.

Seamless integration of Confluent Tableflow and Amazon SageMaker Lakehouse 

The unification of operational and analytical estates demands unified data management and governance. Amazon SageMaker Lakehouse unifies data across Amazon S3 data lakes and data warehouses, enabling seamless data access from Apache Iceberg-compatible AWS analytics engines and open source tools.

Tableflow seamlessly integrates with Amazon SageMaker Lakehouse, enabling the materialization of Kafka topics into Iceberg tables stored in S3, with AWS Glue Data Catalog serving as the Apache Iceberg catalog. You can enable Tableflow on any Kafka topic in Confluent Cloud and use AWS Glue Data Catalog as the Iceberg catalog for materialized tables. Once the tables are materialized, you can query them using AWS analytics engines such as Amazon Athena, Amazon Redshift, and Amazon EMR.

Since Amazon SageMaker Lakehouse natively integrates with many third-party data analytics and compute engines such as Apache Spark™, Apache Flink®, and Trino, your tables can also be consumed by those tools.

Tableflow and Amazon SageMaker Lakehouse in action 

The following walks through an example of using Tableflow with Amazon SageMaker Lakehouse.

Imagine you have operational data ingested into Confluent Cloud and stored in a Kafka topic. You now want to feed this data into your data lake, powered by S3 and using the Iceberg table format, so AWS analytics services can use it to perform analytics and generate business insights.

The following use case demonstrates capturing a CDC stream from a PostgreSQL database into a Kafka topic using a source connector. This Kafka topic is then materialized as an Iceberg table using AWS Glue Data Catalog, enabling downstream analytics engines to consume the data.
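Before enabling Tableflow, you may want to sanity-check that change events are arriving in the topic. Here is a minimal sketch using the confluent-kafka Python client; the bootstrap server, API key and secret, and the topic name (`postgres.public.orders`) are placeholders for your environment.

```python
from confluent_kafka import Consumer

# Minimal sketch: peek at the CDC events arriving in the Kafka topic before enabling
# Tableflow. Bootstrap server, API key/secret, and topic name are placeholders; if the
# topic uses Avro with Schema Registry, the printed values will be raw bytes.
consumer = Consumer({
    "bootstrap.servers": "<CONFLUENT_CLOUD_BOOTSTRAP_SERVER>",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<API_KEY>",
    "sasl.password": "<API_SECRET>",
    "group.id": "cdc-topic-inspector",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["postgres.public.orders"])  # hypothetical CDC topic name

try:
    for _ in range(10):
        msg = consumer.poll(timeout=5.0)
        if msg is None or msg.error():
            continue
        print(msg.key(), msg.value())
finally:
    consumer.close()
```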

To materialize Kafka data into Iceberg, simply enable Tableflow at the Kafka topic level and select S3 as the storage location. While setting up S3, you'll need to use AWS AssumeRole and establish the necessary provider integration within Confluent Cloud.

Since you plan to consume and manage these tables in AWS, you can use Amazon SageMaker Lakehouse as your unified data management platform for Iceberg tables.

To make the Tableflow-generated Iceberg tables accessible in Amazon SageMaker Lakehouse, configure AWS Glue Catalog integration in Tableflow.

With this setup, Tableflow automatically publishes Iceberg table metadata pointers to AWS Glue Data Catalog, enabling AWS analytics services or third-party compute engines compatible with Amazon SageMaker Lakehouse to access these tables. Once the integration is complete, you can discover the Iceberg tables materialized by Tableflow in AWS Glue Data Catalog.
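For example, here is a minimal boto3 sketch that lists the tables Tableflow has registered in the Glue database backing your catalog integration; the region and database name are placeholders.

```python
import boto3

# Minimal sketch: list the Iceberg tables that Tableflow has registered in the
# AWS Glue Data Catalog. The region and Glue database name are placeholders.
glue = boto3.client("glue", region_name="us-east-1")

paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="<your_tableflow_glue_database>"):
    for table in page["TableList"]:
        # Iceberg tables typically carry a metadata_location pointer in their parameters.
        print(table["Name"], table.get("Parameters", {}).get("metadata_location"))
```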

From there, you can use your preferred AWS analytics service to consume the Iceberg tables.

You can use Amazon Athena to run SQL queries on these Iceberg tables. By configuring Athena to use the AWS Glue Data Catalog as the Iceberg catalog, you can efficiently explore and analyze the data directly from your data lake.
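As a rough illustration, the following boto3 sketch submits an Athena query against a Tableflow-materialized table; the Glue database, table name (`orders`), and results bucket are placeholders.

```python
import time

import boto3

# Minimal sketch: run a SQL query against a Tableflow-materialized Iceberg table
# with Amazon Athena. Database, table, and results bucket are placeholders.
athena = boto3.client("athena", region_name="us-east-1")

submitted = athena.start_query_execution(
    QueryString="SELECT * FROM orders LIMIT 10",
    QueryExecutionContext={"Database": "<your_tableflow_glue_database>"},
    ResultConfiguration={"OutputLocation": "s3://<your-athena-results-bucket>/"},
)
query_id = submitted["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```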

If you're working with Amazon Redshift, you can configure it to integrate with AWS Glue Data Catalog as well. This setup allows Redshift to query the same Iceberg tables, enabling fast, scalable data processing and analysis across your datasets.
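Redshift reaches Glue Data Catalog tables through an external schema. Below is a hedged sketch using the Redshift Data API against a Redshift Serverless workgroup; the workgroup, database, IAM role ARN, Glue database, and table name (`orders`) are all placeholders, and result retrieval is omitted for brevity.

```python
import boto3

# Minimal sketch: expose the Glue database to Redshift as an external schema, then
# query the Tableflow-materialized Iceberg table through it. Statements run
# asynchronously; poll describe_statement / get_statement_result to fetch results.
redshift = boto3.client("redshift-data", region_name="us-east-1")

create_schema_sql = """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS tableflow
    FROM DATA CATALOG
    DATABASE '<your_tableflow_glue_database>'
    IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-glue-access-role>'
"""

for sql in (create_schema_sql, "SELECT COUNT(*) FROM tableflow.orders"):
    response = redshift.execute_statement(
        WorkgroupName="<your-redshift-serverless-workgroup>",
        Database="dev",
        Sql=sql,
    )
    print("Submitted statement:", response["Id"])
```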

Alternatively, if you prefer Apache Spark for data processing, you can use Amazon EMR to query the Iceberg tables. By configuring your EMR cluster to access the Glue Data Catalog, Spark jobs can directly interact with the Iceberg tables, allowing you to perform advanced transformations and analytics on the data.
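For instance, here is a minimal PySpark sketch of the kind of session configuration you might use on EMR to wire Iceberg to the Glue Data Catalog; the catalog name (`glue_catalog`), warehouse bucket, database, and table name are placeholders, and the Iceberg Spark runtime and AWS bundle jars are assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

# Minimal sketch: a Spark session on EMR that uses AWS Glue Data Catalog as an
# Iceberg catalog. Catalog name, warehouse bucket, and table names are placeholders.
spark = (
    SparkSession.builder
    .appName("tableflow-iceberg-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://<your-bucket>/warehouse/")
    .getOrCreate()
)

# Read the Tableflow-materialized Iceberg table and run a simple aggregation.
orders = spark.table("glue_catalog.<your_tableflow_glue_database>.orders")
orders.groupBy("status").count().show()
```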

You can also use Confluent Cloud for Apache Flink® to perform real-time stream processing by shifting processing "to the left," that is, closer to the data source. For example, you can build on the previous use case by cleansing and aggregating data before materializing it as an Iceberg table, streamlining complex and costly data preparation in downstream analytics systems.
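As a sketch of what that shift-left step could look like, the snippet below holds a hypothetical Flink SQL statement (table and column names invented for illustration) that you might run in a Confluent Cloud Flink SQL workspace to cleanse and aggregate events into a new topic before enabling Tableflow on it.

```python
# A hypothetical "shift-left" statement for a Confluent Cloud Flink SQL workspace:
# cleanse and aggregate raw order events into five-minute windows before enabling
# Tableflow on the resulting topic. Table and column names are invented for
# illustration; `$rowtime` refers to the Confluent Cloud system timestamp column.
FLINK_SQL = """
CREATE TABLE orders_by_region_5m AS
SELECT
    region,
    COUNT(*)    AS order_count,
    SUM(amount) AS total_amount,
    window_start,
    window_end
FROM TABLE(
    TUMBLE(TABLE orders, DESCRIPTOR($rowtime), INTERVAL '5' MINUTES)
)
WHERE amount IS NOT NULL  -- basic cleansing: drop malformed events
GROUP BY region, window_start, window_end
"""

print(FLINK_SQL)
```

The resulting `orders_by_region_5m` table is backed by a Kafka topic in Confluent Cloud, so you can enable Tableflow on it like any other topic and land the pre-aggregated data in Iceberg.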

In the following demo, we guide you through the steps to integrate Confluent Tableflow with Amazon SageMaker Lakehouse using AWS Glue Catalog integration.

In summary, this example demonstrates how Tableflow can seamlessly make your Kafka operational data available to your AWS analytics ecosystem with minimal effort, leveraging the capabilities of Confluent Tableflow and Amazon SageMaker Lakehouse.

Next steps

We’re thrilled about the future of integrating streaming more closely with the data lake on AWS and remain committed to our active involvement and contributions to the Apache Iceberg project and community.

Are you ready to discover how Tableflow seamlessly integrates with Amazon SageMaker Lakehouse to unify your operational and analytical data using Apache Iceberg®? We are currently offering Tableflow in early access. Apply now to join our early access program and be among the first to experience its powerful capabilities.

Apache®, Apache Flink®, Flink, the Flink logo, Apache Kafka®, Kafka, the Kafka logo, and Apache Iceberg® are trademarks of the Apache Software Foundation.

Amazon and all related marks are trademarks of Amazon.com, Inc. or its affiliates.

  • Kasun is a Senior Product Manager at Confluent, driving innovation in the Tableflow product. He has extensive expertise in data streaming and application integration and previously led product management for Azure Event Hubs at Microsoft. He is the author of the books gRPC: Up and Running and Microservices for the Enterprise, and has shared his insights as a speaker at conferences such as Current, KubeCon, and GOTO.
