A Glide, Skip or a Jump: Efficiently Stream Data into Your Medallion Architecture with Apache Hudi

« Current 2023

The medallion architecture graduates raw data from operational systems into progressively refined tables through a series of stages, ultimately serving analytics from gold tables. Many teams want to build this architecture incrementally from streaming data sources like Kafka, but that is very challenging with the technologies currently available on lakehouses: many of them cannot update records efficiently, or cannot process incremental data without recomputing the full dataset, which makes low-latency tables hard to serve. Apache Hudi is a transactional data lake platform with full mutability support, including streaming upserts, and it provides a powerful incremental processing framework. Apache Hudi powers the largest transactional data lakes in the industry, differentiating on fast upserts and on change streams that let pipelines process and serve only the changed records.

To further improve upsert performance, Hudi now supports a new record-level index that deterministically maps each record key to its file location, making lookups orders of magnitude faster. As a result, Hudi speeds up computationally expensive MERGE operations even further by avoiding full table scans. On the query side, Hudi now supports database-style change data capture (CDC) with before and after images, so that insert, update, and delete change records can be chained from bronze to silver to gold tables.
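
To make the record-level index idea concrete, here is a minimal toy sketch in plain Python. It is not Hudi's API or implementation: the in-memory `files` layout and the helper names (`locate_by_scan`, `build_record_index`, `upsert`) are assumptions made purely for illustration. It only shows why a key-to-file mapping lets an upsert go straight to one file instead of scanning every file.

```python
from typing import Dict, Optional, Tuple

# Toy stand-in for a table: file id -> {record key -> row}.
# This layout is hypothetical and only illustrates the lookup idea.
Files = Dict[str, Dict[str, dict]]


def locate_by_scan(files: Files, key: str) -> Tuple[Optional[str], int]:
    """Baseline without an index: open files one by one until the key is found.

    Returns (file id or None, number of files opened).
    """
    opened = 0
    for file_id, records in files.items():
        opened += 1
        if key in records:
            return file_id, opened
    return None, opened


def build_record_index(files: Files) -> Dict[str, str]:
    """Build a record-level index: record key -> id of the file holding it."""
    return {key: file_id for file_id, records in files.items() for key in records}


def upsert(files: Files, index: Dict[str, str], key: str, row: dict,
           default_file: str) -> str:
    """Upsert one record using the index to find its file directly.

    Existing keys are updated in place in their current file; new keys
    go to `default_file`. Returns the file id that was written.
    """
    file_id = index.get(key, default_file)
    files.setdefault(file_id, {})[key] = row
    index[key] = file_id
    return file_id


files: Files = {
    "file-0": {"u1": {"city": "Austin"}, "u2": {"city": "Boston"}},
    "file-1": {"u3": {"city": "Chicago"}},
    "file-2": {"u4": {"city": "Denver"}},
}
index = build_record_index(files)

# Without the index, finding "u4" opens all three files.
scan_result = locate_by_scan(files, "u4")

# With the index, the update to "u3" is routed straight to its file.
updated_file = upsert(files, index, "u3", {"city": "Seattle"}, default_file="file-2")
```

In this toy, the scan cost grows with the number of files, while the indexed lookup is a single dictionary access. A real index also has to be stored and kept consistent with the table's writes, which this sketch ignores.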

In this talk, attendees will walk away with:

The current challenges of building a medallion architecture at low latency
How the record index and incremental updates work with Apache Hudi
How the new Hudi CDC feature unlocks incremental processing on the lake
How you can efficiently build a medallion architecture by avoiding expensive operations
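
As a rough illustration of how change records with before and after images can drive incremental processing, the following plain-Python sketch keeps a downstream aggregate up to date without recomputing it. It does not use Hudi or its CDC format: the functions `apply_with_cdc` and `update_gold`, the `"i"`/`"u"`/`"d"` labels, and the per-city count are all hypothetical choices for this example.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, str]
Change = Tuple[str, Optional[Row]]  # (record key, new row or None for delete)


def apply_with_cdc(table: Dict[str, Row], changes: List[Change]) -> List[dict]:
    """Apply upserts and deletes to `table`; return change records.

    Each change record carries the row image before and after the change.
    """
    cdc = []
    for key, new_row in changes:
        before = table.get(key)
        if new_row is None:
            if before is None:
                continue  # deleting a missing key changes nothing
            del table[key]
            op = "d"
        else:
            if before == new_row:
                continue  # no-op write, nothing to propagate
            op = "u" if before is not None else "i"
            table[key] = new_row
        cdc.append({"op": op, "key": key, "before": before, "after": new_row})
    return cdc


def update_gold(gold: Dict[str, int], cdc: List[dict]) -> None:
    """Incrementally maintain a users-per-city count from change records.

    The before image is subtracted and the after image is added, so the
    aggregate never has to be recomputed from the full upstream table.
    """
    for change in cdc:
        if change["before"] is not None:
            city = change["before"]["city"]
            gold[city] -= 1
            if gold[city] == 0:
                del gold[city]
        if change["after"] is not None:
            city = change["after"]["city"]
            gold[city] = gold.get(city, 0) + 1


silver: Dict[str, Row] = {}
gold: Dict[str, int] = {}

# First batch: two inserts.
cdc1 = apply_with_cdc(silver, [("u1", {"city": "Austin"}), ("u2", {"city": "Austin"})])
update_gold(gold, cdc1)
gold_after_batch1 = dict(gold)

# Second batch: one update and one delete.
cdc2 = apply_with_cdc(silver, [("u1", {"city": "Boston"}), ("u2", None)])
update_gold(gold, cdc2)
```

The before image is what makes the update and the delete cheap to propagate here: without it, the downstream step would not know which city's count to decrement and would have to rescan the upstream table.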

Presenter

Nadine Farah

Onehouse

Nadine Farah is leading Onehouse's developer initiatives. She's passionate about bridging engineering, product & marketing to help drive product adoption. She previously led Rockset's developer initiatives, focusing on building technical content to drive developer adoption for real-time analytics. At Bose, she contributed to the watchOS SDK & worked with partners to embrace spatial audio in the music and gaming industries.

Presenter

Ethan Guo

Onehouse

Ethan Guo is a Database Engineer at Onehouse, working on building and optimizing the next generation of the lakehouse. He's an Apache Hudi committer and PMC member, passionate about stream processing and lakehouse architecture. Previously, he was a Senior Software Engineer at Uber, building mobile observability and data pipelines for monitoring Uber's mobile network performance in production on a global scale.
