Designing Apache Hudi for Incremental Processing

Current 2022

Back in 2016, Apache Hudi brought transactions and change capture to data lakes, an approach that is today referred to as the Lakehouse architecture. In this session, we first introduce Apache Hudi and the key technology gaps it fills in the modern data architecture. Bridging traditional data lakes and warehouses, Hudi helps realize the Lakehouse vision by bringing transactions, optimized table metadata, and powerful storage layout optimizations to data lakes, moving them closer to today's cloud warehouses. Viewed through a data engineering lens, Hudi also plays a key unifying role between the batch and stream processing worlds by acting as a columnar, serverless "state store" for batch jobs. This ushers in what we call the incremental processing model, where batch jobs consume only new data and update or delete intermediate results in a Hudi table, instead of recomputing and rewriting the entire output like old-school big batch jobs.
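
To make the incremental processing model concrete, here is a minimal sketch in plain Python. It is a toy, in-memory illustration only: ToyTable, changes_since, and incremental_job are hypothetical names invented for this example and are not Hudi's API or storage format. It shows a downstream job that reads only the records changed since its last checkpoint and applies the matching upserts and deletes to a derived table, rather than rebuilding the whole output.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ToyTable:
    """Toy keyed table with a commit log. Hypothetical; NOT Hudi's API."""
    # key -> (value, commit time of last change); a value of None marks a delete
    records: Dict[str, Tuple[Optional[int], int]] = field(default_factory=dict)
    commits: List[int] = field(default_factory=list)

    def commit(self, upserts: Dict[str, int], deletes: Tuple[str, ...] = ()) -> int:
        """Apply one batch of upserts and deletes and return its commit time."""
        commit_time = len(self.commits) + 1
        for key, value in upserts.items():
            self.records[key] = (value, commit_time)
        for key in deletes:
            self.records[key] = (None, commit_time)
        self.commits.append(commit_time)
        return commit_time

    def changes_since(self, commit_time: int) -> Dict[str, Optional[int]]:
        """Incremental pull: only records changed after the given commit time."""
        return {k: v for k, (v, t) in self.records.items() if t > commit_time}

    def snapshot(self) -> Dict[str, int]:
        """Latest view of the table, with deleted keys hidden."""
        return {k: v for k, (v, _) in self.records.items() if v is not None}


def incremental_job(source: ToyTable, derived: ToyTable, checkpoint: int) -> int:
    """Consume changes after `checkpoint`, update `derived`, return the new checkpoint."""
    changes = source.changes_since(checkpoint)
    # Assumed transformation for illustration: the derived value is double the source value.
    upserts = {k: v * 2 for k, v in changes.items() if v is not None}
    deletes = tuple(k for k, v in changes.items() if v is None)
    if changes:
        derived.commit(upserts, deletes)
    return source.commits[-1] if source.commits else checkpoint


source, derived = ToyTable(), ToyTable()
source.commit({"a": 1, "b": 2})
checkpoint = incremental_job(source, derived, 0)
first_result = derived.snapshot()

source.commit({"b": 5}, deletes=("a",))
checkpoint = incremental_job(source, derived, checkpoint)
second_result = derived.snapshot()
```

The second run touches only the two changed keys. In a real table the saving comes from the same idea applied at much larger scale, with the table itself tracking what changed in each commit.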

The rest of the talk is a deep dive into some of the time-tested design choices and tradeoffs in Hudi that help power some of the largest transactional data lakes on the planet today. We will start with a tour of the storage format design, including the data and metadata layouts and, of course, Hudi's timeline, an event log that is central to implementing ACID transactions and concurrency control. We will delve deeper into the practical concurrency control pitfalls in data lakes and show how Hudi's hybrid approach, which combines MVCC with optimistic concurrency control, lowers contention and unlocks minute-level, near real-time commits to Hudi tables. We will conclude with code examples that showcase Hudi's rich set of table services, which perform vital table management tasks such as cleaning older file versions, compacting delta logs into base files, dynamically re-clustering data for faster query performance, and, more recently, an indexing service that maintains Hudi's multi-modal indexing capabilities.
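
As a rough illustration of the optimistic concurrency control idea mentioned above, the following plain-Python sketch models a timeline as an ordered list of completed commits. ToyTimeline, begin, try_commit, and the "fg-N" file group names are all hypothetical and chosen for this example; this is not how Hudi implements its timeline, and it covers only the optimistic conflict check, not MVCC reads. A writer records where the timeline stood when it started, and at commit time it fails only if a commit completed in the meantime touched the same file groups.

```python
from dataclasses import dataclass, field
from typing import List, Set


class ConflictError(Exception):
    """Raised when two concurrent writers touched the same file group."""


@dataclass
class ToyTimeline:
    """Toy event log of completed commits. Hypothetical; NOT Hudi's timeline."""
    # Each completed commit is represented by the set of file groups it rewrote.
    completed: List[Set[str]] = field(default_factory=list)

    def begin(self) -> int:
        """A writer remembers how many commits were complete when it started."""
        return len(self.completed)

    def try_commit(self, started_at: int, touched: Set[str]) -> int:
        """Commit unless a commit completed after `started_at` overlaps `touched`."""
        for other in self.completed[started_at:]:
            overlap = other & touched
            if overlap:
                raise ConflictError(f"overlapping file groups: {sorted(overlap)}")
        self.completed.append(set(touched))
        return len(self.completed)


timeline = ToyTimeline()
# Three writers start concurrently against the same (empty) timeline.
w1, w2, w3 = timeline.begin(), timeline.begin(), timeline.begin()

first = timeline.try_commit(w1, {"fg-1"})   # no concurrent commits yet
second = timeline.try_commit(w2, {"fg-2"})  # disjoint file groups, so no conflict

try:
    timeline.try_commit(w3, {"fg-1"})       # overlaps the first writer's commit
    third_conflicted = False
except ConflictError:
    third_conflicted = True
```

Writers that touch disjoint file groups both succeed without taking locks up front, which is why this style of check keeps contention low when conflicts are rare.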

Presenter

Vinoth Chandar

Apache Software Foundation

Vinoth Chandar is the original creator & VP of the Apache Hudi project, which has changed the face of data lake architectures over the past few years. Vinoth has a keen interest in unified data storage/processing architectures. He drove various efforts around stream processing/Kafka at Confluent. In the past, Vinoth has built large-scale, mission-critical infrastructure systems at companies like Uber and LinkedIn.

Presenter

Ethan Guo

Onehouse

Ethan Guo is a Database Engineer at Onehouse, working on building and optimizing the next generation of Lakehouse. He's an Apache Hudi committer and PMC member, passionate about stream processing and Lakehouse architecture. Previously, he was a Senior Software Engineer at Uber, building mobile observability and data pipelines for monitoring Uber's mobile network performance in production on a global scale.
