Streaming Data Lakes using Kafka Connect + Apache Hudi

Kafka Summit Americas 2021

Apache Hudi is a data lake platform that provides streaming primitives (upserts, deletes, change streams) on top of data lake storage. Hudi powers very large data lakes at Uber, Robinhood, and other companies, and comes pre-installed on four major cloud platforms.

Hudi supports exactly-once, near-real-time data ingestion from Apache Kafka to cloud storage, and is typically used in place of an S3/HDFS sink connector to gain transactions and mutability. While this approach is scalable and battle-tested, it can only ingest data in mini-batches, which lowers data freshness. In this talk, we introduce a Kafka Connect sink connector for Apache Hudi that writes data straight into Hudi's log format, making the data immediately queryable, while Hudi's table services, such as indexing, compaction, and clustering, work behind the scenes to reorganize the data for better query performance.
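To make the idea of "write to a log, query immediately, compact later" concrete, here is a minimal toy sketch in Python. It is not Hudi's API or file format: the class `ToyMergeOnReadTable` and its methods are hypothetical names invented for illustration, and the whole table lives in memory. It only shows the general pattern the abstract describes, where writes are appended to a log, readers merge the log with compacted base data, and a background service later folds the log into the base.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


# Toy model only: hypothetical names, not Hudi's API or storage layout.
@dataclass
class ToyMergeOnReadTable:
    # Compacted records, keyed by record key.
    base: Dict[str, dict] = field(default_factory=dict)
    # Append-only log of (key, record) entries; a None record marks a delete.
    log: List[Tuple[str, Optional[dict]]] = field(default_factory=list)

    def upsert(self, key: str, record: dict) -> None:
        """Write path: append the new version of the record to the log."""
        self.log.append((key, record))

    def delete(self, key: str) -> None:
        """Write path: append a delete marker to the log."""
        self.log.append((key, None))

    def snapshot(self) -> Dict[str, dict]:
        """Read path: merge base and log, so new writes are visible at once."""
        view = dict(self.base)
        for key, record in self.log:
            if record is None:
                view.pop(key, None)
            else:
                view[key] = record
        return view

    def compact(self) -> None:
        """Background service: fold the log into the base, then clear the log."""
        self.base = self.snapshot()
        self.log.clear()


table = ToyMergeOnReadTable()
table.upsert("order-1", {"status": "placed"})
table.upsert("order-1", {"status": "shipped"})
table.upsert("order-2", {"status": "placed"})
table.delete("order-2")

before = table.snapshot()  # merged view, available before any compaction
table.compact()
after = table.snapshot()   # same view, now served from the base alone
```

In this sketch, compaction changes only where the data lives, not what a reader sees, which is the property that lets such reorganization run behind the scenes.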

Presenter

Vinoth Chandar

Apache Software Foundation

Vinoth Chandar is the original creator & VP of the Apache Hudi project, which has changed the face of data lake architectures over the past few years. Vinoth has a keen interest in unified data storage/processing architectures. He drove various efforts around stream processing/Kafka at Confluent. In the past, Vinoth has built large-scale, mission-critical infrastructure systems at companies like Uber and LinkedIn.

Presenter

Balaji Varadarajan

Robinhood

Balaji Varadarajan is a Sr. Staff Engineer at Robinhood, where he broadly oversees Robinhood’s data lake. He is also an Apache Hudi PMC member. Previously, he was a tech lead on Uber's data ingestion team and one of the lead engineers on LinkedIn’s Databus change capture system. Balaji’s interests lie in distributed data systems.
