Apache Hudi for Incremental Processing
What is Apache Hudi?
Definition
Apache Hudi (Hadoop Upserts Deletes and Incrementals) brings database-like record-level updates, deletes, and change tracking to data lakes stored on object storage such as S3.
✓ In Practice: Think of Hudi as bringing database UPDATE and DELETE capabilities to your data lake while keeping the cost and scale benefits of cheap object storage and columnar formats.
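As a rough illustration, an upsert is usually issued through Hudi's Spark datasource. This is a minimal sketch only; the table name, S3 path, and columns (uuid, ts, city) are hypothetical, and exact options can vary by Hudi and Spark version.

```python
# Minimal sketch of a Hudi upsert with PySpark.
# Table name, bucket, and columns (uuid, ts, city) are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert-sketch")
    # Hudi's Spark bundle must be on the classpath; Kryo is the recommended serializer.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Changed or new records to merge into the lake table.
updates = spark.createDataFrame(
    [("id-001", "2024-01-02 10:00:00", "berlin"),
     ("id-002", "2024-01-02 10:05:00", "tokyo")],
    ["uuid", "ts", "city"],
)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "uuid",   # primary key used by the index
    "hoodie.datasource.write.precombine.field": "ts",    # latest ts wins on key collisions
    "hoodie.datasource.write.operation": "upsert",       # insert new keys, update existing ones
}

# Append mode with operation=upsert rewrites only the files that contain matching keys.
(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3a://my-bucket/lake/orders"))
```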
💡 Key Takeaways
✓Hudi enables record-level upserts and deletes on data lake files by maintaining indexes that map primary keys to file locations
✓The timeline tracks all commits as immutable instants, providing snapshot isolation and enabling incremental queries that return only changed rows
✓Processing only changed data cuts compute time dramatically: Uber Eats cut a 12+ hour pipeline to under 4 hours while roughly halving costs
✓Hudi works on standard object storage like S3 with columnar Parquet, avoiding expensive proprietary storage while gaining update capabilities
✓Queries can read the latest snapshot, read only the read-optimized base files (tolerating some staleness), or pull incremental changes since the last checkpoint, as in the sketch below
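A hedged sketch of the incremental query style: ask only for rows committed after a stored timeline instant, so downstream jobs process a delta instead of rescanning the table. The begin-instant value and table path are placeholders.

```python
# Sketch of an incremental read: pull only rows that changed after a given commit instant.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # same Hudi-enabled session as in the sketch above

incr_options = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "20240101000000",  # last processed checkpoint
}

# Returns only rows written by commits after the given instant on the timeline.
changes = (spark.read.format("hudi")
    .options(**incr_options)
    .load("s3a://my-bucket/lake/orders"))

changes.show()
```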
📌 Examples
1. A 10 TB table with 5% daily changes: a traditional approach rescans all 10 TB and takes 10+ hours; Hudi processes only the 500 GB that changed
2. Uber Eats menu table: 11 billion records, 500 million changing daily. Pipeline time dropped from 12+ hours to under 4 hours with 50% cost savings
3. An OLTP database emits 50k to 200k write operations per second via CDC into Kafka; Hudi continuously upserts them into S3 tables with minutes of latency (sketched below)
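One way the third example could look is Spark Structured Streaming feeding a Hudi table. The Kafka brokers, topic, event schema, and paths below are assumptions for illustration; many teams instead run Hudi's bundled streaming ingestion utility (DeltaStreamer) for CDC pipelines.

```python
# Hypothetical sketch: continuously upsert CDC-style events from Kafka into a Hudi table on S3.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()  # Hudi-enabled session as in the earlier sketches

# Assumed shape of the CDC events on the Kafka topic.
event_schema = StructType([
    StructField("uuid", StringType()),
    StructField("ts", StringType()),
    StructField("city", StringType()),
])

# Continuously read change events from Kafka and parse the JSON payload.
events = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders-cdc")
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*"))

# Upsert the stream into the Hudi table; each micro-batch becomes a commit on the timeline.
(events.writeStream.format("hudi")
    .option("hoodie.table.name", "orders")
    .option("hoodie.datasource.write.recordkey.field", "uuid")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/orders")
    .outputMode("append")
    .start("s3a://my-bucket/lake/orders"))
```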