Apache Hudi for Incremental Processing
What is Apache Hudi?
Definition
Apache Hudi (Hadoop Upserts Deletes and Incrementals) brings database-like record-level updates, deletes, and change tracking to data lakes stored on object storage such as S3.
✓ In Practice: Think of Hudi as bringing database UPDATE and DELETE capabilities to your data lake while keeping the cost and scale benefits of cheap object storage and columnar formats.
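As a rough illustration, an upsert is usually issued through Hudi's Spark datasource. This is a minimal sketch only; the table name, S3 path, and columns (uuid, ts, city) are hypothetical, and exact options can vary by Hudi and Spark version.

```python
# Minimal sketch of a Hudi upsert with PySpark.
# Table name, bucket, and columns (uuid, ts, city) are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert-sketch")
    # Hudi's Spark bundle must be on the classpath; Kryo is the recommended serializer.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Changed or new records to merge into the lake table.
updates = spark.createDataFrame(
    [("id-001", "2024-01-02 10:00:00", "berlin"),
     ("id-002", "2024-01-02 10:05:00", "tokyo")],
    ["uuid", "ts", "city"],
)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "uuid",   # primary key used by the index
    "hoodie.datasource.write.precombine.field": "ts",    # latest ts wins on key collisions
    "hoodie.datasource.write.operation": "upsert",       # insert new keys, update existing ones
}

# Append mode with operation=upsert rewrites only the files that contain matching keys.
(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3a://my-bucket/lake/orders"))
```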
💡 Key Takeaways
✓Hudi enables record-level upserts and deletes on data lake files by maintaining indexes that map primary keys to file locations
✓The timeline tracks all commits as immutable instants, providing snapshot isolation and enabling incremental queries that return only changed rows
✓Processing only changed data cuts compute time dramatically: Uber Eats cut a 12+ hour pipeline to under 4 hours while roughly halving costs
✓Hudi works on standard object storage like S3 with columnar Parquet, avoiding expensive proprietary storage while gaining update capabilities
✓Queries can read the latest snapshot, read only the read-optimized base files (tolerating some staleness), or pull incremental changes since the last checkpoint, as in the sketch below
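A hedged sketch of the incremental query style: ask only for rows committed after a stored timeline instant, so downstream jobs process a delta instead of rescanning the table. The begin-instant value and table path are placeholders.

```python
# Sketch of an incremental read: pull only rows that changed after a given commit instant.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # same Hudi-enabled session as in the sketch above

incr_options = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "20240101000000",  # last processed checkpoint
}

# Returns only rows written by commits after the given instant on the timeline.
changes = (spark.read.format("hudi")
    .options(**incr_options)
    .load("s3a://my-bucket/lake/orders"))

changes.show()
```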
📌 Examples
1. A 10 TB table with 5% daily changes: a traditional approach rescans all 10 TB and takes 10+ hours; Hudi processes only the 500 GB that changed
2. Uber Eats menu table: 11 billion records, 500 million changing daily. Pipeline time dropped from 12+ hours to under 4 hours with 50% cost savings
3. An OLTP database emits 50k to 200k write operations per second via CDC into Kafka; Hudi continuously upserts them into S3 tables with minutes of latency (sketched below)
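One way the third example could look is Spark Structured Streaming feeding a Hudi table. The Kafka brokers, topic, event schema, and paths below are assumptions for illustration; many teams instead run Hudi's bundled streaming ingestion utility (DeltaStreamer) for CDC pipelines.

```python
# Hypothetical sketch: continuously upsert CDC-style events from Kafka into a Hudi table on S3.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()  # Hudi-enabled session as in the earlier sketches

# Assumed shape of the CDC events on the Kafka topic.
event_schema = StructType([
    StructField("uuid", StringType()),
    StructField("ts", StringType()),
    StructField("city", StringType()),
])

# Continuously read change events from Kafka and parse the JSON payload.
events = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders-cdc")
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*"))

# Upsert the stream into the Hudi table; each micro-batch becomes a commit on the timeline.
(events.writeStream.format("hudi")
    .option("hoodie.table.name", "orders")
    .option("hoodie.datasource.write.recordkey.field", "uuid")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/orders")
    .outputMode("append")
    .start("s3a://my-bucket/lake/orders"))
```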