
Failure Modes and Production Challenges

Compaction Falling Behind: The most common production failure with Merge on Read tables happens when compaction cannot keep pace with ingestion. Log files accumulate, and queries must merge dozens of deltas with base files. Concrete numbers: a query that normally scans base Parquet in 5 seconds might take 3 minutes when merging 50 log files. Compute costs spike as readers burn cycles on merge operations instead of simple columnar scans. The usual trigger is a workload scaling 10x: a compaction schedule tuned for 2 TB per day might handle log accumulation fine, but at 20 TB per day it falls over without adjustments to parallelism, frequency, or resource allocation (see the configuration sketch after the chart below).
Compaction Lag Impact (query latency for the same scan): normal, base file only, ~5 seconds; 10 log files per base file, ~30 seconds; 50 log files, ~3 minutes.
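A minimal sketch of what tightening compaction for a Merge on Read table might look like through the Spark datasource, assuming PySpark and a hypothetical orders table; the table name, paths, delta-commit threshold, and IO budget are illustrative assumptions, not recommendations:

```python
# Hedged sketch: tightening compaction for a MOR table via Spark datasource options.
# Table name, paths, and threshold values below are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-compaction-tuning")
    # The Hudi Spark bundle must already be on the classpath.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "order_date",
    # Schedule and run compaction inline after fewer delta commits so log files
    # cannot pile up toward the 50-file range described above.
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "2",  # illustrative; default is higher
    # Give each compaction run a larger IO budget (MB) so it keeps pace at higher ingest rates.
    "hoodie.compaction.target.io": "1024000",
}

incoming = spark.read.parquet("s3://example-bucket/incoming/orders/")  # hypothetical source
(incoming.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-bucket/lake/orders"))
```

Inline compaction is shown only because it is the simplest knob to demonstrate; high-throughput pipelines more often run compaction asynchronously (a separate job, or the async compaction option on streaming writes) so ingestion latency is not coupled to compaction work.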
Concurrency Conflicts: Hudi uses optimistic concurrency control with locking to coordinate multiple writers, compaction, and clustering. Misconfigured locks lead to write conflicts or partially applied operations. At high scale, with multiple parallel writers ingesting 100k records per second, coordination must be configured explicitly. A common failure: two writers try to update the same file group simultaneously; without proper locking, one write can be silently lost or the timeline can be left in a corrupt state. The solution requires distributed locking (often using DynamoDB, ZooKeeper, or the Hive metastore) and careful serialization of conflicting operations; a hedged lock configuration sketch follows below.
Checkpoint Loss and Reprocessing: Incremental queries depend on consumers storing their last processed commit instant. If that checkpoint is lost or corrupted, the consumer faces a dilemma: rewind to a safe old instant (reprocessing days or weeks of data and causing duplicates downstream) or skip ahead (potentially missing critical updates). Production systems store checkpoints in highly durable metadata stores such as relational databases or DynamoDB, with backup strategies. Downstream consumers must be designed for idempotent processing so that reprocessing the same commit multiple times produces correct results.
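For the multi-writer scenario above, a hedged sketch of the kind of optimistic-concurrency and lock-provider options typically added to each writer; the ZooKeeper endpoint, lock path, and key are placeholder assumptions:

```python
# Hedged sketch: enabling optimistic concurrency control with a ZooKeeper-based
# lock provider so two writers cannot commit conflicting changes to the same
# file group. Endpoints and key names are placeholder assumptions.
multi_writer_options = {
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    # With multiple active writers, failed writes must be cleaned lazily so one
    # writer's rollback does not remove another writer's in-flight data.
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.write.lock.provider":
        "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
    "hoodie.write.lock.zookeeper.url": "zk1.example.internal",
    "hoodie.write.lock.zookeeper.port": "2181",
    "hoodie.write.lock.zookeeper.base_path": "/hudi/locks",
    "hoodie.write.lock.zookeeper.lock_key": "orders",
}
# Each concurrent writer merges these options into its normal Hudi write options;
# a DynamoDB- or Hive-metastore-based lock provider can be swapped in the same way.
```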
❗ Remember: A consumer that loses checkpoint and rewinds 1 week on a table with 10 GB daily changes will reprocess 70 GB instead of 10 GB, potentially overwhelming downstream systems and creating duplicate aggregations.
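To make the checkpointing pattern concrete, a hedged sketch of an incremental consumer that reads everything after its last durable checkpoint and advances the checkpoint only after downstream processing succeeds; the load_checkpoint/save_checkpoint helpers (backed by something durable such as a relational database or DynamoDB) and the table path are hypothetical:

```python
# Hedged sketch: incremental pull driven by a durably stored commit instant.
# load_checkpoint()/save_checkpoint() are hypothetical helpers backed by a
# durable store (e.g. a relational database or a DynamoDB table with backups).
from pyspark.sql import functions as F

TABLE_PATH = "s3://example-bucket/lake/orders"   # hypothetical table location

def process_new_commits(spark, load_checkpoint, save_checkpoint, sink):
    begin_instant = load_checkpoint() or "000"   # "000" rewinds to the table's start

    incremental = (
        spark.read.format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", begin_instant)
        .load(TABLE_PATH)
    )
    if incremental.rdd.isEmpty():
        return

    # The downstream write should be idempotent (e.g. a keyed merge/upsert) so
    # replaying the same commits after a checkpoint rollback does not create
    # duplicate aggregations.
    sink.write_idempotent(incremental)   # hypothetical idempotent sink interface

    # Advance the checkpoint only after the sink succeeds, using the newest
    # commit instant observed in this batch.
    latest_instant = incremental.agg(F.max("_hoodie_commit_time")).first()[0]
    save_checkpoint(latest_instant)
```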
Late Arriving Data: When events arrive out of order or late, Hudi's default last-write-wins merge strategy can produce incorrect results. If your correctness depends on event time rather than processing time, you must configure custom merge logic. Example: an order update with event time 10:05 arrives after an update with event time 10:10. Default merging based on commit time lets the older event overwrite the newer one, producing stale data. Active-active databases replicating across regions face this constantly. The solution requires application-level conflict resolution policies, often using vector clocks or event-time-based merging.
Metadata Bloat: Keeping long retention of historical commits enables time travel queries but can bloat metadata. Organizations sometimes set 30 or 90 day retention without realizing that listing commits or running cleanup operations slows dramatically. At very large scale, timeline operations that took milliseconds at 100 commits might take seconds at 10,000 commits. The fix is tuning retention policies to balance time travel needs against metadata performance; a hedged sketch of both knobs follows below.
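A hedged sketch of the two configuration angles just described: ordering merges by an event-time column rather than arrival order, and bounding commit retention so the timeline stays small. Field names and retention counts are illustrative assumptions, not recommendations:

```python
# Hedged sketch: event-time-based merging plus bounded commit retention.
# Field names and retention numbers are illustrative assumptions.

event_time_merge_options = {
    # Use the event-time column to decide which version of a record wins,
    # so a late-arriving 10:05 update cannot overwrite a 10:10 update.
    "hoodie.datasource.write.precombine.field": "event_time",
    # DefaultHoodieRecordPayload compares the ordering field against the record
    # already stored in the table, not just within the incoming batch.
    "hoodie.datasource.write.payload.class":
        "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
    "hoodie.payload.ordering.field": "event_time",
}

retention_options = {
    # Keep only as many commits as the time-travel window actually requires,
    # then let cleaning and archival trim the active timeline.
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": "72",
    "hoodie.keep.min.commits": "80",
    "hoodie.keep.max.commits": "100",
}
# Both dictionaries are merged into the writer's Hudi options alongside the
# table name, record key, and partition path settings shown earlier.
```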
💡 Key Takeaways
Compaction lag is the top MOR failure mode. Query latency increases from 5 seconds to 3+ minutes when 50 log files accumulate per base file
Concurrency conflicts occur with multiple writers at high scale. Requires distributed locking via DynamoDB, ZooKeeper, or Hive metastore to prevent lost writes
Checkpoint loss forces consumers to reprocess historical data. Losing 1 week of checkpoints means reprocessing 70 GB instead of a single 10 GB daily delta
Late-arriving, out-of-order events can produce incorrect results under default last-write-wins merging. Requires custom event-time-based merge logic or vector clocks
Metadata bloat from long retention slows timeline operations. 10,000 commits can increase cleanup from milliseconds to seconds, requiring retention policy tuning
📌 Examples
1. Workload scales from 2 TB to 20 TB daily ingestion: existing compaction schedule falls behind, log files accumulate, and queries slow 36x
2. Two parallel writers update the same file group without locks: one write is silently lost, causing missing records in downstream analytics
3. Consumer checkpoint corrupted during storage failure: system rewinds to a 7-day-old instant, reprocessing 70 GB and creating duplicate aggregations downstream