Unifying Batch and Streaming in Lakehouses
Lakehouses enable unified batch and streaming by treating unbounded streams as incremental commits to the same table abstraction. Instead of maintaining separate systems for real-time and batch processing, a single lakehouse table serves as the sink for both micro-batch streaming ingestion and large batch backfills, simplifying the architecture and reducing data duplication.
The key mechanism is idempotent upserts keyed on primary keys. Streaming pipelines commit small batches (1- to 5-minute intervals) with duplicate-detection windows to handle late-arriving events and retries; batch backfills write larger commits for historical data. Both use the same ACID commit protocol, ensuring consistency. Watermarks track event-time progress to bound late data and drive Service Level Agreement (SLA) completeness checks. For example, a watermark policy might declare an hour's data complete once the watermark advances 15 minutes past the hour boundary, allowing downstream consumers to trigger aggregations with bounded latency.
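A minimal sketch of this ingestion path, assuming PySpark Structured Streaming writing micro-batch upserts into a Hudi table: the Kafka endpoint, topic, schema, table name, checkpoint path, and storage location are all illustrative, partitioning and key-generator options are omitted for brevity, and the Hudi Spark bundle would need to be on the classpath.

```python
# Sketch: idempotent micro-batch ingestion with a dedup window and event-time watermark.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("lakehouse-streaming-ingest").getOrCreate()

schema = (StructType()
          .add("event_id", StringType())       # primary key used for idempotent upserts
          .add("user_id", StringType())
          .add("event_time", TimestampType()))

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # assumed endpoint
          .option("subscribe", "clickstream")                  # assumed topic
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*")
          # Bound state for late and duplicate events: tolerate 15 minutes of event-time lag.
          .withWatermark("event_time", "15 minutes")
          .dropDuplicates(["event_id", "event_time"]))

def upsert_batch(batch_df, batch_id):
    # Keyed, idempotent upsert: a replayed micro-batch overwrites rows by event_id
    # instead of appending duplicates. Table name and path are assumptions.
    (batch_df.write.format("hudi")
        .option("hoodie.table.name", "clickstream_events")
        .option("hoodie.datasource.write.recordkey.field", "event_id")
        .option("hoodie.datasource.write.precombine.field", "event_time")
        .option("hoodie.datasource.write.operation", "upsert")
        .mode("append")
        .save("s3://lake/clickstream_events"))

query = (events.writeStream
         .trigger(processingTime="5 minutes")        # commit interval: freshness vs. small files
         .option("checkpointLocation", "/checkpoints/clickstream")   # assumed path
         .foreachBatch(upsert_batch)
         .start())
```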
Uber's Apache Hudi deployment demonstrates this pattern at scale. Kafka streams ingest hundreds of TB per day across many tables with 5- to 15-minute end-to-end freshness. Merge-on-read mode enables fast upserts and deletes for Change Data Capture (CDC) use cases such as General Data Protection Regulation (GDPR) compliance without rewriting entire files. Background compaction runs continuously to merge delta logs, keeping read performance predictable despite high write velocity. The unified table serves both near-real-time analytics queries and batch Machine Learning (ML) feature generation from the same data.
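A rough configuration sketch of the merge-on-read and deletion behavior described above, using Hudi's Spark datasource options: the table name, key fields, compaction cadence, and paths are assumptions rather than Uber's actual settings, and partitioning/key-generator options are again omitted.

```python
# Sketch: merge-on-read table options plus a CDC-style GDPR delete.
from datetime import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-mor-sketch").getOrCreate()

mor_options = {
    "hoodie.table.name": "clickstream_events",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",     # upserts land in delta logs
    "hoodie.datasource.write.recordkey.field": "event_id",      # primary key
    "hoodie.datasource.write.precombine.field": "event_time",   # latest version wins
    "hoodie.datasource.write.operation": "upsert",
    # Fold delta logs back into base files every N commits so reads stay predictable.
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "10",
}

# GDPR-style erasure: write only the record keys with the delete operation, so the
# affected rows disappear from snapshot reads without rewriting whole files.
# (Depending on the Hudi version, the delete payload may need the key and ordering fields.)
ids_to_forget = spark.createDataFrame(
    [("evt-123", datetime(2024, 1, 1, 9, 30))],   # hypothetical keys to erase
    ["event_id", "event_time"])

(ids_to_forget.write.format("hudi")
    .options(**mor_options)
    .option("hoodie.datasource.write.operation", "delete")
    .mode("append")
    .save("s3://lake/clickstream_events"))
```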
The primary tradeoff is balancing freshness against small files. Frequent commits reduce latency but generate many small files, degrading read performance and increasing metadata overhead. Batching into larger commits (5 minutes instead of 30 seconds) improves efficiency but delays freshness. Continuous compaction of the most recent partitions mitigates the small-files problem but consumes compute resources. Careful tuning of commit intervals, compaction schedules, and compute autoscaling is essential to maintain both low-latency ingestion and efficient query performance at scale.
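A back-of-envelope sketch of how the commit interval drives pre-compaction file counts; the partition count and the one-file-per-partition-per-commit assumption are purely illustrative.

```python
# Illustrative arithmetic: shorter commit intervals multiply the number of small
# files (or delta logs) that compaction later has to fold together.
def files_per_day(commit_interval_s: float, partitions_touched_per_commit: int) -> int:
    """Assume each commit writes at least one new file per touched partition."""
    commits_per_day = 86_400 / commit_interval_s
    return int(commits_per_day * partitions_touched_per_commit)

for interval in (30, 300, 1800):   # 30 s, 5 min, 30 min commits
    print(f"{interval:>5} s commits -> ~{files_per_day(interval, 50):,} files/day "
          f"across 50 touched partitions (before compaction)")
```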
💡 Key Takeaways
• Unified table serves both streaming micro-batches (1- to 5-minute commits) and batch backfills with the same ACID protocol, eliminating separate real-time and batch systems
• Idempotent upserts via primary keys plus deduplication windows handle late-arriving events and retries, ensuring exactly-once semantics despite failures
• Watermarks track event-time progress to bound late data and trigger completeness signals (for example, an hour is complete 15 minutes after the hour boundary), enabling downstream SLA-driven aggregations
• Uber Apache Hudi: hundreds of TB per day ingestion, 5- to 15-minute end-to-end freshness, merge-on-read for CDC upserts, continuous compaction for predictable read performance
• Tradeoff freshness vs efficiency: 30-second commits minimize freshness lag but create many small files, 5-minute commits improve efficiency but delay freshness, and continuous compaction consumes compute
📌 Examples
Uber lakehouse with Hudi: Kafka streams to lake tables, 5- to 15-minute freshness, petabyte scale, merge-on-read upserts for GDPR deletes, background compaction prevents read degradation
Streaming plus batch pattern: ingest clickstream events every 2 minutes for real-time dashboards, run a nightly backfill job to reprocess the last 7 days with an updated ML model, both write to the same table with upsert semantics
Watermark-based completeness: streaming job commits events with event_time, watermark policy declares hour H complete once the watermark passes H plus 15 minutes, downstream aggregation job waits for the completeness signal before computing hourly metrics
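A minimal sketch of that completeness signal, assuming the 15-minutes-past-the-hour policy from the examples; the timestamps and lag value are illustrative.

```python
# Sketch: decide whether an hourly window is complete given the current event-time watermark.
from datetime import datetime, timedelta, timezone

COMPLETENESS_LAG = timedelta(minutes=15)   # assumed policy: hour closes 15 min after its end

def hour_is_complete(hour_start: datetime, watermark: datetime) -> bool:
    """Hour [hour_start, hour_start + 1h) is complete once the watermark has
    advanced at least COMPLETENESS_LAG past the hour's end."""
    return watermark >= hour_start + timedelta(hours=1) + COMPLETENESS_LAG

# Example: hour 09:00-10:00 UTC closes once the watermark reaches 10:15.
hour = datetime(2024, 1, 1, 9, tzinfo=timezone.utc)
print(hour_is_complete(hour, datetime(2024, 1, 1, 10, 10, tzinfo=timezone.utc)))  # False
print(hour_is_complete(hour, datetime(2024, 1, 1, 10, 16, tzinfo=timezone.utc)))  # True
```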