
How Lambda Architecture Works at Scale

The Lambda Design Pattern: Lambda architecture is the most common hybrid approach because it cleanly separates concerns. You maintain two independent processing paths, then merge their outputs in a serving layer.
1. Raw Data Log: All events flow into an immutable, append-only log. This is your single source of truth. Events are simultaneously written to long-term storage, such as object storage, for batch access.
2. Batch Layer: Periodic jobs, running hourly or daily, read from storage and recompute the entire dataset from scratch. They build partitioned fact tables and materialized aggregates with complete correctness.
3. Speed Layer: Streaming jobs consume from the real-time log and maintain incremental state using event-time processing and watermarks. They handle rolling windows and near-real-time aggregates.
4. Serving Layer: A query interface reads authoritative batch data up to a cutoff time T, then overlays streaming results from T to now. Users see a unified time series.
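The speed layer's event-time processing can be sketched as follows. This is a minimal illustration, not any particular engine's API: the class name, window size, and lateness bound are all invented for the example.

```python
from collections import defaultdict

ALLOWED_LATENESS = 10  # seconds an event may lag the watermark (illustrative)
WINDOW = 60            # tumbling-window size in seconds (illustrative)

class SpeedLayer:
    """Minimal speed-layer sketch: event-time tumbling windows with a watermark."""

    def __init__(self):
        self.windows = defaultdict(int)  # window start time -> running event count
        self.max_event_time = 0          # highest event time observed so far

    def watermark(self):
        # The watermark trails the max observed event time by the allowed lateness.
        return self.max_event_time - ALLOWED_LATENESS

    def process(self, event_time):
        self.max_event_time = max(self.max_event_time, event_time)
        if event_time < self.watermark():
            return False  # too late: drop (or route to a side output)
        # Assign the event to its tumbling window and update incremental state.
        self.windows[event_time - event_time % WINDOW] += 1
        return True
```

Out-of-order events within the lateness window still land in the correct event-time window; anything older than the watermark is rejected rather than silently corrupting a closed window.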
Production Scale Example: Consider a typical large-scale implementation handling terabytes daily. Your ingestion layer partitions events across 10 to 20 partitions per million events per second to achieve horizontal scalability. The batch path runs MapReduce-style jobs that might take 30 minutes to several hours, processing tens of gigabytes per second across your cluster. These jobs scan partitioned data like events/year=2024/month=01/day=15 and compute authoritative aggregates such as daily revenue per region or 30-day retention cohorts.
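The batch layer's full recomputation can be sketched in a few lines. The in-memory event list here stands in for rows scanned out of the date-partitioned storage described above; the function and field names are illustrative.

```python
from collections import defaultdict

def daily_revenue_by_region(events):
    """Batch-layer sketch: recompute daily revenue per region from scratch.

    `events` stands in for rows scanned from date-partitioned storage
    (e.g. paths like events/year=2024/month=01/day=15). Because the job
    rebuilds every (date, region) total from the raw log, reruns and
    late-arriving corrections are handled by simply recomputing.
    """
    totals = defaultdict(float)
    for e in events:
        totals[(e["date"], e["region"])] += e["revenue"]
    return dict(totals)
```

A rerun over the same inputs always yields the same output, which is what makes the batch layer's results authoritative.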
Processing Latency Targets
Stream p50: <1 sec
Stream p99: <5 sec
Batch jobs: 30+ min
The streaming path feeds online monitoring dashboards for Site Reliability Engineering (SRE) teams, real-time alerting systems, and online feature stores for machine learning models that need fresh signals within seconds.

The Merge Logic: Your serving layer implements time-based partitioning. For a query requesting the last 24 hours of data, it reads 23 hours and 45 minutes from the batch store (authoritative precomputed results) and the final 15 minutes from the real-time store (incremental streaming results). Some production systems materialize this merged view periodically to reduce query complexity. Others compute the merge on the fly using union queries with appropriate time filters.
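The time-filtered merge can be sketched as a function over two key-value stores. This is a simplified model, not a specific system's API: each store maps a timestamp to a precomputed aggregate, and all names are invented for illustration.

```python
def merged_series(batch_store, stream_store, cutoff_t, start, end):
    """Serving-layer sketch: batch results before cutoff T, streaming after.

    batch_store / stream_store: dicts mapping timestamp -> aggregate value.
    Returns one unified series for [start, end): timestamps before the
    cutoff come from the authoritative batch store, the rest from the
    incremental streaming store.
    """
    out = {}
    for t, v in batch_store.items():
        if start <= t < min(cutoff_t, end):
            out[t] = v
    for t, v in stream_store.items():
        if max(cutoff_t, start) <= t < end:
            out[t] = v
    return out
```

This mirrors the on-the-fly union-query approach: each store is read only for its own side of the cutoff, so overlapping estimates near T never double-count.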
💡 Key Takeaways
Batch layer is recomputational: it rebuilds outputs from scratch for a time range rather than applying deltas, which simplifies correctness and naturally handles late-arriving corrections
Stream layer maintains incremental state with event-time semantics and watermarks, allowing it to handle out-of-order events that arrive within an allowed lateness window
The cutoff time T between batch and streaming must be explicitly managed with idempotent upserts keyed by event identifier to prevent double counting or gaps
Lambda effectively doubles your infrastructure: separate code paths, schemas, and operational concerns for batch and stream, requiring careful coordination to prevent drift
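The idempotent-upsert takeaway above can be sketched with a store keyed by event identifier. This is a toy in-memory model (the class and method names are invented): the point is only that re-applying an event around the cutoff is a no-op.

```python
class ServingStore:
    """Sketch of idempotent upserts keyed by event id.

    When the batch layer replays events near the cutoff time T that the
    speed layer already wrote, keying writes by event id means the replay
    neither double-counts nor leaves gaps: last write wins per key.
    """

    def __init__(self):
        self.rows = {}  # event_id -> value

    def upsert(self, event_id, value):
        self.rows[event_id] = value  # re-applying the same event changes nothing

    def total(self):
        return sum(self.rows.values())
```

Contrast this with a naive append-and-sum store, where replaying the overlap window would inflate the total.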
📌 Examples
1. Ads platform with 1 billion daily active users processes 5 million events per second at peak, using streaming for 500-millisecond p99 fraud detection and batch for billing accuracy
2. Batch jobs partition by event_date and customer_id, computing aggregates like 30-day retention or monthly revenue with full-scan accuracy
3. Serving-layer query: SELECT SUM(revenue) from the batch store for now-24h to now-15m, UNION ALL the streaming store for now-15m to now