Lambda Architecture Failure Modes and Edge Cases
Logic Divergence Between Layers:
The most insidious failure in Lambda systems is subtle divergence between batch and speed implementations. Imagine a revenue calculation that rounds differently: batch uses banker's rounding, speed uses standard rounding. For 5 million transactions per day with average value of 20 dollars, this creates a discrepancy of roughly 2,500 to 5,000 dollars daily. Users see one number in real time dashboards, then that number changes after midnight when batch results arrive.
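To make the mismatch concrete, here is a minimal Python sketch of the two rounding policies disagreeing on the same transaction; the function names and amounts are illustrative assumptions, not production code.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

def batch_round(amount: Decimal) -> Decimal:
    # Batch path: banker's rounding (round half to even), common in SQL warehouses
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)

def speed_round(amount: Decimal) -> Decimal:
    # Speed path: standard round-half-up
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

amount = Decimal("19.985")   # illustrative transaction value
print(batch_round(amount))   # 19.98 (half rounds to the even digit)
print(speed_round(amount))   # 19.99 (half rounds up)
# Across roughly 5 million transactions per day, cent-level disagreements like
# this compound into the thousands-of-dollars daily discrepancy described above.
```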
This happens because the two paths are often implemented by different teams or using different languages. Batch might be SQL on a data warehouse, speed might be Java in a streaming framework. A schema change (adding a discount field to revenue calculations) gets deployed to streaming but forgotten in the batch SQL script. The divergence goes unnoticed until finance reports don't reconcile.
For systems where late data matters (financial, compliance), you need to either extend the speed layer window to 24 to 48 hours (increasing memory requirements 4x to 8x) or accept that speed layer numbers are provisional and only batch results are authoritative. The latter is common: dashboards show "preliminary" for today's data and "final" once batch processing completes.
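A minimal sketch of that "provisional vs final" serving pattern; the view objects, watermark, and field names are hypothetical placeholders for whatever your serving layer actually exposes.

```python
from datetime import date

def revenue_for(day: date, batch_high_watermark: date, batch_view: dict, speed_view: dict):
    """Serve final batch numbers when the batch layer covers the requested day,
    otherwise serve speed layer numbers labeled as provisional."""
    if day <= batch_high_watermark:
        # The batch layer has fully processed this day: its result is authoritative.
        return {"revenue": batch_view[day], "status": "final"}
    # Only the speed layer has seen this day so far; the figure may still change
    # once the nightly batch run lands (late events, reprocessing).
    return {"revenue": speed_view.get(day, 0.0), "status": "provisional"}
```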
Speed Layer Outages:
What happens when your streaming cluster fails for 30 minutes during peak traffic? Events are still arriving at 200 thousand per second, being written to the durable log, but the speed layer isn't processing them. Users querying current state see stale data.
You have three options. First, serve only batch data with a warning that freshness is degraded ("showing data up to 1am, current time is 2:30pm"). Second, implement a fallback where queries against the speed layer time out and automatically retry against slightly delayed but available replicated state. Third, design speed layer state to be quickly rebuildable: when the cluster recovers, it replays the missed 30 minutes of events at high throughput (perhaps 10x the normal rate) to catch up in 3 to 5 minutes.
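The second option (timeout plus fallback to a lagged replica) might look roughly like this sketch; `primary_speed`, `replica_speed`, and the timeout budget are assumed stand-ins for your actual clients and SLOs.

```python
import concurrent.futures

SPEED_TIMEOUT_S = 0.2  # assumed per-query budget before falling back
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def query_current_state(key, primary_speed, replica_speed):
    """Bound the speed layer read and retry against a replica on timeout.
    primary_speed and replica_speed are hypothetical key-value clients."""
    future = _pool.submit(primary_speed.get, key)
    try:
        return {"value": future.result(timeout=SPEED_TIMEOUT_S),
                "freshness": "primary"}
    except concurrent.futures.TimeoutError:
        future.cancel()  # best effort; the slow call may still finish in the background
        # The replica lags the primary by 5 to 30 seconds but is available.
        return {"value": replica_speed.get(key),
                "freshness": "replica (5 to 30 seconds behind)"}
```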
Netflix's approach for stream processing failures involves multiple independent speed layer clusters in different availability zones. If one fails, serving switches to another with 5 to 30 seconds of additional lag. This costs 2x to 3x on speed layer infrastructure but provides high availability for critical real time features.
Batch Processing Delays and Data Skew:
Batch jobs sometimes run long due to data skew or unexpected volume. A daily job that normally takes 35 minutes might take 2 hours if a single partition has 10x more data due to a viral event or bot traffic. This delays the cutoff time when serving switches from speed to batch results.
Users notice this as extended "provisional" periods. A dashboard that normally shows final numbers by 1am might show provisional numbers until 3am when the delayed batch job completes. For time sensitive use cases (payouts that must run by 2am), you need either over provisioned batch clusters with 2x to 3x normal capacity headroom, or partition pruning and sampling that trades slight accuracy for consistent latency (process 95 percent of data on time, backfill the remaining 5 percent later).
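The "process 95 percent on time, backfill the rest" option could be sketched as a simple planning step that defers skewed partitions past the deadline; the partition ids, sizes, and capacity budget below are illustrative assumptions.

```python
def plan_batch_run(partition_sizes: dict, capacity_rows: int):
    """Split partitions into an on-time set that fits the capacity budget and a
    deferred set that gets backfilled after the deadline."""
    on_time, deferred, budget = [], [], capacity_rows
    # Take small partitions first so one skewed partition cannot blow the SLA.
    for pid, rows in sorted(partition_sizes.items(), key=lambda kv: kv[1]):
        if rows <= budget:
            on_time.append(pid)
            budget -= rows
        else:
            deferred.append(pid)  # backfilled later, accepting temporary undercounting
    return on_time, deferred

# A viral partition with roughly 10x the usual volume gets deferred:
print(plan_batch_run({"p0": 1_000, "p1": 1_200, "viral": 12_000}, capacity_rows=4_000))
# (['p0', 'p1'], ['viral'])
```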
Schema Evolution:
Adding a new field to events requires coordinating updates across the entire pipeline. Producers start emitting the new field. Both batch and speed consumers must handle events with and without the field (old events in the log don't have it, new events do). Batch jobs reading historical partitions encounter mixed schemas.
The safest pattern is: producers emit the new field as optional, both consumers add handling for it with default values, wait one full retention period (30 to 90 days) for old events to age out, then make the field required. Skipping steps causes partial failures where one path processes the field correctly and the other doesn't, leading to divergence.
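A sketch of the consumer side of that rollout, assuming the new field is a `discount` applied to revenue (the field name and default are illustrative): both paths must read it with the same default so mixed-schema data stays consistent.

```python
DISCOUNT_DEFAULT = 0.0  # assumed default while old events still lack the field

def revenue_from_event(event: dict) -> float:
    """Read the optional field with the same default in both the batch and
    speed consumers so mixed-schema partitions stay consistent."""
    gross = event["amount"]
    discount = event.get("discount", DISCOUNT_DEFAULT)  # absent on pre-rollout events
    return gross - discount

old_event = {"amount": 20.0}                   # emitted before the schema change
new_event = {"amount": 20.0, "discount": 2.5}  # emitted after producers upgraded
print(revenue_from_event(old_event), revenue_from_event(new_event))  # 20.0 17.5
```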
❗ Remember: In production, set up automated reconciliation that compares batch and speed aggregates on 10 to 20 key metrics every hour. Alert if variance exceeds 0.1 to 1 percent. This catches divergence before it reaches users.
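A minimal reconciliation sketch along those lines; the metric names, threshold, and `alert` callable are placeholders, and in practice this would run as a scheduled hourly job against both serving views.

```python
VARIANCE_THRESHOLD = 0.005  # 0.5 percent, tuned within the 0.1 to 1 percent band above

def reconcile(batch_aggs: dict, speed_aggs: dict, alert):
    """Compare key metrics between the batch and speed layers and alert when
    relative variance exceeds the threshold. 'alert' is any callable
    (pager, chat hook, log)."""
    for metric, batch_value in batch_aggs.items():
        speed_value = speed_aggs.get(metric)
        if speed_value is None or batch_value == 0:
            continue  # metric missing from one layer, or nothing to compare against
        variance = abs(batch_value - speed_value) / abs(batch_value)
        if variance > VARIANCE_THRESHOLD:
            alert(f"{metric}: batch={batch_value} speed={speed_value} "
                  f"variance={variance:.2%}")

reconcile({"revenue": 100_000.0, "rides": 52_000},
          {"revenue": 100_600.0, "rides": 52_010},
          alert=print)  # flags revenue (0.60% variance), not rides (0.02%)
```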
Late Arriving Events:
Real world events arrive out of order and late. Mobile apps queue events offline and flush them when connectivity returns, sometimes hours later. A ride that happened at 2pm might not reach your system until 9pm due to the rider's phone being in airplane mode.
The speed layer typically maintains a sliding window, perhaps the last 6 hours. If that late event arrives 7 hours late, it falls outside the window and gets ignored by the speed layer. The batch layer will eventually account for it when it reprocesses the full day's data, but this creates temporary undercounting.
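The window check itself is simple; a sketch using the timeline below (the 6 hour window is the one from this example, and the timestamps are illustrative).

```python
from datetime import datetime, timedelta

SPEED_WINDOW = timedelta(hours=6)  # the speed layer's sliding window from above

def speed_layer_accepts(event_time: datetime, arrival_time: datetime) -> bool:
    """An event older than the window at arrival time is dropped by the speed
    layer and only recovered on the next batch run."""
    return arrival_time - event_time <= SPEED_WINDOW

ride_happened = datetime(2024, 6, 1, 14, 0)   # the 2pm ride
ride_arrived = datetime(2024, 6, 1, 21, 0)    # flushed at 9pm, 7 hours later
print(speed_layer_accepts(ride_happened, ride_arrived))  # False: outside the 6h window
```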
Late Event Timeline: event time 14:00 → arrives 21:00 → speed layer miss (7 hours late)
💡 Key Takeaways
✓ Logic divergence between batch and speed layers is caught by running automated hourly reconciliation comparing 10 to 20 key metrics, alerting when variance exceeds 0.1 to 1 percent
✓ Late events arriving beyond the speed layer window (commonly 6 to 24 hours) are missed in real time views but corrected when batch reprocessing runs, requiring "provisional vs final" labeling in dashboards
✓ Speed layer outages during peak traffic (200k events/sec) require either serving stale batch data, fallback to replicas with 5 to 30 seconds extra lag, or fast replay to catch up in 3 to 5 minutes
✓ Batch job delays due to data skew can extend "provisional" periods by 1 to 2 hours, requiring over provisioned clusters (2x to 3x capacity) or sampling strategies for consistent SLAs
📌 Examples
1. A ride sharing app discovers speed and batch disagree on daily revenue by 0.5 percent (about 50k dollars): investigation finds the speed layer handles refunds immediately while the batch layer defers them to the next day due to a SQL logic difference.
2. During a viral event, one partition receives 10x normal traffic, causing the batch job to run 90 minutes instead of 35 minutes. Driver payout reports are delayed by 1 hour, requiring a manual approval extension for the automated processing deadline.
3. Mobile events from users in airplane mode arrive 4 to 8 hours late, representing 2 to 3 percent of daily volume. The speed layer with its 6 hour window misses these entirely; batch reprocessing the next day adds them, causing historical metrics to be revised upward.