Big Data Systems • Lambda & Kappa Architectures
Failure Modes: Dual Pipeline Drift and Reprocessing Blast Radius
Dual pipeline drift is the Lambda Architecture's primary failure mode: batch and speed layer logic diverge over time, causing user-visible inconsistencies. Schema changes, business logic fixes, and feature additions get applied to one pipeline but not the other, or applied differently. Users see mismatched counts, with real-time dashboards showing 1.2 million events while next-day batch reports show 1.15 million, eroding trust in the data. The issue compounds because speed layer outputs are provisional, so when batch finally corrects them, metrics flip-flop, causing confusion and false alerts.
Mitigation requires treating transformation logic as a shared contract. Use shared libraries or specification-driven code generation so both pipelines execute identical transformations. Implement contract tests that sample time windows and compare speed-layer output against a batch recompute over the same events, alerting on divergence above a threshold such as 0.1 percent. Design deterministic reconciliation rules in which batch results with higher epoch versions explicitly retract stale speed records via watermark-based cutovers, preventing overlaps and gaps.
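A minimal sketch of such a contract test is below. The read_speed_counts and read_batch_counts helpers, the metric keys, and the stub values are hypothetical stand-ins for whatever stores actually back the two layers; the 0.1 percent threshold mirrors the one mentioned above.

```python
# Sketch of a cross-pipeline contract test: sample one event-time window,
# pull the same aggregates from the speed and batch layers, and alert on
# relative divergence above a threshold. Helpers are hypothetical.
from datetime import datetime, timedelta

DIVERGENCE_THRESHOLD = 0.001  # 0.1% relative difference triggers an alert


def check_pipeline_drift(window_start: datetime, window_end: datetime,
                         read_speed_counts, read_batch_counts) -> list[str]:
    """Compare speed-layer and batch-layer aggregates over one sampled window."""
    speed = read_speed_counts(window_start, window_end)
    batch = read_batch_counts(window_start, window_end)
    alerts = []
    for key in speed.keys() | batch.keys():
        s, b = speed.get(key, 0), batch.get(key, 0)
        if s == 0 and b == 0:
            continue
        diff = abs(s - b) / max(abs(b), 1)
        if diff > DIVERGENCE_THRESHOLD:
            alerts.append(f"{key}: speed={s} batch={b} diff={diff:.2%}")
    return alerts


if __name__ == "__main__":
    # Toy stand-ins for the real stores; here the batch layer "corrects" one key.
    speed_stub = lambda s, e: {"page_views": 1_200_000, "signups": 4_210}
    batch_stub = lambda s, e: {"page_views": 1_150_000, "signups": 4_210}
    end = datetime.utcnow()
    for alert in check_pipeline_drift(end - timedelta(hours=1), end,
                                      speed_stub, batch_stub):
        print("DRIFT ALERT:", alert)
```

In practice the sampled windows would be chosen after the batch layer's watermark, so the comparison is against settled batch output rather than a still-open window.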
Reprocessing blast radius in the Kappa Architecture occurs when replaying months of historical data overwhelms compute and I/O capacity. A bug fix requiring a replay of 90 days of events at full throughput can take days to complete, hogging cluster resources and throttling live processing. Downstream systems get flooded with backfill traffic, causing cascading failures. If log retention is insufficient or compaction has dropped tombstones, reprocessing becomes impossible or produces duplicates. The mitigation is maintaining tiered storage for long retention measured in weeks to months, throttling and isolating replays to separate consumer groups that write to shadow materialized views, and capacity planning for 1.5 to 3 times headroom so replay load can be absorbed without impacting live service level objectives (SLOs).
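A minimal sketch of the throttled, isolated replay is below, assuming a hypothetical event_log iterator over the historical log and a shadow_store sink for the shadow materialized view; the 50 percent throttle ratio and batch size are illustrative knobs, not fixed values.

```python
# Sketch of a throttled replay into a shadow view: consume historical events
# in batches and sleep between writes so the replay never exceeds its share
# of cluster throughput. event_log and shadow_store are hypothetical.
import time


def replay_to_shadow(event_log, shadow_store, live_events_per_sec: float,
                     throttle_ratio: float = 0.5, batch_size: int = 1000):
    """Replay historical events at a fraction of live throughput."""
    target_rate = live_events_per_sec * throttle_ratio  # events/sec budget
    batch, window_start = [], time.monotonic()
    for event in event_log:                  # e.g. 60-90 days of history
        batch.append(event)
        if len(batch) < batch_size:
            continue
        shadow_store.write(batch)            # isolated from live materialized views
        batch = []
        # Pace the loop: each batch must take at least batch_size / target_rate seconds.
        elapsed = time.monotonic() - window_start
        min_elapsed = batch_size / target_rate
        if elapsed < min_elapsed:
            time.sleep(min_elapsed - elapsed)
        window_start = time.monotonic()
    if batch:
        shadow_store.write(batch)
```

Running the replay under its own consumer group keeps its offsets and lag metrics separate from the live job, so operators can pause or slow it without touching production consumers.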
💡 Key Takeaways
• Dual pipeline drift causes user-visible inconsistencies when batch and speed logic diverge, e.g. a real-time dashboard showing 1.2 million events while the next-day batch report shows 1.15 million
• Shared transformation libraries and contract tests comparing speed versus batch on sampled windows catch divergence, alerting on differences above a 0.1 percent threshold
• Epoch versioning with watermark-based cutovers ensures batch results explicitly retract stale speed records, preventing double counts from overlapping windows
• Kappa reprocessing blast radius occurs when replaying 90 days of events takes days to complete, hogging compute and throttling live processing with cascading downstream failures
• Tiered storage with weeks-to-months retention, replays throttled into shadow views, and 1.5 to 3 times capacity headroom prevent reprocessing from violating live SLOs
📌 Examples
Lambda drift: a schema change adds a field to the speed layer while batch still uses the old schema, producing null values in batch output that don't appear in speed results; contract tests detect a 15 percent mismatch
Kappa blast radius: a bug fix requires replaying 60 days of 10 million events per day; replay at full speed takes 48 hours, saturates cluster I/O, and degrades live latency from 500 ms p99 to 5 second p99
Mitigation: spin up a separate consumer group for the replay writing to a shadow table, throttle it to 50 percent of live throughput, validate the shadow against canaries, and switch over atomically after completion
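A minimal sketch of that final validate-and-cutover step is below, assuming a hypothetical db wrapper exposing query_one()/execute(), a shadow_table populated by the replay, and canary_expectations holding independently verified values for a few known keys; the table and view names are illustrative.

```python
# Sketch of canary validation followed by an atomic cutover: check the replayed
# shadow table against known-good canary values, then repoint the serving view
# in a single statement so readers never observe a half-migrated state.


def validate_and_cutover(db, canary_expectations: dict, tolerance: float = 0.001):
    """Validate the shadow table against canaries, then swap the serving view."""
    for key, expected in canary_expectations.items():
        actual = db.query_one(
            "SELECT value FROM shadow_table WHERE key = %s", (key,))
        rel_err = abs(actual - expected) / max(abs(expected), 1)
        if rel_err > tolerance:
            raise RuntimeError(
                f"Canary {key} diverges: expected={expected} got={actual}")
    # Atomic switch: one DDL statement makes the replayed data live.
    db.execute("CREATE OR REPLACE VIEW serving_view AS SELECT * FROM shadow_table")
```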