Failure Modes: What Breaks at Scale
At scale, both batch and streaming encounter subtle failure modes that expose the limits of their guarantees. Understanding these edge cases is critical for system design interviews.
Streaming Failure Mode: Backpressure and Lag Explosion: A traffic spike can increase input events per second by 10x. During a major product launch or cultural event, you might see 5 million events per second where you normally handle 500,000. If stream consumers cannot keep up, lag grows: the offset distance between what has arrived and what has been processed.
As lag grows, end-to-end latency blows past Service Level Agreements (SLAs). Your p99 latency might jump from 500 milliseconds to 5 minutes. Worse, state size can explode: if you maintain per-user counts in a 5-minute window but processing falls 30 minutes behind, you are now holding state for roughly 6x the intended time range. Memory usage spikes, garbage collection pauses lengthen, and the system can enter a death spiral in which slowing down makes it fall even further behind.
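A minimal sketch of lag-based monitoring, assuming you can read the latest broker offset and the consumer group's committed offset per partition; the threshold values are illustrative, not recommendations:

```python
# Sketch: compute per-partition consumer lag and flag SLA risk.
# latest_offsets / committed_offsets are illustrative inputs; in practice
# they come from the message bus broker and the consumer group respectively.

LAG_WARN = 100_000        # events behind, per partition, before warning
LAG_CRITICAL = 1_000_000  # events behind before paging someone

def check_lag(latest_offsets: dict[int, int],
              committed_offsets: dict[int, int]) -> dict[int, int]:
    """Return lag per partition; print alerts when thresholds are breached."""
    lags = {}
    for partition, latest in latest_offsets.items():
        lag = latest - committed_offsets.get(partition, 0)
        lags[partition] = lag
        if lag >= LAG_CRITICAL:
            print(f"CRITICAL: partition {partition} is {lag:,} events behind")
        elif lag >= LAG_WARN:
            print(f"WARN: partition {partition} is {lag:,} events behind")
    return lags

# Example: partition 2 has fallen far behind the head of the log.
check_lag({0: 5_000_000, 1: 5_100_000, 2: 9_000_000},
          {0: 4_950_000, 1: 5_050_000, 2: 7_500_000})
```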
Batch Failure Mode: Skew and Straggler Tasks: In batch processing, data is partitioned across many workers. If one partition contains 10x more data than others, that worker becomes the bottleneck. A job that should finish in 1 hour can take 6 hours because 99% of workers finish quickly but a few stragglers dominate.
Skew often comes from hot keys: one popular user generating 1 million events while typical users generate 100. Geographic concentration is another source: one region with 50% of global traffic creates imbalanced partitions.
Solution Patterns: Salting adds random suffixes to hot keys to spread them across partitions (sketched below). Speculative execution launches duplicate copies of slow tasks and keeps whichever finishes first. Adaptive partitioning repartitions data based on the observed key distribution.
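A sketch of salting with a two-stage count; the salt count, the hot-key set, and the `#` separator are arbitrary choices for illustration:

```python
import random
from collections import defaultdict

NUM_SALTS = 8  # illustrative: spread each hot key across 8 sub-keys

def salted_key(key: str, hot_keys: set[str]) -> str:
    """Append a random suffix so one hot key no longer maps to one partition."""
    if key in hot_keys:
        return f"{key}#{random.randrange(NUM_SALTS)}"
    return key

def two_stage_count(event_keys, hot_keys):
    # Stage 1: partial counts keyed by salted key (runs in parallel per partition).
    partial = defaultdict(int)
    for key in event_keys:
        partial[salted_key(key, hot_keys)] += 1
    # Stage 2: strip the salt and merge partial counts into final per-key totals.
    final = defaultdict(int)
    for skey, count in partial.items():
        final[skey.split("#")[0]] += count
    return dict(final)

# One hot user dominating the stream; salting spreads its load across salts.
events = ["hot_user"] * 1_000 + ["user_a", "user_b"] * 5
print(two_stage_count(events, hot_keys={"hot_user"}))
```

The first stage counts by salted key so a single hot key is spread across several partitions; the second, much smaller stage strips the salt and merges the partial counts.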
Batch Failure Mode: Backfill Cascades: When you change business logic, you must reprocess historical data. If your storage and compute are sized for daily workloads, a 6-month backfill can take weeks and block other jobs. Worse, if the backfill itself fails partway through, you can be left with partial, inconsistent state.
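A common mitigation is to break the backfill into idempotent per-day chunks and checkpoint progress, so a mid-run failure resumes cleanly instead of leaving partial output; a sketch, with `reprocess_day` standing in for the real job:

```python
from datetime import date, timedelta

def backfill(start: date, end: date, reprocess_day, completed: set,
             max_per_run: int = 30):
    """Reprocess one day at a time, skipping days already checkpointed.

    reprocess_day(d) is a placeholder for the real job; it must overwrite
    (not append to) that day's output so reruns are idempotent. max_per_run
    caps how much of the shared cluster a single run can consume.
    """
    day, done = start, 0
    while day <= end and done < max_per_run:
        if day not in completed:
            reprocess_day(day)   # overwrite that day's output partition
            completed.add(day)   # checkpoint; persist this in a real system
            done += 1
        day += timedelta(days=1)
    return completed

# A 6-month backfill resumes after a failure because completed days are skipped.
done = backfill(date(2024, 1, 1), date(2024, 6, 30),
                reprocess_day=lambda d: print("reprocessing", d),
                completed=set())
```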
Cross-Cutting Failure: Network Partitions and Regional Outages: Both streaming and batch rely on distributed components. A network partition between your message bus and stream consumers can cause data loss or divergence between the two paths. Regional outages in cloud providers can make entire partitions of data unavailable.
The hard question: how do you reconcile streaming and batch views when they diverge? Most systems designate batch as the source of truth and periodically reconcile streaming outputs against batch results, correcting drift.
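A sketch of that reconciliation step, assuming both paths produce per-key counts and treating the batch values as the source of truth; the 1% tolerance is an arbitrary illustration:

```python
def reconcile(stream_counts: dict, batch_counts: dict,
              tolerance: float = 0.01) -> dict:
    """Return corrected counts: keep streaming values within tolerance of batch,
    overwrite the rest with the batch (source-of-truth) values."""
    corrected = {}
    for key, batch_val in batch_counts.items():
        stream_val = stream_counts.get(key, 0)
        drift = abs(stream_val - batch_val) / max(batch_val, 1)
        corrected[key] = stream_val if drift <= tolerance else batch_val
    return corrected

# Streaming overcounted "purchases" by ~3% (e.g. duplicates during an outage);
# reconciliation snaps it back to the batch value, signups stay as-is.
print(reconcile({"purchases": 10_300, "signups": 500},
                {"purchases": 10_000, "signups": 502}))
```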
❗ Remember: Backpressure is not just a performance issue. It is a cascading failure risk. Stream consumers falling behind trigger alerts, but recovery requires either scaling up rapidly (expensive) or shedding load (data loss).
Solution Patterns: Auto-scaling with lookahead: monitor lag and scale consumers out before it becomes critical. Graceful degradation: if lag exceeds 2 minutes, switch to sampling (process every 10th event) to catch up. Separate critical and non-critical streams: fraud detection gets dedicated high-priority resources, while analytics can tolerate lag.
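A sketch of that escalation policy as a single decision function; the thresholds and the 1-in-10 sampling rate mirror the numbers above and are otherwise arbitrary:

```python
def consumer_action(lag_seconds: float, lag_trend_per_min: float) -> str:
    """Pick a response to backpressure from current lag and its growth trend.

    Scaling out early, while lag is still growing, is cheaper than recovering
    from a death spiral; sampling trades completeness for catch-up speed.
    """
    if lag_seconds > 120:
        return "degrade: sample 1 in 10 events until caught up"
    if lag_seconds > 30 or lag_trend_per_min > 10:
        return "scale out: add consumers before lag becomes critical"
    return "steady state: no action"

print(consumer_action(lag_seconds=45, lag_trend_per_min=20))   # scale out
print(consumer_action(lag_seconds=300, lag_trend_per_min=5))   # degrade
```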
Streaming Failure Mode: Duplicate Events and the Exactly-Once Illusion: At-least-once delivery semantics mean retries can cause duplicate processing. A network glitch might cause the same purchase event to be processed twice. If your consumer increments a revenue counter without deduplication, metrics drift upward over time.
Achieving exactly-once semantics requires transactional coordination between the message bus, the processor state, and the output sink, which adds latency and complexity. Many systems settle for at-least-once processing with idempotent sinks: writing to a database with unique key constraints so duplicates are ignored.
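A minimal sketch of an idempotent sink built on a unique key constraint (SQLite is used here only to keep the example self-contained); redelivering the same event_id has no effect on the total:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE purchases (
        event_id TEXT PRIMARY KEY,   -- uniqueness makes writes idempotent
        amount_cents INTEGER NOT NULL
    )
""")

def record_purchase(event_id: str, amount_cents: int) -> None:
    # INSERT OR IGNORE drops rows that violate the primary key, so a retried
    # delivery of the same event leaves revenue totals unchanged.
    conn.execute("INSERT OR IGNORE INTO purchases VALUES (?, ?)",
                 (event_id, amount_cents))
    conn.commit()

record_purchase("evt-42", 1999)
record_purchase("evt-42", 1999)   # duplicate delivery: silently ignored
total = conn.execute("SELECT SUM(amount_cents) FROM purchases").fetchone()[0]
print(total)  # 1999, not 3998
```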
Duplicate Event Timeline: ~0.1% duplicates in normal operation → 3-5% during an outage → cumulative metrics drift of up to +5%.
💡 Key Takeaways
✓Traffic spikes increasing events per second by 10x cause lag to grow, state to explode (a 5-minute window effectively holding about 30 minutes of state, roughly 6x the intended range), and p99 latency to jump from 500ms to 5 minutes
✓At-least-once delivery causes a 0.1% duplicate rate normally, rising to 3-5% during outages; without idempotent sinks, metrics drift upward by that percentage
✓Batch skew from hot keys (one user with 1M events vs typical 100) extends job completion from 1 hour to 6 hours as stragglers dominate overall runtime
✓Backfills reprocessing 6 months of data on systems sized for daily workloads can take weeks and block other jobs, risking partial, inconsistent state on failure
✓Network partitions and regional outages cause divergence between streaming and batch views; reconciliation requires designating batch as source of truth and periodic correction
📌 Examples
1. During the Super Bowl, a streaming system at 5M events/sec falls 30 minutes behind, causing state size to grow 6x and triggering auto scaling that costs $10k in emergency compute
2. A payment processor without deduplication sees a 3% duplicate rate during network issues, inflating revenue metrics by $50k before reconciliation with batch results corrects them
3. A machine learning training backfill over 6 months on a cluster sized for daily jobs takes 3 weeks instead of the planned 2 days, blocking a model release