Failure Modes and Edge Cases in Hybrid Systems
Where Hybrid Architectures Break:
Hybrid systems introduce failure modes that don't exist in pure batch or stream architectures. Understanding these edge cases is critical for production reliability.
Double Counting and Gaps at Boundaries:
The serving layer must merge batch output up to time T with streaming output after T. If the watermark or cutover time is miscomputed by even a few seconds, you face two catastrophic scenarios.
First, double counting: both batch and stream include the same events. Your revenue dashboard suddenly shows 180 percent of actual revenue. Second, gaps: neither layer includes events in the boundary window. Revenue appears to drop to zero for that period.
❗ Remember: Production systems must use explicit cutover times stored in metadata and implement idempotent upserts keyed by event_id to make overlapping writes safe.
This isn't theoretical. A payment processor discovered they were double billing 0.3 percent of transactions because their batch job's end timestamp overlapped with the streaming window by 90 seconds. At $2 billion monthly volume, that's $6 million in erroneous charges every month.
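A minimal sketch of that idea, assuming an in-memory dict stands in for the serving store and the cutover timestamp has already been read from a metadata table (the Event type, field names, and merge_serving_view function are illustrative, not any particular system's API):

```python
from dataclasses import dataclass

@dataclass
class Event:
    event_id: str      # globally unique identifier for the event
    event_time: float  # epoch seconds when the event occurred
    amount: float      # revenue attributed to the event

def merge_serving_view(batch_events, stream_events, cutover_ts, store):
    """Merge batch and streaming output around an explicit cutover time.

    - batch output is authoritative for event_time <= cutover_ts
    - streaming output is used for event_time > cutover_ts
    - writes are keyed by event_id, so overlap or replay cannot double count
    """
    for ev in batch_events:
        if ev.event_time <= cutover_ts:
            store[ev.event_id] = ev.amount   # idempotent: same key, same value
    for ev in stream_events:
        if ev.event_time > cutover_ts:
            store[ev.event_id] = ev.amount

    return sum(store.values())

# Example: an event that appears in both paths is counted exactly once.
store = {}
batch = [Event("e1", 100.0, 10.0), Event("e2", 200.0, 5.0)]
stream = [Event("e2", 200.0, 5.0), Event("e3", 300.0, 7.0)]
print(merge_serving_view(batch, stream, cutover_ts=150.0, store=store))  # 22.0
```

Even if the batch job overruns the cutover and emits "e2" as well, the key-based write guarantees the event contributes to the total only once.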
Schema Evolution Divergence:
Batch and streaming jobs are often owned by different teams and deployed through separate continuous integration/continuous deployment (CI/CD) pipelines. A schema change like splitting user_location into city and country gets applied to the batch logic but not the streaming logic.
The serving layer merge breaks silently. Aggregates by location from the two paths use incompatible schemas and produce nonsense results. Worse, this often isn't caught immediately because the system keeps running.
Production mitigation requires schema registries with versioning, automated compatibility tests between batch and stream outputs, and alerting on metric divergence between the two paths.
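A lightweight guard along those lines, sketched with hard-coded column maps (a real system would pull these from a schema registry), is a CI test that compares the output schemas of the two paths and fails the deploy if they diverge:

```python
def assert_schemas_compatible(batch_schema: dict, stream_schema: dict) -> None:
    """Fail fast if the batch and streaming outputs no longer agree.

    Schemas are modeled as {column_name: type_name} dicts.
    """
    missing_in_stream = set(batch_schema) - set(stream_schema)
    missing_in_batch = set(stream_schema) - set(batch_schema)
    type_mismatches = {
        col for col in set(batch_schema) & set(stream_schema)
        if batch_schema[col] != stream_schema[col]
    }
    problems = []
    if missing_in_stream:
        problems.append(f"columns only in batch output: {sorted(missing_in_stream)}")
    if missing_in_batch:
        problems.append(f"columns only in stream output: {sorted(missing_in_batch)}")
    if type_mismatches:
        problems.append(f"type mismatches: {sorted(type_mismatches)}")
    if problems:
        raise AssertionError("batch/stream schema divergence: " + "; ".join(problems))

# The user_location split applied only to the batch path trips the check:
batch_schema = {"city": "string", "country": "string", "revenue": "double"}
stream_schema = {"user_location": "string", "revenue": "double"}
try:
    assert_schemas_compatible(batch_schema, stream_schema)
except AssertionError as err:
    print(err)  # columns only in batch output: ['city', 'country'] ...
```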
Resource Contention:
Batch jobs scanning petabytes can saturate storage bandwidth at 50 to 100 gigabytes per second. If they run concurrently with latency-sensitive streaming jobs reading from the same storage or metadata services, you see streaming p99 latency spike from 50 milliseconds to 2 seconds.
This breaks Service Level Objectives (SLOs) for real-time use cases. The solution requires isolation: workload-aware schedulers, storage Quality of Service (QoS) with quotas, or physically separated storage tiers for batch and stream.
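As a rough sketch of the quota idea (the 2 gigabyte per second cap and chunk sizes are invented numbers; a real deployment would enforce this in the storage layer or scheduler rather than in client code), a token bucket limits how much bandwidth the batch scan may consume so streaming reads keep headroom:

```python
import time

class TokenBucket:
    """Simple token bucket: capacity in bytes, refilled at rate_bytes_per_s."""
    def __init__(self, rate_bytes_per_s: float, capacity_bytes: float):
        self.rate = rate_bytes_per_s
        self.capacity = capacity_bytes
        self.tokens = capacity_bytes
        self.last_refill = time.monotonic()

    def acquire(self, nbytes: float) -> None:
        """Block until nbytes of bandwidth budget is available."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)

# Cap the batch scan at 2 GB/s on shared storage.
batch_read_quota = TokenBucket(rate_bytes_per_s=2e9, capacity_bytes=4e9)
for chunk_size in [1e9, 1e9, 3e9]:       # pretend these are file chunks
    batch_read_quota.acquire(chunk_size)  # throttles before each read
    # read_chunk(...) would go here
```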
Late and Out of Order Events:
User devices go offline. Network partitions occur. You will see 1 to 5 percent of events arrive more than 30 minutes late in most production systems.
Streaming paths use watermarks with allowed lateness windows, typically 15 to 60 minutes. Events arriving after the window closes present a choice. You can feed them into a corrections stream that updates both streaming and batch outputs, adding complexity. Or you handle them purely in the next batch recomputation, meaning your recent dashboards are slightly wrong until corrected hours later.
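A toy illustration of that routing decision, in plain Python rather than a real stream processor, with a 30-minute allowed-lateness window and a deliberately simplified watermark: events inside the window update the live aggregate, events outside it go to a corrections stream.

```python
from collections import defaultdict

ALLOWED_LATENESS_S = 30 * 60     # events up to 30 minutes behind the watermark still count

live_counts = defaultdict(int)   # per-minute counts served to dashboards
corrections = []                 # late events handed to the next batch recomputation
watermark = 0.0                  # simplified watermark: highest event time seen so far

def on_event(event_time: float) -> None:
    """Route an event to the live aggregate or the corrections stream."""
    global watermark
    if event_time >= watermark - ALLOWED_LATENESS_S:
        window = int(event_time // 60)       # 1-minute tumbling window
        live_counts[window] += 1             # on time, or late but within allowed lateness
    else:
        corrections.append(event_time)       # window already closed: defer to batch path
    watermark = max(watermark, event_time)

# The second event is 45 minutes behind the watermark, so it skips the live view.
on_event(10_000.0)
on_event(10_000.0 - 45 * 60)
print(dict(live_counts), corrections)        # {166: 1} [7300.0]
```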
One social platform saw late event rates spike from 1 percent to 35 percent during a mobile network outage affecting 12 million users. Their streaming metrics were off by 40 percent for 6 hours until batch jobs caught up. If they hadn't explicitly handled this in their Service Level Agreement (SLA), stakeholders would have panicked at the apparent traffic drop.
Incident timeline: NORMAL (1 percent late) → OUTAGE (35 percent late) → RECOVERY (6 hours until batch reconciliation).
Reprocessing at Scale:
When you discover a bug in event generation or processing logic, you need to reprocess days or weeks of data. In Kappa or unified systems, this means replaying billions of events.
The danger: your online systems aren't rate limited for historical replay. A streaming job that normally processes 100,000 events per second suddenly hits a downstream feature store with 10 million events per second during replay.
"Replay storms are the distributed systems equivalent of a thundering herd. Your feature store melts, taking down production inference with it."
💡 Key Takeaways
✓ Cutover time misalignment causes double counting or gaps: a 90-second overlap at $2 billion monthly volume creates $6 million in erroneous charges, requiring idempotent upserts keyed by event identifier
✓ Late arrivals spike from 1 percent to 35 percent during network outages, causing 40 percent metric divergence for hours until batch reconciliation catches up, demanding explicit SLA definitions
✓ Schema evolution applied to only one path (batch or stream) silently breaks merge logic, requiring schema registries with versioning and automated compatibility tests between outputs
✓ Reprocessing without throttling creates replay storms that overwhelm downstream systems at 100x the normal rate, melting feature stores and breaking production inference
📌 Examples
1. Payment processor double billed 0.3 percent of transactions ($6 million monthly) because its batch end timestamp overlapped the streaming window by 90 seconds
2. Social platform with 12 million affected users saw the late event rate jump to 35 percent during a mobile outage; streaming metrics understated traffic by 40 percent for 6 hours
3. Replay of 7 days of events hit the feature store at 10 million events per second versus the normal 100,000 per second, causing cascading failures in the model serving layer