Failure Modes and Edge Cases
Batch-Specific Failures
Staleness is the classic batch failure. Your system precomputes recommendations overnight, then a flash sale starts at noon. The precomputed predictions recommend out-of-stock items for hours, until the next batch run. Revenue is lost and the user experience degrades. Mitigation: either shorten batch cycles (expensive, with diminishing returns) or add a lightweight online filter that removes unavailable items. The pattern: batch generates candidates with known staleness; online applies fresh filters.
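A minimal sketch of that pattern, assuming a request-path availability check (the names `filter_available` and `is_in_stock` are illustrative, not a real API):

```python
# Batch output is filtered online against a live inventory check before
# serving: batch owns candidate generation, online owns freshness.

def filter_available(candidates, is_in_stock):
    """Drop precomputed candidates that are no longer purchasable."""
    return [item for item in candidates if is_in_stock(item)]

# Recommendations the batch job wrote hours ago.
precomputed = ["sku-1", "sku-2", "sku-3"]

# Live inventory lookup (a stub here; in production, a fast cache or an
# inventory-service call on the request path).
stock = {"sku-1": True, "sku-2": False, "sku-3": True}

fresh = filter_available(precomputed, lambda sku: stock.get(sku, False))
# fresh == ["sku-1", "sku-3"]
```

The filter must be cheap enough to run on every request; anything heavier belongs back in the batch layer.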
The Straggler Catastrophe
Data skew causes a few partitions to take 10x longer than the median. Maybe 99% of users finish in 1 hour, but the 1% with massive interaction histories take 10 hours. Your job's completion time is gated by the slowest partition, so the batch window misses its cutoff. Downstream systems then see partial writes: a mix of yesterday's predictions and today's, depending on the user.
Production solution: use speculative execution (launch duplicate tasks for slow partitions) or cap per-entity work (process only the last N interactions). Snapshot semantics help: write to predictions_v124, validate coverage, then atomically switch consumers.
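A sketch of the cap and the write-then-switch pattern, under the assumption that consumers resolve the current snapshot through a single pointer record (`pointer_store` and the function names are hypothetical):

```python
def cap_history(interactions, max_n=1000):
    """Bound per-user work: heavy users contribute only their most
    recent N events, so no partition can run 10x past the median."""
    return interactions[-max_n:]

def publish_snapshot(pointer_store, table, rows_written, rows_expected,
                     min_coverage=0.99):
    """Write-then-switch: validate coverage first, then atomically
    repoint consumers. On failure, the old snapshot keeps serving."""
    if rows_written < min_coverage * rows_expected:
        raise RuntimeError("coverage check failed; keep previous snapshot")
    pointer_store["current"] = table  # single atomic pointer swap

pointer = {"current": "predictions_v123"}
publish_snapshot(pointer, "predictions_v124",
                 rows_written=995_000, rows_expected=1_000_000)
# pointer["current"] == "predictions_v124"
```

The pointer swap is what prevents downstream systems from ever seeing a mix of yesterday's and today's predictions: they read either the old table or the new one, never both.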
Real-time Tail Latency Spikes
Cold starts destroy p99 latency. A new instance takes 5 to 30 seconds to load a large model into memory. During this window, requests either time out or queue up, causing cascading delays. When the instance finally comes online, it processes the backlog but every request has already breached Service Level Objectives (SLOs). Solution: maintain warm pools. Keep a fraction of capacity always loaded and ready. Yes, you pay for idle capacity, but you buy p99 latency protection. Autoscaling helps for gradual traffic growth but cannot save you from sudden spikes.
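The warm-pool idea can be sketched as a pool of pre-loaded model instances; `WarmPool` and `load_model` are illustrative names, and a real implementation would sit at the orchestrator level (pre-provisioned replicas), not inside one process:

```python
class WarmPool:
    """Keep N instances with the model already in memory, so a traffic
    spike is served by a loaded instance instead of a 5-30s cold start."""

    def __init__(self, load_model, size):
        # Pay the load cost up front, once per pooled instance.
        self.ready = [load_model() for _ in range(size)]

    def acquire(self):
        if self.ready:
            return self.ready.pop()
        # Pool exhausted: the caller must shed load or accept a cold start.
        raise RuntimeError("warm pool exhausted")

    def release(self, instance):
        self.ready.append(instance)

pool = WarmPool(load_model=lambda: "loaded-model", size=2)
instance = pool.acquire()   # returns instantly; no load on the request path
pool.release(instance)
```

The trade-off from the text is visible in the constructor: `size` instances sit idle, and that idle cost is the price of p99 protection.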
Feature Unavailability at Inference Time
Training-serving skew is the nightmare scenario. Your model trains on batch features computed with 24-hour aggregation windows. At serving time, the feature store has a 2-minute replication lag and returns stale or missing features. Model accuracy drops 20% in production compared to offline validation.
This is especially painful for real-time systems. A fraud model expects the transactions_last_hour feature, but the feature store has a 5-minute lag. Fresh fraud patterns slip through because the model cannot see recent activity.
Mitigation: lock down feature definitions with schema validation. Compute identical transformations offline and online. Monitor feature freshness and coverage. Have fallback logic for when features are missing: default values, cached last-known-good values, or degrading to a simpler model.
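A minimal sketch of that fallback chain, assuming feature-store entries carry a write timestamp (`get_feature` and the dict shapes are assumptions for illustration):

```python
def get_feature(feature_store, key, now_s, max_age_s, last_known_good,
                default=0.0):
    """Fresh value if available, else last-known-good, else a default.
    Monitoring which branch fires gives you the freshness/coverage signal."""
    entry = feature_store.get(key)
    if entry is not None and now_s - entry["ts"] <= max_age_s:
        last_known_good[key] = entry["value"]  # refresh the fallback cache
        return entry["value"]
    if key in last_known_good:
        return last_known_good[key]  # stale but recently valid
    return default  # degrade gracefully rather than fail the request

lkg = {}
store = {"transactions_last_hour": {"ts": 100, "value": 7}}

get_feature(store, "transactions_last_hour", now_s=130, max_age_s=60,
            last_known_good=lkg)   # fresh: returns 7
get_feature(store, "transactions_last_hour", now_s=500, max_age_s=60,
            last_known_good=lkg)   # too old: falls back to last-known-good 7
get_feature(store, "unknown_feature", now_s=130, max_age_s=60,
            last_known_good=lkg)   # missing entirely: returns the default
```

For the fraud example, the default branch is where you would instead route to a simpler model that does not need the lagging feature.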
Feedback Loops in Real-time Systems
Real-time decisions change the data distribution immediately. A recommendation model that always shows popular items makes them more popular, creating a filter bubble: new or niche content never gets exposure, and utility collapses over time. Another example: fraud models that block too aggressively train on their own decisions. Legitimate users get blocked and cannot complete transactions, so the model never sees the negative feedback, and the system becomes increasingly aggressive. Solution: inject exploration. Reserve 5 to 10% of traffic for random or diverse recommendations, and cap the feedback loop by limiting how much any one decision can influence future training data.
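The exploration slice can be sketched as an epsilon-style branch at serving time; `recommend` and its parameters are illustrative, not a specific library's API:

```python
import random

def recommend(ranked, candidate_pool, explore_rate=0.05, rng=None):
    """On a small slice of traffic, replace the model's ranked list with
    a random sample, so niche items still collect exposure and labels."""
    rng = rng or random.Random()
    if rng.random() < explore_rate:
        k = min(len(ranked), len(candidate_pool))
        return rng.sample(candidate_pool, k)  # exploration slice
    return ranked  # normal exploitation path

# 95% of requests get the model's ranking; ~5% get a random sample.
page = recommend(ranked=["popular-1", "popular-2"],
                 candidate_pool=["popular-1", "popular-2",
                                 "niche-1", "niche-2"])
```

Exploration traffic should also be flagged in the training logs, so the learner can weight (or at least identify) examples that came from random serving rather than the model's own choices.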
The Version Skew Problem
Hybrid systems have two inference paths: batch and online. When you deploy a new model, the batch job starts using version 2 immediately, but the online ranker still expects the version 1 feature schema. Predictions become nonsensical until both paths align. This gets worse with multiple models in a pipeline: the candidate generator uses model version N, the ranker uses version N+1, and each expects a different feature schema. Debugging is a nightmare because every component looks correct in isolation. Production solution: version everything: prediction schema, feature schema, and model version. Enforce that batch and online paths read from the same versioned feature store snapshots. Deploy with atomic cutover: both paths switch to the new version simultaneously, with instant rollback if metrics degrade.
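A sketch of "version everything" under the assumption that both paths resolve versions from one active-config record and every prediction carries its schema version (the names `ACTIVE` and `validate_prediction` are hypothetical):

```python
# One record is the source of truth for all active versions. Batch writer
# and online ranker both read it, so they cannot silently disagree; an
# atomic cutover is a single update to this record.
ACTIVE = {"model": "v2", "feature_schema": "v2", "prediction_schema": "v2"}

def validate_prediction(prediction, active=ACTIVE):
    """Reject skewed predictions loudly instead of serving nonsense."""
    got = prediction.get("schema_version")
    expected = active["prediction_schema"]
    if got != expected:
        raise ValueError(f"version skew: prediction has {got!r}, "
                         f"ranker expects {expected!r}")
    return prediction

validate_prediction({"schema_version": "v2", "scores": [0.91, 0.40]})  # ok
# validate_prediction({"schema_version": "v1", ...})  -> raises ValueError
```

Failing fast here is the point: a loud ValueError at the ranker is debuggable, while silently mixing v1 and v2 predictions is the "every component looks correct in isolation" trap.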
Time To Live and Cache Invalidation
Batch predictions have a Time To Live (TTL). Set it too long and you serve stale predictions; set it too short and you get cache misses, forcing expensive recomputation. Worse, if your online ranker's TTL is 10 minutes but the candidate cache's TTL is 1 hour, you keep re-ranking the same stale candidates. The user sees a recommended video list that changes every 10 minutes yet still includes videos they already watched, because the candidate set is an hour old. A confusing, broken experience.
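One defensive pattern for this mismatch: even when the candidate cache is stale, filter already-consumed items at re-rank time. A minimal sketch (the `rerank` function and its parameters are illustrative):

```python
def rerank(cached_candidates, watched, score):
    """Re-rank a possibly stale candidate set, dropping items the user
    has already consumed so a long candidate TTL cannot resurface them."""
    live = [c for c in cached_candidates if c not in watched]
    return sorted(live, key=score, reverse=True)

# Candidate set cached an hour ago; the user watched "video-b" since then.
scores = {"video-a": 0.2, "video-b": 0.9, "video-c": 0.5}
page = rerank(["video-a", "video-b", "video-c"],
              watched={"video-b"},
              score=scores.get)
# page == ["video-c", "video-a"]
```

This does not fix the TTL mismatch itself, only masks its most visible symptom; the real fix is deriving both TTLs from one freshness budget so the layers cannot drift apart.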