
Failure Modes and Edge Cases

Batch-Specific Failures: Staleness is the classic batch failure. Your system precomputes recommendations overnight, then a flash sale starts at noon. The precomputed predictions recommend out-of-stock items for hours until the next batch run: revenue is lost and the user experience degrades. Mitigation: either shorten the batch cycle (expensive, with diminishing returns) or add a lightweight online filter that removes unavailable items. The pattern: batch generates candidates with known staleness, and a thin online layer applies fresh filters, as in the sketch below.
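A minimal sketch of that batch-plus-online-filter pattern, assuming the batch job emits an ordered candidate list and a near-real-time feed supplies the currently available item IDs (both data sources and the function name are illustrative):
```python
from typing import Iterable

def filter_stale_candidates(
    candidate_ids: Iterable[str],
    available_ids: set[str],
    max_results: int = 20,
) -> list[str]:
    """Apply a cheap online filter over batch-precomputed candidates.

    The batch job precomputes a generous, possibly stale candidate list;
    at request time we drop anything no longer available (e.g. out of
    stock) before returning the top results.
    """
    fresh = [cid for cid in candidate_ids if cid in available_ids]
    return fresh[:max_results]

# Hypothetical usage: batch_candidates come from the nightly job,
# available_ids from a near-real-time inventory feed.
# recommendations = filter_stale_candidates(batch_candidates, available_ids)
```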
The Straggler Catastrophe: Data skew causes a few partitions to take 10x longer than the median. Maybe 99% of users finish in 1 hour, but the 1% with massive interaction histories take 10 hours. Job completion time is determined by the slowest partition, so the batch window misses its cutoff and downstream systems see partial writes: a mix of yesterday's predictions and today's for different users. Production solution: use speculative execution (launch duplicate tasks for slow partitions) or cap per-entity work (process only the last N interactions). Snapshot semantics help: write to predictions_v124, validate coverage, then atomically switch consumers.
Batch job completion timeline: median partition ~1 hour vs. straggler partition ~10 hours.
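A sketch of that write-validate-switch idea: write the new run to a versioned table, validate coverage, then flip a single pointer that consumers resolve. The db client, its methods, and the coverage threshold are assumptions for illustration, not a specific library's API.
```python
def publish_batch_predictions(db, new_version: int, expected_rows: int,
                              min_coverage: float = 0.99) -> bool:
    """Write-validate-switch so readers never see a partial batch output.

    `db` stands in for a storage client; count_rows/set_pointer are assumed
    methods, not a real library's API.
    """
    table = f"predictions_v{new_version}"            # e.g. predictions_v124
    if db.count_rows(table) < min_coverage * expected_rows:
        return False        # stragglers left gaps: keep serving the old version
    # Atomic cutover: consumers resolve the current table through this pointer,
    # so they see either the full old snapshot or the full new one.
    db.set_pointer("predictions_current", table)
    return True
```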
Real-time Tail Latency Spikes: Cold starts destroy p99 latency. A new instance takes 5 to 30 seconds to load a large model into memory. During this window, requests either time out or queue up, causing cascading delays. When the instance finally comes online it processes the backlog, but every request has already breached its Service Level Objectives (SLOs). Solution: maintain warm pools. Keep a fraction of capacity always loaded and ready. Yes, you pay for idle capacity, but you buy p99 latency protection. Autoscaling helps with gradual traffic growth but cannot save you from sudden spikes.
Feature Unavailability at Inference Time: Training-serving skew is the nightmare scenario. Your model trains on batch features computed with 24-hour aggregation windows. At serving time, the feature store has a 2-minute replication lag and returns stale or missing features, and model accuracy drops 20% in production compared to offline validation. This is especially painful for real-time systems: a fraud model expects the transactions_last_hour feature, the feature store has a 5-minute lag, and fresh fraud patterns slip through because the model cannot see recent activity. Mitigation: lock down feature definitions with schema validation, compute identical transformations offline and online, and monitor feature freshness and coverage. Have fallback logic for when features are missing: default values, a cached last-known-good value, or degradation to a simpler model (a fallback sketch appears below).
Feedback Loops in Real-time Systems: Real-time decisions change the data distribution immediately. A recommendation model that always shows popular items makes them more popular, creating a filter bubble: new or niche content never gets exposure, and utility collapses over time. Another example: fraud models that block too aggressively train on their own decisions. Legitimate users get blocked and cannot complete transactions, so the model never sees the negative feedback, and the system becomes increasingly aggressive. Solution: inject exploration. Reserve 5 to 10% of traffic for random or diverse recommendations, and cap the feedback loop by limiting how much any one decision can influence future training data.
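A minimal sketch of that exploration injection: an epsilon fraction of requests gets a random slate instead of the model's top picks, so niche items get exposure and training data is not purely model-driven. The function and parameter names are illustrative; the 5 to 10% range comes from the text.
```python
import random

def select_recommendations(model_ranked: list[str], catalog: list[str],
                           k: int = 10, epsilon: float = 0.05) -> list[str]:
    """Serve an exploration slate on an epsilon fraction of requests."""
    if random.random() < epsilon:
        # Uniform random slate: gives new or niche items exposure and keeps
        # future training data from being purely model-driven.
        return random.sample(catalog, k)
    return model_ranked[:k]
```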
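For the feature-unavailability case above, one way the fallback logic could look, assuming a feature-store client that returns a value together with its write timestamp (the client interface, names, and thresholds are assumptions):
```python
import time

def get_feature_with_fallback(store, entity_id: str, feature_name: str,
                              max_age_s: float, default: float) -> float:
    """Fetch an online feature, falling back when it is missing or stale."""
    record = store.get(entity_id, feature_name)  # assumed call: (value, ts) or None
    if record is None:
        return default                           # feature missing entirely
    value, written_at = record
    if time.time() - written_at > max_age_s:
        return default                           # stale beyond the freshness SLO
    return value

# Hypothetical usage for the fraud example:
# txn_count = get_feature_with_fallback(
#     store, user_id, "transactions_last_hour", max_age_s=120, default=0.0)
```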
❗ Remember: Failures in batch jobs degrade silently (stale predictions), but failures in real-time serving are immediately user-visible (timeouts, errors). Design batch for idempotency and versioning. Design real-time for graceful degradation and circuit breaking.
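A toy circuit breaker along the lines of that note: after repeated model-serving failures it routes requests to a fallback path (cached predictions or a simpler model) and only retries the model after a cool-down. Thresholds and structure are illustrative, not a specific library's implementation.
```python
import time
from typing import Optional

class CircuitBreaker:
    """Trip to a fallback path after repeated serving failures."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True                   # closed: call the model normally
        if time.time() - self.opened_at > self.reset_after_s:
            self.opened_at = None         # half-open: let requests retry the model
            self.failures = 0
            return True
        return False                      # open: serve cached or simpler-model fallback

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()
```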
The Version Skew Problem: Hybrid systems have two inference paths, batch and online. When you deploy a new model, the batch job starts using version 2 immediately, but the online ranker still expects the version 1 feature schema, and predictions become nonsensical until both paths align. This gets worse with multiple models in a pipeline: the candidate generator uses model version N, the ranker uses version N+1, and each expects a different feature schema. Debugging is a nightmare because every component looks correct in isolation. Production solution: version everything, including the prediction schema, feature schema, and model version. Enforce that batch and online paths read from the same versioned feature store snapshots, and deploy with atomic cutover: both paths switch to the new version simultaneously, with instant rollback if metrics degrade.
Time To Live and Cache Invalidation: Batch predictions have a Time To Live (TTL). Set it too long and you serve stale predictions; set it too short and you get cache misses that force expensive recomputation. Worse, if your online ranker TTL is 10 minutes but the candidate cache TTL is 1 hour, you keep re-ranking the same stale candidates. The user sees a recommended video list that changes every 10 minutes but still includes videos they already watched, because the candidate set is an hour old. A confusing and broken experience.
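A sketch of a pre-deployment alignment check in the spirit of "version everything": refuse a cutover when the batch and online paths disagree on the feature schema, or when the candidate cache TTL dwarfs the ranker TTL. The config fields and the 3x ratio are illustrative assumptions, not a standard API.
```python
from dataclasses import dataclass

@dataclass
class PathConfig:
    name: str
    feature_schema_version: int   # schema this path's model expects
    output_ttl_s: int             # how long this path's cached output stays valid

def check_hybrid_alignment(candidates: PathConfig, ranker: PathConfig,
                           max_ttl_ratio: float = 3.0) -> None:
    """Refuse a cutover that would introduce schema skew or a TTL mismatch."""
    if candidates.feature_schema_version != ranker.feature_schema_version:
        raise RuntimeError(
            f"feature schema skew: {candidates.name} expects "
            f"v{candidates.feature_schema_version}, {ranker.name} expects "
            f"v{ranker.feature_schema_version}")
    # A 10-minute ranker TTL over a 1-hour candidate cache produces the
    # "changes every refresh but repeats old items" experience described above.
    if candidates.output_ttl_s > max_ttl_ratio * ranker.output_ttl_s:
        raise RuntimeError(
            f"candidate cache TTL ({candidates.output_ttl_s}s) far exceeds "
            f"ranker TTL ({ranker.output_ttl_s}s)")
```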
💡 Key Takeaways
Batch staleness failures occur when real-world state changes faster than the batch refresh cycle (flash sales, breaking news, stock outages)
Straggler tasks from data skew dominate job completion time; 1% of partitions taking 10x longer delays the entire batch, causing partial writes
Cold starts in real-time systems add 5 to 30 seconds of model loading time, destroying p99 latency and causing cascading queue buildup
Training-serving skew from feature lag or schema mismatches drops model accuracy by 20% in production, and is especially painful for real-time fraud detection
Feedback loops in real-time recommendations create filter bubbles; models trained on their own decisions become increasingly biased without exploration
📌 Examples
1. Recommendation batch job with data skew: 99% of users finish in 1 hour, but the 1% with massive histories take 10 hours, causing partial prediction writes
2. Fraud model with a 5-minute feature store lag misses fresh fraud patterns because the transactions_last_hour feature is stale at inference time
3. Version skew in a hybrid system: batch uses model v2 with the new feature schema while the online ranker still expects v1, causing nonsensical predictions until both paths align