Batch vs Real-time: Making the Choice
The Fundamental Trade-off:
This is not about "better" or "worse." It is about marginal value of freshness versus marginal cost and operational complexity. Every second of reduced staleness has a cost. Every nine of availability in your Service Level Agreement (SLA) costs more.
Decision Framework: Four Questions:
First, what is acceptable freshness? If churn prediction for next month can use yesterday's model, batch wins. If payment fraud needs current transaction context, real-time is required.
Second, what is the per-interaction value? Low-value, high-volume interactions (email recommendations, content feeds) favor batch. High-value interactions (fraud gating, ad auctions where milliseconds equal dollars) justify real-time cost.
Third, what is your read-to-write ratio? Write-heavy systems (over 80% writes) like event logs should minimize online compute. Read-heavy systems (over 99% reads) like user profiles can afford online enrichment.
Fourth, can you decompose the problem? Most production systems can. Compute expensive embeddings and candidate sets offline. Do lightweight re-ranking and contextualization online.
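To make the checklist concrete, here is a minimal sketch that encodes the four questions as code. The thresholds, field names, and defaults are illustrative assumptions, not industry standards:

```python
from dataclasses import dataclass

@dataclass
class Workload:
    acceptable_staleness_s: float   # how old can a prediction be and still be useful?
    value_per_interaction: float    # dollars (or a proxy) per served prediction
    read_fraction: float            # reads / (reads + writes), 0.0 to 1.0
    decomposable: bool              # can the heavy compute move offline?

def recommend_serving_mode(w: Workload) -> str:
    """Hypothetical checklist mirroring the four questions above.

    Thresholds are illustrative defaults, not industry constants.
    """
    if w.acceptable_staleness_s >= 24 * 3600:
        return "batch"          # yesterday's model is fine
    if w.decomposable:
        return "hybrid"         # heavy lifting offline, last mile online
    if w.value_per_interaction > 0.01 and w.read_fraction > 0.99:
        return "real-time"      # high value, read-heavy: online pays off
    return "batch"              # default to the cheaper mode

# Example: monthly churn scoring tolerates a day of staleness -> "batch"
print(recommend_serving_mode(Workload(86_400, 0.001, 0.5, False)))
```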
The Hybrid Pattern: Best of Both:
Netflix-style recommendations illustrate this perfectly. An offline batch job computes the top 1000 candidate videos per user daily using heavy models and collaborative filtering; it runs for hours on massive clusters. The online service reads the precomputed candidates (one Redis lookup, under 5 ms), applies real-time filters (recently watched, device type, current session), and re-ranks with a lightweight model in under 100 milliseconds.
Total cost: batch runs once daily, and online only pays for fast lookups and light models. Freshness: candidates refresh daily, contextualization is real-time. This is the pattern at YouTube, Pinterest, and LinkedIn feeds: heavy lifting offline, last mile online.
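A minimal sketch of the online half of this pattern, assuming candidate IDs are stored in Redis under a per-user key and re-ranked by some lightweight model (the key layout and the `light_ranker` interface are hypothetical):

```python
import json
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

def serve_recommendations(user_id: str, session: dict, light_ranker) -> list[str]:
    """Online path: one lookup + cheap filtering + lightweight re-rank."""
    # 1. Single Redis lookup for the batch-precomputed candidate set (~5 ms).
    raw = r.get(f"candidates:{user_id}")
    candidates = json.loads(raw) if raw else []

    # 2. Real-time filters using current session context.
    recently_watched = set(session.get("recently_watched", []))
    candidates = [c for c in candidates if c not in recently_watched]

    # 3. Lightweight re-ranking with per-request features
    #    (must fit inside the ~100 ms latency budget).
    scored = [(light_ranker.score(c, session), c) for c in candidates]
    return [c for _, c in sorted(scored, reverse=True)[:20]]
```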
Cost Reality Check:
Real-time serving can cost 5x to 20x more than batch for the same number of predictions. Why? You pay for peak capacity 24/7, not just the hours you are computing: warm pools to avoid cold-start penalties, redundancy for availability, and networking and orchestration overhead.
Batch scales to zero. Spin up 10,000 cores for 2 hours, process 1 billion predictions, pay for 20,000 core-hours, done. Serving 1 billion predictions in real time at 10,000 per second takes 100,000 seconds (about 28 hours) of work, but you must provision for peak QPS and keep that capacity running continuously.
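A back-of-the-envelope version of that arithmetic, with illustrative assumptions for peak headroom, redundancy, and utilization chosen to land inside the 5x to 20x range above:

```python
predictions = 1_000_000_000
qps = 10_000

# Batch: burst to 10,000 cores for 2 hours, pay 20,000 core-hours, scale to zero.
batch_core_hours = 10_000 * 2

# Real-time: the same volume at 10,000 predictions/s is 100,000 s of work,
# but the fleet is sized for peak, duplicated for availability, and runs
# well below full utilization.
serving_hours = predictions / qps / 3600   # ~27.8 hours of wall-clock serving
peak_headroom = 2.0       # illustrative: 2x average capacity for traffic peaks
redundancy = 2            # illustrative: a second replica set for availability
avg_utilization = 0.35    # illustrative: serving fleets rarely run hot

cost_multiplier = peak_headroom * redundancy / avg_utilization
print(f"{serving_hours:.1f} h of serving, ~{cost_multiplier:.0f}x cost vs batch")
```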
When Hybrid Breaks Down:
Hybrid assumes stable batch components and volatile online context. This fails when the stable part becomes volatile. Example: news recommendation during breaking events. Precomputed candidates from this morning miss the story everyone wants now. You need either very frequent batch refreshes (every 15 minutes, expensive) or shift more logic online (complex).
Another failure: version skew. Online ranker expects feature schema version N+1 while batch produced N. Predictions become garbage. Mitigation: enforce version pinning and atomic rollouts.
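A lightweight way to enforce that pinning at read time; the schema-version field and error type here are illustrative, and feature stores or model registries typically provide equivalents:

```python
EXPECTED_FEATURE_SCHEMA = "v42"   # pinned at model deploy time (illustrative)

class SchemaSkewError(RuntimeError):
    pass

def load_candidates(record: dict) -> dict:
    """Refuse to serve if batch output was built against a different schema."""
    version = record.get("feature_schema")
    if version != EXPECTED_FEATURE_SCHEMA:
        # Fail closed: garbage predictions are worse than a fallback.
        raise SchemaSkewError(
            f"expected {EXPECTED_FEATURE_SCHEMA}, got {version}"
        )
    return record["features"]
```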
Batch Inference: cheap per prediction, hours to days stale. Best for: slow-decay utility, population scoring.
Real-time Inference: expensive, milliseconds to seconds fresh. Best for: fast-decay utility, per-request context.
"The decision is not 'can we afford real-time?' It is 'what is the minimum freshness that still achieves business outcomes?' Start with the most relaxed freshness, then tighten only where value justifies cost."
⚠️ Common Pitfall: Teams often default to real-time because it "feels modern" without quantifying the actual freshness requirement. Measure the business impact of 1 hour staleness versus 1 minute staleness. Often the difference is negligible but the cost difference is 10x.
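One way to run that measurement, sketched with pandas; the log path and column names are hypothetical:

```python
import pandas as pd

# Hypothetical log of served predictions joined to outcomes, with columns:
# staleness_s (age of the model output at serve time) and converted (bool).
df = pd.read_parquet("serving_log.parquet")   # illustrative path
df["bucket"] = pd.cut(df["staleness_s"], [0, 60, 3600, 86_400],
                      labels=["<1 min", "<1 h", "<1 day"])
print(df.groupby("bucket", observed=True)["converted"].mean())
# If conversion is flat across buckets, pay for the cheapest bucket.
```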
💡 Key Takeaways
✓ Decision framework: acceptable freshness, per-interaction value, read-to-write ratio, and decomposability determine the batch versus real-time choice
✓ The hybrid pattern is the production standard: compute expensive embeddings and candidate sets offline, do lightweight contextualization and re-ranking online
✓ Real-time serving costs 5x to 20x more than batch due to always-on capacity, warm pools, redundancy, and provisioning for peak traffic
✓ Batch scales to zero cost when idle; real-time requires continuous capacity even during low-traffic periods
✓ Choose the most relaxed freshness that achieves business outcomes, then tighten only where marginal value justifies the marginal cost
📌 Examples
1. Netflix computes the top 1000 candidates per user daily in batch, then the online service does a 5 ms Redis lookup plus lightweight re-ranking in under 100 ms
2. News recommendations during breaking events expose hybrid limits: precomputed candidates miss trending stories, requiring frequent batch refreshes or online candidate generation
3. Ad auctions justify real-time cost because milliseconds of latency directly impact revenue; email campaigns tolerate 24-hour batch staleness with no business impact