Batch Inference: Throughput Over Latency
The Batch Philosophy:
Batch inference says "I will wait to accumulate work, then blast through it all at once using massive parallelism." You are not optimizing for how fast one prediction completes. You are optimizing for how many predictions per dollar and how efficiently you use compute.
The Execution Model:
Batch jobs are embarrassingly parallel and bursty. You spin up thousands of Central Processing Unit (CPU) cores or Graphics Processing Unit (GPU) instances, partition your dataset across them, process everything in a coordinated window, write results, then shut down to zero.
Consider a recommendation system generating candidates for 500 million users. You partition users into 10,000 chunks of 50,000 each. Each worker loads the model once, streams through its chunk, and writes predictions to storage. The entire job runs for 90 minutes using 10,000 cores, then terminates. Total cost: compute time only, no idle capacity.
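To make the execution model concrete, the sketch below walks through one worker's loop under the numbers above: the user base is split into 10,000 contiguous partitions and each worker streams its partition in micro-batches. The hash-based scoring function and the in-memory "write" are illustrative stand-ins for a real model and storage sink, not any particular system.

```python
from typing import Iterator

TOTAL_USERS = 500_000_000
NUM_PARTITIONS = 10_000
MICRO_BATCH = 4_096            # rows fed to the model per forward pass

def partitions(total: int, n: int) -> Iterator[range]:
    """Split [0, total) user ids into n contiguous chunks."""
    size = total // n
    for i in range(n):
        yield range(i * size, total if i == n - 1 else (i + 1) * size)

def toy_score(user_id: int) -> float:
    """Stand-in for a real model's forward pass (hash-based pseudo-score)."""
    return (user_id * 2654435761 % 2**32) / 2**32

def run_worker(chunk: range) -> int:
    """One worker: model loaded once, streams its chunk, flushes micro-batches."""
    written = 0
    buffer: list[tuple[int, float]] = []
    for user_id in chunk:
        buffer.append((user_id, toy_score(user_id)))
        if len(buffer) == MICRO_BATCH:
            written += len(buffer)      # in production: flush to object storage / KV store
            buffer.clear()
    written += len(buffer)              # flush the final partial batch
    return written

if __name__ == "__main__":
    first_chunk = next(partitions(TOTAL_USERS, NUM_PARTITIONS))
    print(f"users per partition: {len(first_chunk):,}")                 # 50,000
    print(f"predictions written by worker 0: {run_worker(first_chunk):,}")
```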
When Batch Wins:
Batch is ideal when utility decays slowly. Churn prediction for next month does not need to update every second. Weekly email campaign targeting can use predictions computed overnight. Content moderation backfills can run on 24-hour windows. The key question: does freshness matter enough to justify 5x to 20x higher cost?
Production Pattern: Prediction Store:
The standard architecture writes predictions to a key-value store indexed by entity. Schema:
user_id → {prediction_scores, model_version, timestamp, ttl}
Applications read predictions by key lookup, never recomputing. This decouples inference cost from serving queries per second (QPS). For Large Language Model workloads, batch serving achieves 3x to 10x better GPU utilization than real-time serving, with completion SLAs of up to 24 hours.
For example, YouTube might materialize top 1000 candidate video IDs per user daily. When you open the app, the service reads your precomputed list (1 Redis lookup, under 5ms), applies online filters, and returns results. Zero inference compute on the hot path.
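A minimal sketch of that hot path, with a plain Python dict standing in for Redis and invented field names (candidates, model_version, ttl_hours) rather than any real production schema: the request does one key lookup, applies an online filter, and never touches the model.

```python
import json
import time

# Prediction store materialized by the nightly batch job (dict stands in for Redis).
prediction_store: dict[str, str] = {
    "user:42": json.dumps({
        "candidates": [9301, 1207, 5544, 8810],   # precomputed top-N video ids
        "model_version": "v123",
        "generated_at": "2024-05-01T03:00:00Z",
        "ttl_hours": 24,
    })
}

def serve_recommendations(user_id: str, already_watched: set[int]) -> list[int]:
    """Hot path: one key lookup plus cheap online filters, no model inference."""
    record = prediction_store.get(f"user:{user_id}")       # Redis GET in production
    if record is None:
        return []                                          # fall back to a default feed
    candidates = json.loads(record)["candidates"]
    return [video for video in candidates if video not in already_watched]

start = time.perf_counter()
result = serve_recommendations("42", already_watched={1207})
print(result, f"({(time.perf_counter() - start) * 1000:.3f} ms)")   # well under 5 ms
```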
The Straggler Problem:
A few partitions often dominate compute time due to data skew. Maybe 95% of users finish in 60 minutes, but the 5% with massive histories take 3 hours. Your job's completion time is set by the slowest partition. Mitigation: dynamic repartitioning, speculative execution for slow tasks, or capping per-entity work.
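One way to implement the repartitioning idea is a greedy pass over a per-user cost estimate (history length, say) that caps per-entity work and keeps every partition under a fixed budget. The cost proxy, budget, and numbers below are illustrative assumptions, not tuned values.

```python
def repartition_by_cost(user_costs: dict[str, int], budget: int) -> list[list[str]]:
    """Greedy packing: keep each partition's total estimated cost under budget."""
    partitions: list[list[str]] = [[]]
    current_cost = 0
    # Heaviest users first, so the big ones don't all land in one trailing partition.
    for user, cost in sorted(user_costs.items(), key=lambda kv: -kv[1]):
        capped = min(cost, budget)          # count at most `budget` per user (per-entity cap)
        if current_cost + capped > budget and partitions[-1]:
            partitions.append([])           # start a new partition
            current_cost = 0
        partitions[-1].append(user)
        current_cost += capped
    return partitions

costs = {"u1": 50, "u2": 3, "u3": 900, "u4": 7, "u5": 40}   # e.g. events per user history
for i, part in enumerate(repartition_by_cost(costs, budget=100)):
    print(f"partition {i}: {part}")   # the heavy user u3 is isolated in its own partition
```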
⚠️ Common Pitfall: Partial materialization failures leave a mix of old and new predictions in storage. Consumers see inconsistent results. Use versioned snapshots: write to predictions_v123, validate, then atomically flip consumers to the new version.
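A minimal sketch of that versioned flip, with an in-memory dict standing in for the key-value store and a simple row-count check as an assumed validation rule: writers stage the new snapshot under its own namespace, and consumers only ever read through a single "current" pointer.

```python
store: dict[str, object] = {"predictions:current": "v122"}   # pointer to the live snapshot

def write_snapshot(version: str, predictions: dict[str, list[float]]) -> None:
    """Stage a new snapshot under its own versioned namespace (not yet live)."""
    for user_id, scores in predictions.items():
        store[f"predictions:{version}:{user_id}"] = scores

def validate(version: str, expected_rows: int) -> bool:
    """Assumed check: the snapshot contains at least the expected number of rows."""
    prefix = f"predictions:{version}:"
    return sum(key.startswith(prefix) for key in store) >= expected_rows

def flip(version: str) -> None:
    """Single pointer swap; consumers switch versions atomically."""
    store["predictions:current"] = version

def read(user_id: str) -> object | None:
    """Consumers always resolve the current pointer, never a mix of versions."""
    version = store["predictions:current"]
    return store.get(f"predictions:{version}:{user_id}")

write_snapshot("v123", {"u1": [0.9, 0.1], "u2": [0.3, 0.7]})
if validate("v123", expected_rows=2):        # only go live if the snapshot is complete
    flip("v123")
print(read("u1"))                            # [0.9, 0.1], served from v123
```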
💡 Key Takeaways
✓ Batch maximizes throughput per dollar by using massive parallelism in short bursts, then shutting down to zero idle cost
✓ Predictions are materialized in a prediction store keyed by entity, decoupling inference cost from serving QPS
✓ Batch is ideal when freshness requirements are relaxed (hours to days) and utility decays slowly, like churn prediction or weekly targeting
✓ Large Language Model batch workloads achieve 3x to 10x better GPU utilization compared to real-time serving, with completion SLAs up to 24 hours
✓ Straggler tasks due to data skew can dominate job completion time; use dynamic repartitioning or speculative execution to mitigate
📌 Examples
1. OpenAI and major cloud providers offer batch APIs with up to 24-hour completion windows for Large Language Model jobs, trading latency for a 50% cost reduction
2. YouTube-style systems materialize the top 1000 candidate videos per user daily, enabling 5ms Redis lookups at serving time with zero inference compute
3. Recommendation batch job: 10,000 workers process 500M users in 90 minutes, write 50B predictions to storage, then terminate to eliminate idle capacity cost