
Batch vs Real-Time Inference: Core Trade-offs and When to Use Each

Batch inference and real-time inference optimize for fundamentally different objectives. Batch inference maximizes throughput and unit-cost efficiency by processing large datasets on a schedule, whether hourly, daily, or weekly. The Service Level Agreement (SLA) is measured in job turnaround time spanning minutes to hours, not per-request latency. These workloads are embarrassingly parallel and bursty: you spin up thousands of cores or Graphics Processing Units (GPUs) for a short window to clear billions of records, then scale everything back down to zero.

Real-time inference optimizes for tail latency and freshness. The SLA is per-request p95 or p99 latency, typically 5 to 100 milliseconds for traditional machine learning models; for Large Language Models (LLMs), time to first token can extend to a few hundred milliseconds. These systems must remain always on, absorb diurnal traffic patterns and event-driven spikes, and handle cascading dependencies, such as feature reads, candidate retrieval, and model scoring, within a strict latency budget.

Most production machine learning uses a hybrid approach: compute heavy, slow-to-change signals and candidate sets offline in batch jobs, then do lightweight contextualization or re-ranking online. The central engineering decision boils down to the marginal value of freshness versus the marginal cost and operational complexity of serving live models. A useful framing: choose the most relaxed freshness requirement that still meets your business outcomes, then engineer the simplest system that delivers it.
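The hybrid pattern can be made concrete with a short sketch. Everything here is hypothetical (the `PRECOMPUTED_CANDIDATES` store, `rerank_online`, the toy scoring); it only illustrates the shape of offline precompute plus a lightweight online step inside a latency budget, not any particular system's implementation.

```python
import time
from typing import Dict, List

# Hypothetical store; in production this would be a key-value cache
# (e.g. Redis) populated by the nightly batch job.
PRECOMPUTED_CANDIDATES: Dict[str, List[str]] = {
    "user_42": ["item_a", "item_b", "item_c"],  # written offline in batch
}

def rerank_online(candidates: List[str], session_context: Dict) -> List[str]:
    """Lightweight online step: re-rank the offline candidate set using
    fresh session signals, staying inside a strict latency budget."""
    # Toy scoring: boost items matching the current session intent string.
    intent = session_context.get("intent", "")
    return sorted(candidates, key=lambda item: intent in item, reverse=True)

def serve_recommendations(user_id: str, session_context: Dict,
                          latency_budget_ms: float = 100.0) -> List[str]:
    start = time.perf_counter()
    # Heavy, slow-to-change work was done offline; online we only look up.
    candidates = PRECOMPUTED_CANDIDATES.get(user_id, [])
    ranked = rerank_online(candidates, session_context)
    elapsed_ms = (time.perf_counter() - start) * 1000
    assert elapsed_ms < latency_budget_ms, "blew the online latency budget"
    return ranked

print(serve_recommendations("user_42", {"intent": "b"}))
# -> ['item_b', 'item_a', 'item_c']
```

The design point is that the online path does no heavy model work at all: it reads precomputed results and applies cheap, context-aware adjustments, which is what keeps the per-request latency bounded.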
💡 Key Takeaways
Batch inference optimizes for throughput and cost, processing billions of records with 3 to 10x better GPU utilization than real-time, but predictions become stale within hours
Real-time inference optimizes for tail latency and freshness, keeping p99 latency under 100 milliseconds for traditional models, but costs 10 to 100x more per prediction due to always-on capacity (see the back-of-envelope comparison after this list)
Choose batch when acceptable freshness is measured in hours or days, such as weekly churn propensity lists or daily recommendation candidate generation at Netflix scale
Choose real-time when per-interaction value is high and wrong or late decisions incur immediate loss, such as payment fraud gating that must decide within 50 milliseconds or ad auctions with 100-millisecond budgets
Most production systems use hybrid architectures: precompute heavy candidate sets and embeddings offline, then re-rank with lightweight context online, as seen in YouTube and LinkedIn feed ranking
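To see why always-on capacity dominates cost per prediction, a back-of-envelope comparison helps. All numbers below (GPU price, fleet sizes, utilization, volumes) are illustrative assumptions, not measurements; the point is that batch runs hardware near saturation while a real-time fleet sized for peak idles off-peak.

```python
# Back-of-envelope comparison (illustrative numbers, not benchmarks).
GPU_HOURLY_COST = 2.50  # assumed $/GPU-hour

# Batch: 1B predictions cleared by 100 GPUs in 4 hours near saturation,
# then the fleet scales back down to zero.
batch_predictions = 1_000_000_000
batch_cost = 100 * 4 * GPU_HOURLY_COST
batch_cost_per_million = batch_cost / (batch_predictions / 1e6)

# Real-time: 20 GPUs always on for 24h, provisioned for peak traffic,
# so diurnal dips leave them mostly idle and they serve far fewer predictions.
rt_predictions = 50_000_000
rt_cost = 20 * 24 * GPU_HOURLY_COST
rt_cost_per_million = rt_cost / (rt_predictions / 1e6)

print(f"batch:     ${batch_cost_per_million:.2f} per 1M predictions")  # $1.00
print(f"real-time: ${rt_cost_per_million:.2f} per 1M predictions")     # $24.00
# The ~24x gap here is driven almost entirely by utilization, which is
# how the 10-100x per-prediction cost range arises in practice.
```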
📌 Examples
Netflix precomputes top N recommendation rows per member daily in batch (processing 200M+ users), then re-ranks online with session context in under 100 milliseconds
Stripe fraud detection scores transactions in real time with p99 latency under 50 milliseconds (handling 5,000 to 50,000 queries per second during peaks), while nightly batch jobs update risk aggregates and velocity features
OpenAI's Batch API accepts large jobs with up to a 24-hour completion window, offering 50% cost savings over the real-time API by maximizing GPU batching and utilization (a minimal request sketch follows)
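For concreteness, here is roughly what submitting a job to OpenAI's Batch API looks like with the official Python SDK. The calls shown (`files.create`, `batches.create`) match the documented API at the time of writing, but treat the snippet as a sketch and check the current docs before relying on it.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# requests.jsonl holds one request per line, e.g.:
# {"custom_id": "1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "..."}]}}
batch_input = client.files.create(
    file=open("requests.jsonl", "rb"),
    purpose="batch",
)

# The 24h completion window is what buys the discount: the provider is
# free to schedule the work whenever GPU batching is most efficient.
job = client.batches.create(
    input_file_id=batch_input.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(job.id, job.status)  # poll client.batches.retrieve(job.id) until done
```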