
What is Batch vs Real-time Inference?

Definition
Batch inference processes large datasets on a schedule (hourly/daily) to generate predictions in bulk. Real-time inference generates predictions on demand in milliseconds for immediate use.
The Core Difference: Batch inference is like cooking meals in advance for the entire week. You spend a few hours on Sunday preparing everything, store it, and consume it later. Real-time inference is like ordering from a restaurant: you request exactly what you want right now, and it arrives in minutes.

How Batch Inference Works: You spin up a large fleet of compute resources, process millions or billions of records, write predictions to storage (often called a prediction store), then shut down. The predictions are keyed by an entity such as user_id or item_id, with a Time To Live (TTL). Applications read these precomputed predictions later with zero compute on the hot path. For example, a recommendation system might compute the top 100 videos for each of 200 million users every night. That is 20 billion predictions written to storage. When a user opens the app, you simply look up their precomputed list.

How Real-time Inference Works: An always-on service receives requests, loads the model, fetches features, runs inference, and returns predictions within strict latency budgets. Think payment fraud detection: when you click "Buy Now", the system must score the transaction in under 50 milliseconds to decide whether to approve or block it. The system must handle burst traffic, maintain low tail latency (p95/p99), and stay online 24/7. No precomputation, no storage lookup. A fresh prediction every single time.
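To make the batch pattern concrete, here is a minimal sketch, assuming Python with an in-memory dict standing in for a real prediction store such as Redis or DynamoDB; score_top_videos and the other names are hypothetical placeholders. The scheduled job scores every user in bulk and materializes results keyed by user_id with a TTL, so the hot path becomes a plain key lookup.

```python
import json
import time

# Stand-in for a key-value prediction store (e.g. Redis or DynamoDB in production).
# Each entry maps "recs:<user_id>" -> (expiry timestamp, JSON-encoded predictions).
PREDICTION_STORE: dict[str, tuple[float, str]] = {}
TTL_SECONDS = 24 * 60 * 60  # predictions expire after one day

def score_top_videos(user_id: str) -> list[str]:
    """Placeholder for the expensive model pass that runs inside the batch job."""
    return [f"video_{(hash((user_id, rank)) % 1000):03d}" for rank in range(100)]

def run_batch_job(user_ids: list[str]) -> None:
    """Nightly job: score every user in bulk and write predictions with a TTL."""
    now = time.time()
    for user_id in user_ids:
        payload = json.dumps(score_top_videos(user_id))
        PREDICTION_STORE[f"recs:{user_id}"] = (now + TTL_SECONDS, payload)

def lookup_recommendations(user_id: str) -> list[str] | None:
    """Hot path: a cheap key lookup, with zero model compute at request time."""
    entry = PREDICTION_STORE.get(f"recs:{user_id}")
    if entry is None or entry[0] < time.time():
        return None  # missing or expired; fall back to a popular/default list
    return json.loads(entry[1])

run_batch_job([f"user_{i}" for i in range(1_000)])
print(lookup_recommendations("user_42")[:5])
```

The design point is that all the expensive work happens inside run_batch_job on a schedule; the per-request cost is a single store read plus a TTL check.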
✓ In Practice: Most production systems use both. Compute expensive signals offline in batch, then do lightweight contextualization online. Netflix computes candidate videos in batch but re-ranks them in real-time using your current session.
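A minimal sketch of that hybrid step, assuming the precomputed candidates come from a batch prediction store like the one above and using a toy topic boost to stand in for a small real-time re-ranking model (all names hypothetical):

```python
def rerank_with_session(candidates: list[str], session_topics: set[str]) -> list[str]:
    """Cheap online step: boost precomputed candidates that match the live session.

    The expensive candidate generation already happened offline in the batch job;
    this toy topic boost stands in for a small real-time re-ranking model.
    """
    def score(video_id: str) -> float:
        topic = video_id.split("_")[0]  # e.g. "cooking_101" -> "cooking"
        return 2.0 if topic in session_topics else 1.0

    # sorted() is stable, so the offline ordering is preserved within each tier.
    return sorted(candidates, key=score, reverse=True)

# Candidates were materialized offline; the session context is only known at request time.
batch_candidates = ["travel_7", "cooking_101", "news_3", "cooking_55"]
print(rerank_with_session(batch_candidates, session_topics={"cooking"}))
# -> ['cooking_101', 'cooking_55', 'travel_7', 'news_3']
```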
💡 Key Takeaways
Batch inference optimizes for throughput and cost efficiency by processing large datasets on a schedule, with Service Level Agreements (SLAs) measured in job completion time (minutes to hours)
Real-time inference optimizes for tail latency and freshness, with SLAs measured in per-request p95/p99 latency (typically 5 to 100ms for traditional models)
Batch predictions are materialized into a prediction store keyed by entity with TTL, consumed later with no compute on the hot path
Real-time systems must be always on, handle traffic spikes, and manage cascading dependencies within strict latency budgets (see the sketch after this list)
Most production ML uses a hybrid approach: compute expensive signals offline in batch, then do lightweight contextualization online to balance cost and freshness
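To make the real-time SLA side concrete, here is a minimal sketch, assuming a fraud-style scoring service with a 50ms per-request budget; fetch_features and score_risk are hypothetical placeholders for a feature-store lookup and the real model, and p95/p99 are computed over the recorded request latencies.

```python
import random
import statistics
import time

LATENCY_BUDGET_MS = 50.0          # per-request SLA, e.g. for fraud scoring
_latencies_ms: list[float] = []   # in production this would feed a metrics system

def fetch_features(transaction_id: str) -> dict[str, float]:
    """Placeholder for a feature-store lookup on the hot path."""
    return {"amount": random.uniform(1, 500), "account_age_days": random.uniform(0, 3650)}

def score_risk(features: dict[str, float]) -> float:
    """Placeholder for the fraud model; returns a risk score in [0, 1]."""
    return min(1.0, 0.8 * features["amount"] / 500 + 0.2 / (1 + features["account_age_days"]))

def handle_request(transaction_id: str) -> str:
    """Score one transaction within the latency budget and record the latency."""
    start = time.perf_counter()
    risk = score_risk(fetch_features(transaction_id))
    elapsed_ms = (time.perf_counter() - start) * 1000
    _latencies_ms.append(elapsed_ms)
    if elapsed_ms > LATENCY_BUDGET_MS:
        return "approve"          # budget blown: fail open (or closed, per policy)
    return "block" if risk > 0.9 else "approve"

for i in range(1_000):
    handle_request(f"txn_{i}")

cuts = statistics.quantiles(_latencies_ms, n=100)  # 99 percentile cut points
print(f"p95={cuts[94]:.3f}ms  p99={cuts[98]:.3f}ms")
```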
📌 Examples
1. Netflix computes top 100 candidate videos per user daily in batch (200M users × 100 videos = 20B predictions), then re-ranks online with session context in under 100ms
2. Payment fraud detection scores transactions in real-time within 50ms to block or approve immediately, while nightly batch jobs update risk aggregates
3. Ad auction bidders keep model scoring under 5 to 20ms to fit within the 100ms exchange deadline, handling tens of thousands of queries per second