What is Batch vs Real-time Inference?
The Core Difference
Batch inference is like cooking meals in advance for the entire week. You spend a few hours on Sunday preparing everything, store it, and consume it later. Real-time inference is like ordering from a restaurant: you request exactly what you want right now, and it arrives in minutes.
How Batch Inference Works
You spin up a large fleet of compute resources, process millions or billions of records, write predictions to storage (often called a prediction store), then shut down. The predictions are keyed by an entity identifier such as user_id or item_id and stored with a Time To Live (TTL). Applications read these precomputed predictions later, with zero model compute on the hot path.
For example, a recommendation system might compute top 100 videos for each of 200 million users every night. That is 20 billion predictions written to storage. When a user opens the app, you simply look up their precomputed list.
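The batch pattern above can be sketched in a few lines: a nightly job scores every entity and writes results to a prediction store with a TTL, and the hot path is a pure storage read. The model, the dict-based store, and all names here are hypothetical stand-ins; a production system might use something like Spark for the job and Redis for the store.

```python
import time

TTL_SECONDS = 24 * 60 * 60  # predictions expire after one day

def score(user_id: int) -> list[int]:
    # Stand-in for a real recommendation model: derives "top videos"
    # deterministically from the user ID.
    return [(user_id * 7 + k) % 1000 for k in range(3)]

def run_batch_job(user_ids, store: dict) -> None:
    # The nightly job: score every entity in bulk and write
    # (predictions, expiry) keyed by user_id.
    now = time.time()
    for uid in user_ids:
        store[uid] = (score(uid), now + TTL_SECONDS)

def lookup(store: dict, user_id: int):
    # The hot path: a storage read only, no model compute.
    entry = store.get(user_id)
    if entry is None:
        return None
    predictions, expires_at = entry
    return predictions if time.time() < expires_at else None

store = {}
run_batch_job(range(5), store)
print(lookup(store, 3))  # → [21, 22, 23], precomputed for user 3
```

Note that when a user opens the app, only `lookup` runs; all the expensive scoring happened hours earlier in `run_batch_job`.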
How Real-time Inference Works
An always-on service receives requests, loads the model, fetches features, runs inference, and returns predictions within strict latency budgets. Think payment fraud detection: when you click "Buy Now", the system must score the transaction in under 50 milliseconds to decide whether to approve or block it. The system must handle burst traffic, maintain low tail latency (p95/p99), and stay online 24/7. No precomputation, no storage lookup. Fresh prediction every single time.
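The real-time hot path can be sketched as a single timed handler: fetch features, run the model, and decide, all under a latency budget. Here `fetch_features` and `fraud_score` are hypothetical stand-ins (a real service would query a feature store and call a trained classifier), and the tail-latency check at the end mirrors the p99 monitoring described above.

```python
import time

BUDGET_MS = 50.0  # the strict per-request latency budget

def fetch_features(txn_id: int) -> dict:
    # Stand-in for a feature-store read.
    return {"amount": (txn_id * 13) % 500, "country_match": txn_id % 2 == 0}

def fraud_score(features: dict) -> float:
    # Toy rule-based model standing in for a real fraud classifier.
    score = features["amount"] / 500.0
    if not features["country_match"]:
        score += 0.3
    return min(score, 1.0)

def handle_request(txn_id: int) -> tuple[bool, float]:
    # The hot path: fetch features, run inference, decide — all timed.
    start = time.perf_counter()
    features = fetch_features(txn_id)
    approve = fraud_score(features) < 0.8
    elapsed_ms = (time.perf_counter() - start) * 1000
    return approve, elapsed_ms

# Measure tail latency (p99) over a burst of requests.
latencies = sorted(handle_request(t)[1] for t in range(1000))
p99 = latencies[int(0.99 * len(latencies))]
print(f"p99 = {p99:.3f} ms, within budget: {p99 < BUDGET_MS}")
```

Unlike the batch lookup, every call to `handle_request` runs the model fresh, which is why the service must stay provisioned and online around the clock.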