
Real-time Inference: Latency Under Pressure

The Real-time Challenge: Real-time inference means you have 5 to 100 milliseconds to fetch features, run a model, and return a prediction, under sustained queries per second (QPS) with traffic spikes. Miss your latency budget and users see loading spinners, transactions time out, or ads fail to render. The math is unforgiving: at 10,000 QPS with a 50ms p99 latency target, you need enough concurrency to absorb bursts without queueing. Concurrency ≈ QPS × p99 latency, so 10,000 × 0.05s = 500 concurrent requests in flight. Provision too few instances and your p99 explodes. Provision too many and you pay for idle capacity.

The Latency Budget Breakdown: Consider a payment fraud check with a 50ms total budget. You might allocate: feature reads from the database (15ms), model inference (10ms), safety checks (5ms), network overhead (10ms), buffer (10ms). Every component must respect its sub-budget or the whole system breaches its Service Level Objectives (SLOs). Ad auction systems are even tighter: exchanges enforce roughly 100ms end-to-end, so bidders keep model scoring under 5 to 20ms to leave room for network hops, candidate retrieval, and feature fetch. At Meta or Google scale, that is tens of thousands of QPS sustained, with sharp event-driven spikes.
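To make the arithmetic concrete, here is a minimal Python sketch of the same back-of-the-envelope check: required concurrency via Little's Law, plus a latency-budget allocation that must fit inside the end-to-end SLO. The function name and component labels are illustrative assumptions that mirror the fraud-check example above, not a prescribed implementation.

```python
# Back-of-the-envelope capacity and latency-budget check.
# Numbers mirror the fraud-check example in the text; they are illustrative.

def required_concurrency(qps: float, p99_latency_s: float) -> int:
    """Little's Law: requests in flight ~= arrival rate x time each request holds a slot."""
    return int(qps * p99_latency_s)

# 10,000 QPS at a 50ms p99 target -> ~500 concurrent requests in flight.
print(required_concurrency(qps=10_000, p99_latency_s=0.050))  # 500

# Latency budget allocation for a 50ms payment fraud check (milliseconds).
BUDGET_MS = 50
allocation_ms = {
    "feature_reads": 15,
    "model_inference": 10,
    "safety_checks": 5,
    "network_overhead": 10,
    "buffer": 10,
}
assert sum(allocation_ms.values()) <= BUDGET_MS, "component budgets exceed the end-to-end SLO"
```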
Typical Real-time Latency Budgets
Model scoring: 5 to 20ms · Total p99: 50 to 100ms
Micro-batching: The Secret Weapon: GPU inference is expensive per request but efficient in batches. Micro-batching waits 5 to 20 milliseconds to accumulate a small batch (2 to 16 requests), then scores them together. This improves GPU utilization 3x to 5x while keeping p99 latency tolerable. The trade-off: waiting adds base latency. If your budget is 50ms and you wait 10ms to batch, you have only 40ms left for actual inference. Tune the batch window and batch size based on traffic patterns and latency constraints, as sketched in the code below.

When Real-time Is Worth It: Choose real-time when per-interaction value is high and wrong or late decisions incur immediate loss. Payment fraud costs merchants chargebacks and fees. Ad auctions lose revenue if you cannot bid in time. Ride dispatch degrades if ETA predictions are stale by minutes. But real-time is expensive: always-on capacity, warm pools to avoid cold-start penalties (5 to 30 seconds to load models), and redundancy for availability. You are paying for p99 performance 24/7, even during low-traffic hours.
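The following is a minimal sketch of the micro-batching mechanism described above, using Python's asyncio: requests queue up, and a background loop scores them together once the batch fills or the wait window expires. The MicroBatcher class, model_fn, and the 16-request / 10ms defaults are assumptions for illustration; production servers (for example, dynamic batching in NVIDIA Triton) implement the same idea with far more tuning knobs.

```python
import asyncio
import time

class MicroBatcher:
    """Accumulate requests for up to max_wait_ms (or max_batch_size), then score them in one call."""

    def __init__(self, model_fn, max_batch_size: int = 16, max_wait_ms: float = 10):
        self.model_fn = model_fn            # scores a list of inputs in a single call
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_ms / 1000
        self.queue: asyncio.Queue = asyncio.Queue()

    async def predict(self, x):
        """Enqueue one request and wait for its individual score."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut

    async def run(self):
        """Background loop: collect a batch, score it, resolve each caller's future."""
        while True:
            x, fut = await self.queue.get()          # block until at least one request arrives
            batch, futures = [x], [fut]
            deadline = time.monotonic() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    x, fut = await asyncio.wait_for(self.queue.get(), timeout)
                except asyncio.TimeoutError:
                    break
                batch.append(x)
                futures.append(fut)
            for f, score in zip(futures, self.model_fn(batch)):
                f.set_result(score)

# Example usage with a hypothetical model_fn that scores a whole batch at once.
async def main():
    batcher = MicroBatcher(model_fn=lambda xs: [len(x) for x in xs])
    asyncio.create_task(batcher.run())
    print(await asyncio.gather(*(batcher.predict(f"req-{i}") for i in range(4))))

asyncio.run(main())
```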
❗ Remember: Tail latency is what matters. Averages hide outliers. Optimize for p95 and p99, not p50. Use per-component timeouts and circuit breakers so one slow dependency cannot cascade into failures across the whole system.
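As one way to enforce those per-component timeouts, here is a hedged sketch using asyncio.wait_for: each stage gets its own sub-budget and degrades to a fallback instead of letting a slow dependency consume the whole 50ms. The names fetch_features, score, and the fallback values are hypothetical stand-ins, not a real API.

```python
import asyncio

# Hypothetical fallbacks: degrade gracefully rather than time out end-to-end.
DEFAULT_FEATURES = {}
ALLOW_WITH_REVIEW = {"decision": "allow", "flag_for_review": True}

async def fetch_features(request):     # stand-in for a feature-store read
    await asyncio.sleep(0.005)
    return {"amount": request["amount"]}

async def score(features):             # stand-in for model inference
    await asyncio.sleep(0.004)
    return {"decision": "allow", "flag_for_review": False}

async def guarded(coro, timeout_ms: float, fallback):
    """Give one pipeline stage its own sub-budget; return a fallback on timeout."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_ms / 1000)
    except asyncio.TimeoutError:
        return fallback

async def fraud_check(request):
    features = await guarded(fetch_features(request), timeout_ms=15, fallback=DEFAULT_FEATURES)
    return await guarded(score(features), timeout_ms=10, fallback=ALLOW_WITH_REVIEW)

print(asyncio.run(fraud_check({"amount": 42.0})))
```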
💡 Key Takeaways
Real-time inference requires strict per-request p95/p99 latency targets (5 to 100ms for traditional models), with always-on capacity to handle sustained and spiky traffic
Latency budgets must be allocated across components: feature reads (15ms), model scoring (10ms), safety checks (5ms), with buffers for network and outliers
Micro-batching waits 5 to 20ms to accumulate small batches (2 to 16 requests), improving GPU utilization 3x to 5x while meeting latency constraints
Concurrency requirements scale with QPS and latency: at 10,000 QPS with 50ms p99, you need approximately 500 concurrent request slots provisioned
Choose real-time when per-interaction value is high and decisions must be immediate (fraud detection, ad auctions, dispatch), accepting 5x to 20x higher cost than batch
📌 Examples
1. Payment fraud scoring at Stripe completes in under 50ms p99 while handling 5,000 to 50,000 QPS during peak checkout hours
2. Ad bidders keep model inference under 5 to 20ms to fit within the 100ms exchange deadline, handling tens of thousands of sustained QPS with event spikes
3. Uber dispatch and ETA predictions complete in under 50 to 100ms p95 to keep app interactions snappy, blending batch aggregates with nearline features