Real-time Inference: Latency Under Pressure
The Real-time Challenge
Real-time inference means you have 5 to 100 milliseconds to fetch features, run a model, and return a prediction, at sustained queries per second (QPS) with traffic spikes. Miss your latency budget and users see loading spinners, transactions time out, or ads fail to render. The math is unforgiving: by Little's law, concurrency ≈ QPS × latency, so at 10,000 QPS with a 50 ms p99 latency target you need roughly 10,000 × 0.05 = 500 concurrent requests in flight to absorb bursts without queueing. Provision too few instances and your p99 explodes; provision too many and you pay for idle capacity.
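The sizing rule above fits in a few lines. This is a back-of-envelope helper, not a capacity planner; the `required_concurrency` name is chosen here for illustration:

```python
import math

def required_concurrency(qps: float, latency_s: float) -> int:
    """Little's law: requests in flight ~= arrival rate x time each request spends in the system."""
    return math.ceil(qps * latency_s)

print(required_concurrency(10_000, 0.05))  # 500 concurrent requests
```

In practice you would size against the latency you actually observe under load (queueing pushes p99 up as utilization rises), then add headroom on top of this floor.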
The Latency Budget Breakdown
Consider a payment fraud check with a 50ms total budget. You might allocate: feature reads from the database (15ms), model inference (10ms), safety checks (5ms), network overhead (10ms), buffer (10ms). Every component must respect its sub-budget or the whole system breaches Service Level Objectives (SLOs). Ad auction systems are even tighter: exchanges enforce roughly 100ms end-to-end, so bidders keep model scoring under 5 to 20ms to leave room for network hops, candidate retrieval, and feature fetch. At Meta or Google scale, that is tens of thousands of QPS sustained with sharp event-driven spikes.
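One way to keep sub-budgets honest is to encode the allocation and check it against the total. A minimal sketch using the fraud-check numbers above (the component names and `remaining_budget` helper are illustrative):

```python
TOTAL_BUDGET_MS = 50

# Per-component allocations for the payment fraud check (from the breakdown above).
allocation_ms = {
    "feature_reads": 15,
    "model_inference": 10,
    "safety_checks": 5,
    "network_overhead": 10,
    "buffer": 10,
}

def remaining_budget(allocation: dict, total_ms: int) -> int:
    """Return unallocated headroom in ms; negative means the plan already breaches the SLO."""
    return total_ms - sum(allocation.values())

print(remaining_budget(allocation_ms, TOTAL_BUDGET_MS))  # 0 -> budget fully allocated
```

A check like this belongs in CI or a config validator, so a component owner cannot quietly grow their slice past the SLO.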
Micro-batching: The Secret Weapon
GPU inference is expensive per request but efficient in batches. Micro-batching waits 5 to 20 milliseconds to accumulate a small batch (2 to 16 requests), then scores them together. This improves GPU utilization 3x to 5x while keeping p99 latency tolerable. The trade-off: waiting adds base latency. If your budget is 50ms and you wait 10ms to batch, you have only 40ms left for actual inference. Tune batch window and size based on traffic patterns and latency constraints.
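A micro-batch collector can be sketched with a thread-safe queue: wait up to the batch window for stragglers, but return as soon as the batch is full. The `collect_batch` function and its parameter values are illustrative, not a production scheduler:

```python
import queue
import time

def collect_batch(q: "queue.Queue", max_batch: int = 16, window_s: float = 0.010) -> list:
    """Accumulate up to max_batch requests, waiting at most window_s for stragglers.

    Returns early when the batch is full; never blocks past the window.
    """
    deadline = time.monotonic() + window_s
    batch = []
    while len(batch) < max_batch:
        timeout = deadline - time.monotonic()
        if timeout <= 0:
            break  # batch window expired; score whatever we have
        try:
            batch.append(q.get(timeout=timeout))
        except queue.Empty:
            break  # no more requests arrived within the window
    return batch
```

The returned batch is then scored in one GPU call. Note the asymmetry: a full batch returns immediately (no wasted wait), while a lone request pays the entire window, which is why the window must be subtracted from the per-request latency budget.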
When Real-time Is Worth It
Choose real-time when per-interaction value is high and wrong or late decisions incur immediate loss. Payment fraud costs merchants chargebacks and fees. Ad auctions lose revenue if you cannot bid in time. Ride dispatch degrades if ETA predictions are stale by minutes. But real-time is expensive: always-on capacity, warm pools to avoid cold-start penalties (5 to 30 seconds to load a model), and redundancy for availability. You are paying for p99 performance 24/7, even during low-traffic hours.
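The cost of always-on provisioning is easy to quantify: size the fleet for peak traffic, then see what fraction of that capacity sits idle at average load. The numbers and the `idle_fraction` helper below are illustrative assumptions, not figures from any particular system:

```python
import math

def idle_fraction(peak_qps: float, avg_qps: float, per_instance_qps: float) -> float:
    """Share of peak-sized capacity that is idle at average traffic."""
    instances = math.ceil(peak_qps / per_instance_qps)      # fleet sized for peak
    avg_instances_busy = avg_qps / per_instance_qps         # instances doing work on average
    return 1 - avg_instances_busy / instances

# Example: peak 10,000 QPS, average 3,000 QPS, 200 QPS per instance
# -> 50 instances provisioned, 70% of them idle on average.
print(idle_fraction(10_000, 3_000, 200))
```

Cold-start penalties of 5 to 30 seconds are what rule out naive scale-to-zero here: by the time a new instance has loaded the model, the latency budget has been breached thousands of times over, so the warm pool and its idle fraction are the price of the p99.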