Real-Time Inference Budgets and Micro-Batching
Real-time inference operates under strict end-to-end latency budgets that cascade through multiple components. For example, a typical e-commerce product page might have a 150 millisecond p95 render budget. You must allocate sub-budgets: feature reads from the online feature store get 20 to 40 milliseconds, candidate retrieval gets 10 to 30 milliseconds, model scoring gets 5 to 20 milliseconds, and you reserve 10 to 20 milliseconds as a safety margin for network variance and retry logic. Exceed your budget and you violate user-experience SLAs or trigger timeouts that cascade through dependent services.
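As a quick sanity check, a budget like this can be validated as a worst-case sum of its sub-budgets. The sketch below mirrors the illustrative e-commerce numbers above; the component names and the choice to sum worst-case values are assumptions for illustration, not a prescribed allocation.

```python
# Sanity check for the hypothetical 150 ms product-page budget above.
# Component names and figures are illustrative (upper ends of each range).

PAGE_BUDGET_MS = 150

SUB_BUDGETS_MS = {
    "feature_store_read": 40,   # 20-40 ms allocation
    "candidate_retrieval": 30,  # 10-30 ms allocation
    "model_scoring": 20,        # 5-20 ms allocation
    "safety_margin": 20,        # network variance and retry logic
}

def budget_is_feasible(page_budget_ms: int, sub_budgets_ms: dict) -> bool:
    """Return True if the worst-case sum of sub-budgets fits the page budget."""
    return sum(sub_budgets_ms.values()) <= page_budget_ms

total = sum(SUB_BUDGETS_MS.values())
print(f"worst-case total: {total} ms / {PAGE_BUDGET_MS} ms budget, "
      f"feasible: {budget_is_feasible(PAGE_BUDGET_MS, SUB_BUDGETS_MS)}")
```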
Micro-batching is a critical technique for improving GPU utilization while respecting latency constraints. Instead of scoring each request individually, the serving system waits a short window, typically 1 to 20 milliseconds, to accumulate a small batch of 2 to 16 requests, then scores them together on the GPU. This can improve throughput by 2 to 5x compared to single-request inference. The tradeoff: the first request in the window pays the full wait penalty, so you tune the window size based on your p99 latency target and traffic patterns.
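One way to implement such a window is an asynchronous batching loop: the first request opens the window, later arrivals join until the window closes or the batch cap is hit, and one GPU call scores the whole batch. The sketch below assumes a hypothetical `model.predict(batch)` that scores a list of inputs in one call; it is illustrative, not any specific serving framework's API.

```python
import asyncio
import time

MAX_BATCH = 16    # batch cap from the 2-16 request range above
WINDOW_MS = 10    # wait window from the 1-20 ms range above

async def batcher(queue: asyncio.Queue, model) -> None:
    """Background task: drain the queue in micro-batches and score them."""
    while True:
        # Block until at least one request arrives, then open the window.
        batch = [await queue.get()]
        deadline = time.monotonic() + WINDOW_MS / 1000

        # Accumulate requests until the window closes or the batch is full.
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
            except asyncio.TimeoutError:
                break

        # One GPU call for the whole batch; fan results back out per request.
        scores = model.predict([features for features, _ in batch])
        for (_, future), score in zip(batch, scores):
            future.set_result(score)

async def score(queue: asyncio.Queue, features):
    """Called per request: enqueue the input and await its score."""
    future = asyncio.get_running_loop().create_future()
    await queue.put((features, future))
    return await future
```

Run `batcher(queue, model)` as a long-lived background task; request handlers call `score(queue, features)` and simply await their result.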
In practice, systems combine micro-batching with aggressive timeouts, circuit breakers, and fallback logic. Google ad auctions enforce approximately 100 milliseconds end to end, leaving bidders 5 to 20 milliseconds for model scoring after accounting for network and feature fetch. Payment fraud systems at scale must decide within 50 milliseconds while handling 5,000 to 50,000 queries per second at peak. Miss your budget and you either block legitimate transactions, increasing abandonment, or let fraud through, increasing chargebacks.
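A rough shape for that protection layer is sketched below, assuming an async `score_with_model` call and a cheap `baseline_score` heuristic (for example, a popularity prior or cached score); both names are hypothetical, and real systems typically use a more sophisticated breaker than a consecutive-failure counter.

```python
import asyncio

SCORING_TIMEOUT_S = 0.020   # 20 ms slice of the end-to-end budget
BREAKER_THRESHOLD = 50      # consecutive failures before the breaker trips
consecutive_failures = 0

async def score_or_fallback(request, score_with_model, baseline_score):
    """Score with the model inside a strict timeout; degrade to a heuristic otherwise."""
    global consecutive_failures
    if consecutive_failures >= BREAKER_THRESHOLD:
        # Breaker open: skip the model entirely until it is reset elsewhere.
        return baseline_score(request)
    try:
        result = await asyncio.wait_for(score_with_model(request), SCORING_TIMEOUT_S)
        consecutive_failures = 0
        return result
    except (asyncio.TimeoutError, ConnectionError):
        consecutive_failures += 1
        return baseline_score(request)
```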
💡 Key Takeaways
• End-to-end latency budgets cascade through components with strict allocation: violating any sub-budget triggers timeouts that can cascade and degrade the entire system
• Micro-batching waits 1 to 20 milliseconds to accumulate small batches of 2 to 16 requests, improving GPU throughput by 2 to 5x while adding latency to the first request in each window
• Concurrency equals queries per second times p99 latency in seconds: a service handling 10,000 queries per second with 50 millisecond p99 needs approximately 500 concurrent threads or connections (see the sizing sketch after this list)
• Cold-start penalties range from 5 to 30 seconds for model loading and JIT compilation, making warm pools essential for services with p99 latency targets under 100 milliseconds
• Circuit breakers and admission control protect SLOs by rejecting low-priority traffic first as saturation approaches, preventing cascading failures across dependent services
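The concurrency rule above is a conservative application of Little's Law (using p99 rather than mean latency). A tiny sizing sketch follows; the headroom factor is an added assumption, not part of the rule itself.

```python
import math

def required_concurrency(qps: float, p99_latency_s: float, headroom: float = 1.0) -> int:
    """In-flight requests ≈ arrival rate × time each request holds a slot."""
    return math.ceil(qps * p99_latency_s * headroom)

print(required_concurrency(10_000, 0.050))                # 500, matching the takeaway
print(required_concurrency(10_000, 0.050, headroom=1.2))  # 600 with 20% provisioning headroom
```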
📌 Examples
Google ad bidder: 100 millisecond total auction budget; the bidder keeps model scoring within 5 to 20 milliseconds using CPU-served tree ensembles, leaving room for network and feature fetch at 10,000+ queries per second
TikTok For You page: 200 millisecond p95 budget for the entire ranking pipeline; the online ranker gets a 50 millisecond slice to score 500 candidates using micro-batching with 10 millisecond windows
Real-time LLM chat (ChatGPT style): first-token latency target of 200 to 800 milliseconds, then streaming at 20 to 100 tokens per second; small batches of 1 to 8 requests keep the interactive feel while maximizing GPU utilization