
Batching Trade-offs: Throughput vs Tail Latency

Batching inference requests is a fundamental trade-off between throughput and tail latency. Moving from single-request serving to a batch size of 4 on one GPU often raises tokens per second by 2 to 4×, directly reducing cost per 1,000 tokens and increasing hardware utilization. However, queueing delay raises p95 latency by 50 to 150 milliseconds when not carefully tuned, because requests wait for a batch to fill before processing begins. At high utilization, queueing delay exceeds service time, causing tail spikes that violate Service Level Objectives (SLOs).

Dynamic batching with a strict max-wait threshold balances the trade-off. Systems enforce a maximum wait of 10 to 20 milliseconds before flushing a partial batch, ensuring p95 latency stays within budget even at low query rates. When traffic is high, batches fill quickly and throughput stays high; when traffic is sparse, requests proceed as soon as the max wait expires, preventing tail violations. This pattern is critical for interactive systems targeting sub-200-millisecond end-to-end latency, where even 50 milliseconds of queueing delay consumes a significant portion of the budget (see the sketch below).

The cost-quality dimension adds another layer. Larger batches with lower-precision formats (FP16 or INT8 quantization) can yield a 2 to 4× throughput improvement with negligible quality loss for many ranking and recommendation tasks. Netflix and Uber use quantized models in production for feed ranking, achieving sub-100-millisecond p95 latency while serving 10,000 queries per second (QPS) per node. However, nuanced reasoning tasks or high-stakes predictions may degrade with quantization, requiring A/B tests to validate quality before rollout.

Failure modes include traffic storms from slow retries and token-level pathologies. When batches take too long due to long context or low token throughput (under 10 tokens per second from throttling), slow clients retry, amplifying load. Long outputs blow SLOs even if Time to First Token (TTFT) is fast. Admission control and load shedding when queue wait exceeds 50 to 100 milliseconds prevent cascading failures. Monitor both requests per second and tokens per second per device, as token throughput varies nonlinearly with context length and sampling settings.
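The following is a minimal asyncio sketch of the max-wait flush policy, not a production implementation: `run_batch` is assumed to be an async callable that performs the actual model forward pass on a list of requests, and `MAX_BATCH_SIZE` and `MAX_WAIT_S` are illustrative values drawn from the ranges above.

```python
import asyncio
import time
from typing import Any, List, Tuple

MAX_BATCH_SIZE = 4   # illustrative; matches the batch size discussed above
MAX_WAIT_S = 0.015   # 15 ms max wait before flushing a partial batch


class DynamicBatcher:
    """Collects requests and flushes a batch when it is full or the max wait expires."""

    def __init__(self, run_batch):
        self.run_batch = run_batch        # assumed async: List[request] -> List[result]
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, request: Any) -> Any:
        """Enqueue one request and wait for its result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((request, future))
        return await future

    async def _collect_batch(self) -> List[Tuple[Any, asyncio.Future]]:
        # Block until the first request arrives, then fill up to MAX_BATCH_SIZE
        # or until MAX_WAIT_S elapses, whichever comes first.
        batch = [await self.queue.get()]
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(self.queue.get(), remaining))
            except asyncio.TimeoutError:
                break  # flush the partial batch to protect tail latency
        return batch

    async def run(self) -> None:
        # Serving loop: under heavy traffic batches fill instantly; under sparse
        # traffic no request waits more than roughly MAX_WAIT_S in the queue.
        while True:
            batch = await self._collect_batch()
            results = await self.run_batch([request for request, _ in batch])
            for (_, future), result in zip(batch, results):
                future.set_result(result)
```

The design point is that `asyncio.wait_for` bounds how long a partial batch can sit in the queue, so queueing delay is capped near `MAX_WAIT_S` even when traffic is sparse.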
💡 Key Takeaways
Batching from single-request serving to a batch size of 4 increases tokens per second by 2 to 4×, reducing cost per 1,000 tokens but raising p95 latency by 50 to 150 milliseconds due to queueing delay
Dynamic batching with a max wait of 10 to 20 milliseconds flushes partial batches to prevent tail violations, critical for systems targeting sub-200-millisecond end-to-end latency
Quantization (FP16 or INT8) combined with batching yields a 2 to 4× throughput improvement with negligible quality loss for ranking tasks, but may degrade nuanced reasoning and requires A/B validation
At high utilization, queueing delay exceeds service time, causing tail spikes; admission control and load shedding at 50 to 100 milliseconds of queue wait prevent cascading failures (a minimal sketch follows this list)
Monitor both requests per second and tokens per second per device, since token throughput varies nonlinearly with context length and sampling settings and determines actual serving capacity (see the metric sketch after the examples)
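As a companion to the load-shedding takeaway above, here is a minimal sketch of queue-wait-based admission control. It assumes a single in-process queue shared with a serving worker; the 75 millisecond threshold is an illustrative midpoint of the 50 to 100 millisecond range, and what a rejection maps to (for example HTTP 429 or 503) is left to the caller.

```python
import time
from collections import deque
from typing import Any, Optional

MAX_QUEUE_WAIT_S = 0.075  # illustrative: midpoint of the 50-100 ms range


class AdmissionController:
    """Sheds load when the oldest queued request has waited too long."""

    def __init__(self, max_queue_wait_s: float = MAX_QUEUE_WAIT_S):
        self.max_queue_wait_s = max_queue_wait_s
        self.queue: deque = deque()  # entries of (enqueue_time, request)

    def try_admit(self, request: Any) -> bool:
        now = time.monotonic()
        # Estimate current queueing delay from the oldest waiting request.
        queue_wait = now - self.queue[0][0] if self.queue else 0.0
        if queue_wait > self.max_queue_wait_s:
            # Fail fast so slow clients do not pile up retries and amplify load.
            return False
        self.queue.append((now, request))
        return True

    def dequeue(self) -> Optional[Any]:
        """Called by the serving worker to take the next admitted request."""
        if not self.queue:
            return None
        _, request = self.queue.popleft()
        return request
```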
📌 Examples
Netflix feed ranking: quantized INT8 models with a batch size of 8 serve 10,000 QPS per node at sub-100-millisecond p95, reducing infrastructure cost by 60 percent versus FP32 single-request serving
Uber ETA prediction: dynamic batching with a 15 millisecond max wait maintains 80 millisecond p95 latency while achieving 3× the throughput of unbatched serving, critical for a sub-200-millisecond SLO
Meta ad ranking: switching from batch size 1 to batch size 16 with FP16 increased throughput 5×, but the initial deployment without a max wait caused p99 spikes to 400 milliseconds, fixed by enforcing a 20 millisecond flush
Airbnb search ranking: long-context inputs (2,000 tokens) reduced token throughput from 50 to 12 tokens per second, causing slow retries and traffic storms until admission control capped queue wait at 100 milliseconds
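Finally, because the section stresses watching requests per second alongside tokens per second per device, here is a minimal sliding-window sketch of that metric pair; the class name, device labels, and 10 second window are assumptions for illustration, and a real deployment would export these values to its metrics system rather than compute them in-process.

```python
import time
from collections import defaultdict
from typing import Dict, List, Tuple

class ThroughputMonitor:
    """Tracks requests/s and tokens/s per device over a sliding window."""

    def __init__(self, window_s: float = 10.0):
        self.window_s = window_s
        # device -> list of (completion_time, tokens_generated)
        self.events: Dict[str, List[Tuple[float, int]]] = defaultdict(list)

    def record(self, device: str, tokens_generated: int) -> None:
        """Call once per completed request on the given device."""
        self.events[device].append((time.monotonic(), tokens_generated))

    def rates(self, device: str) -> Tuple[float, float]:
        """Returns (requests_per_second, tokens_per_second) for one device."""
        cutoff = time.monotonic() - self.window_s
        recent = [(t, n) for t, n in self.events[device] if t >= cutoff]
        self.events[device] = recent  # drop events outside the window
        requests_per_s = len(recent) / self.window_s
        tokens_per_s = sum(n for _, n in recent) / self.window_s
        return requests_per_s, tokens_per_s
```

Tracking both rates per device makes it visible when, as in the example above, long contexts collapse token throughput even though the request rate still looks healthy.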