Batching Trade-offs: Throughput vs. Tail Latency
The Throughput vs. Latency Trade-off
Batching inference requests is a fundamental trade-off between throughput and tail latency. Moving from single-request serving to a batch size of 4 on one GPU often raises tokens per second by 2 to 4x, directly reducing cost per 1,000 tokens and increasing hardware utilization. However, if batching is not carefully tuned, queueing delay can add 50 to 150 ms to p95 latency, because requests wait for a batch to fill before processing begins. At high utilization, queueing delay can exceed service time itself, producing tail spikes that violate SLOs.
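The trade-off can be made concrete with a back-of-envelope model. The function below is a hypothetical sketch (the name `batch_tradeoff` and all numbers are illustrative, not from any real system): throughput scales with batch size, while the average batch-fill wait grows with batch size and shrinks with arrival rate.

```python
def batch_tradeoff(batch_size: int, batch_time_ms: float, arrival_qps: float):
    """Back-of-envelope model of dynamic batching.

    batch_time_ms: time the GPU takes to process one full batch.
    arrival_qps:   request arrival rate.
    Returns (throughput in requests/s, mean batch-fill wait in ms).
    """
    # Throughput: the whole batch completes in one batch_time.
    throughput_qps = batch_size / (batch_time_ms / 1000.0)
    # Queueing delay from waiting for the batch to fill: the k-th arrival
    # waits for the remaining (batch_size - k) arrivals; averaged over
    # positions this is (batch_size - 1) / (2 * arrival rate).
    fill_wait_ms = (batch_size - 1) / (2.0 * arrival_qps) * 1000.0
    return throughput_qps, fill_wait_ms
```

For example, at 100 QPS with a batch of 4 taking 40 ms, throughput is 100 requests/s while the average fill wait is 15 ms, which is exactly the kind of queueing delay that must fit inside a latency budget.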
Dynamic Batching with Max Wait
Systems enforce a maximum wait of 10 to 20 ms before flushing a partial batch, ensuring p95 latency stays within budget even at low query rates. When traffic is high, batches fill quickly and throughput stays high. When traffic is sparse, requests proceed as soon as the max wait expires, preventing tail violations. This pattern is critical for interactive systems targeting sub-200 ms end-to-end latency, where even 50 ms of queueing delay consumes a significant portion of the budget.
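A minimal sketch of this max-wait flush policy, using only the standard library (the class name `DynamicBatcher` and the defaults are illustrative assumptions, not a real serving API):

```python
import queue
import time


class DynamicBatcher:
    """Collect requests into batches; flush on max_batch OR max_wait_ms."""

    def __init__(self, max_batch: int = 4, max_wait_ms: float = 15.0):
        self._q = queue.Queue()
        self.max_batch = max_batch
        self.max_wait_s = max_wait_ms / 1000.0

    def submit(self, request) -> None:
        self._q.put(request)

    def next_batch(self) -> list:
        # Block until the first request arrives, then wait at most
        # max_wait for the rest of the batch to fill.
        batch = [self._q.get()]
        deadline = time.monotonic() + self.max_wait_s
        while len(batch) < self.max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break  # max wait expired: flush the partial batch
            try:
                batch.append(self._q.get(timeout=remaining))
            except queue.Empty:
                break  # queue drained before the batch filled
        return batch
```

Under load, `next_batch` returns a full batch almost immediately; under sparse traffic, a partial batch is flushed after at most `max_wait_ms`, which is what keeps the tail bounded.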
The Cost-Quality Dimension
Larger batches combined with lower-precision formats (FP16 or INT8 quantization) can yield a 2 to 4x throughput improvement with negligible quality loss for many ranking and recommendation tasks. Netflix and Uber use quantized models in production for feed ranking, achieving sub-100 ms p95 latency while serving 10,000 QPS per node. However, nuanced reasoning tasks may degrade under quantization, so quality should be validated with A/B tests before rollout.
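To make the quantization step concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in pure Python (the function names are illustrative; production systems use framework-provided quantization, not hand-rolled code like this). Each weight w is approximated as scale * q with q an integer in [-127, 127], so the worst-case rounding error per weight is scale / 2.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: w ~= scale * q."""
    # One scale for the whole tensor, chosen so the largest weight
    # maps to the edge of the int8 range.
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return [x * scale for x in q]
```

The reconstruction error is bounded by half a quantization step, which is why quality loss is often negligible for coarse scoring tasks but can matter when small weight differences carry meaning.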
Traffic Storm Failures
When batches take too long to complete, whether from long contexts or throttled token throughput (under 10 tokens per second), slow clients time out and retry, amplifying load. Admission control and load shedding when queueing delay exceeds 50 to 100 ms prevent cascading failures. Monitor both requests per second and tokens per second per device, since token throughput varies nonlinearly with context length.
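A simple form of this admission control can be sketched as follows (the class name `AdmissionController` and the thresholds are illustrative assumptions): estimate the queueing delay from current queue depth and a per-request service-time estimate, and reject new work once the estimate exceeds the budget, so retries hit a fast rejection instead of a growing queue.

```python
from collections import deque


class AdmissionController:
    """Shed load when estimated queueing delay exceeds a budget."""

    def __init__(self, max_queue_delay_ms: float = 75.0,
                 est_service_ms: float = 10.0):
        self.queue = deque()
        self.max_queue_delay_ms = max_queue_delay_ms
        self.est_service_ms = est_service_ms

    def try_enqueue(self, request) -> bool:
        # Estimated wait for a new arrival: everything already queued
        # must be served first.
        est_wait_ms = len(self.queue) * self.est_service_ms
        if est_wait_ms > self.max_queue_delay_ms:
            # Shed: the caller should return an immediate 429 /
            # retry-after rather than let the queue grow unboundedly.
            return False
        self.queue.append(request)
        return True
```

Rejecting early keeps accepted requests inside their latency budget; the alternative, accepting everything, lets queueing delay grow until every response is late and clients retry, which is exactly the cascade the section warns about.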