Model Serving & Inference › Monitoring & Observability (Latency, Drift, Performance) · Medium · ⏱️ ~3 min

Batching Trade-offs: Throughput vs. Tail Latency

The Throughput vs. Latency Trade-off

Batching inference requests is a fundamental trade-off between throughput and tail latency. Moving from single-request serving to a batch size of 4 on one GPU often raises tokens per second by 2 to 4x, directly reducing cost per 1,000 tokens and increasing hardware utilization. However, queueing delay can add 50 to 150 ms to p95 latency when not carefully tuned, since requests wait for a batch to fill before processing begins. At high utilization, queueing delay exceeds service time, causing tail spikes that violate SLOs.
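The two sides of the trade-off can be sketched with a toy cost model. This is an illustrative back-of-the-envelope calculation, not a real serving framework; the efficiency factor and the Poisson-arrival assumption behind the fill-delay formula are assumptions.

```python
# Sketch: average batch-fill delay vs. batched throughput.
# Assumes Poisson arrivals at `arrival_rate_qps`; after the first request
# arrives, the remaining (B - 1) requests take on average
# (B - 1) / (2 * rate) seconds to show up.
def batch_fill_delay_ms(batch_size: int, arrival_rate_qps: float) -> float:
    return (batch_size - 1) / (2 * arrival_rate_qps) * 1000.0

def tokens_per_second(batch_size: int, per_request_tps: float,
                      efficiency: float = 0.8) -> float:
    # Batched throughput scales sub-linearly; `efficiency` is an assumed
    # factor capturing kernel and memory-bandwidth overheads.
    return batch_size * per_request_tps * efficiency

# At 100 QPS, waiting to fill a batch of 4 adds ~15 ms of average delay,
# while throughput (under the assumed 0.8 efficiency) roughly triples.
fill_ms = batch_fill_delay_ms(4, 100.0)      # 15.0 ms
batched_tps = tokens_per_second(4, 30.0)     # 96.0 tok/s vs. 30 unbatched
```

At higher arrival rates the fill delay shrinks toward zero, which is why batching is nearly free at peak traffic and most painful at low QPS.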

Dynamic Batching with Max Wait

Serving systems enforce a maximum wait of 10 to 20 ms before flushing a partial batch, ensuring p95 latency stays within budget even at low query rates. When traffic is high, batches fill quickly and throughput stays high. When traffic is sparse, requests proceed as soon as the max wait expires, preventing tail violations. This pattern is critical for interactive systems targeting sub-200 ms end-to-end latency, where even 50 ms of queueing delay consumes a significant portion of the budget.
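A minimal sketch of this flush-on-timeout logic, assuming a single consumer thread pulling batches from a shared queue. The class and method names are illustrative, not a real serving framework's API:

```python
import queue
import time

class DynamicBatcher:
    """Collects requests into batches of up to `max_batch`, but never
    holds a partial batch longer than `max_wait_s` (e.g. 10-20 ms)."""

    def __init__(self, max_batch: int = 4, max_wait_s: float = 0.015):
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.q: queue.Queue = queue.Queue()

    def next_batch(self) -> list:
        # Block until the first request arrives; its arrival starts the clock.
        batch = [self.q.get()]
        deadline = time.monotonic() + self.max_wait_s
        while len(batch) < self.max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break  # max wait expired: flush the partial batch
            try:
                batch.append(self.q.get(timeout=remaining))
            except queue.Empty:
                break  # nothing else arrived in time
        return batch
```

The key property: a lone request waits at most `max_wait_s`, bounding the queueing contribution to tail latency regardless of traffic level.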

Cost-Quality Dimension

Larger batches combined with lower-precision formats (FP16 or INT8 quantization) can yield a 2 to 4x throughput improvement with negligible quality loss for many ranking and recommendation tasks. Netflix and Uber run quantized models in production for feed ranking, achieving sub-100 ms p95 latency while serving 10,000 QPS per node. However, nuanced reasoning tasks may degrade under quantization, requiring A/B tests to validate quality before rollout.

Traffic Storm Failures

When batches take too long to complete, whether from long contexts or throttled token throughput (under 10 tokens per second), slow clients time out and retry, amplifying load. Admission control and load shedding once estimated queueing delay exceeds 50 to 100 ms prevent cascading failures. Monitor both requests per second and tokens per second per device, since token throughput varies nonlinearly with context length.
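Admission control of this kind can be sketched as a queue-depth check: reject new work when the estimated wait already exceeds the budget, so clients fail fast instead of retrying into a saturated server. The class name, threshold, and service-time estimate are illustrative assumptions:

```python
class AdmissionController:
    """Sheds load when estimated queueing delay exceeds `max_queue_ms`."""

    def __init__(self, service_time_ms: float, max_queue_ms: float = 100.0):
        self.service_time_ms = service_time_ms  # assumed avg time per request
        self.max_queue_ms = max_queue_ms        # e.g. the 50-100 ms bound
        self.queue_depth = 0

    def estimated_wait_ms(self) -> float:
        # Simple estimate: requests ahead of us times average service time.
        return self.queue_depth * self.service_time_ms

    def try_admit(self) -> bool:
        if self.estimated_wait_ms() > self.max_queue_ms:
            return False  # shed load: client should back off, not retry hot
        self.queue_depth += 1
        return True

    def complete(self) -> None:
        self.queue_depth -= 1
```

Rejecting early, before the request consumes queue space, is what breaks the retry amplification loop: shed requests return in microseconds rather than timing out and retrying.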

💡 Key Takeaways
- Batching from single-request serving to batch size 4 increases tokens per second by 2 to 4×, reducing cost per 1,000 tokens but raising p95 latency by 50 to 150 ms due to queueing delay
- Dynamic batching with a max wait of 10 to 20 ms flushes partial batches to prevent tail violations, critical for systems targeting sub-200 ms end-to-end latency
- Quantization (FP16 or INT8) combined with batching yields a 2 to 4× throughput improvement with negligible quality loss for ranking tasks, but may degrade nuanced reasoning, requiring A/B validation
- At high utilization, queueing delay exceeds service time, causing tail spikes; admission control and load shedding at 50 to 100 ms queue depth prevent cascading failures
- Monitor both requests per second and tokens per second per device, as token throughput varies nonlinearly with context length and sampling settings, impacting actual serving capacity
📌 Interview Tips
1. Netflix feed ranking: quantized INT8 models with batch size 8 serve 10,000 QPS per node at sub-100 ms p95, reducing infrastructure cost by 60 percent versus FP32 single-request serving
2. Uber ETA prediction: dynamic batching with a 15 ms max wait maintains 80 ms p95 latency while achieving 3× throughput versus no batching, critical for a sub-200 ms SLO
3. Meta ad ranking: switching from batch size 1 to batch size 16 with FP16 increased throughput 5×, but the initial deployment without a max wait caused p99 spikes to 400 ms, fixed by enforcing a 20 ms flush
4. Airbnb search ranking: long-context inputs (2,000 tokens) reduced token throughput from 50 to 12 tokens per second, causing slow retries and traffic storms until admission control capped queue depth at 100 ms