How Latency and Throughput Interact Through Queuing and Utilization
The Queuing Relationship
Every system has a queue, visible or not. When requests arrive faster than they complete, they wait. The relationship follows a predictable pattern: as arrival rate approaches service rate, wait times grow without bound.
Consider a server that processes 100 requests per second. Under a simple single-server queueing model, at 50 RPS load the system holds about 1 request on average and latency is 2x the bare processing time. At 80 RPS, 4 requests are in the system and latency is 5x. At 95 RPS, 19 requests, latency 20x. At 99 RPS, 99 requests, latency 100x. This is queuing theory in action.
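A minimal sketch of where these numbers come from, assuming an M/M/1 queue (Poisson arrivals, exponential service times); the function name `mm1` is illustrative, not from the text:

```python
# M/M/1 queue: with utilization u = arrival_rate / service_rate,
# average number in system L = u / (1 - u), and total latency is
# 1 / (1 - u) times the bare service time.
def mm1(arrival_rps: float, service_rps: float) -> tuple[float, float]:
    """Return (avg requests in system, latency multiplier vs. service time)."""
    u = arrival_rps / service_rps  # utilization, must stay below 1
    if u >= 1:
        raise ValueError("unstable: arrivals meet or exceed capacity")
    in_system = u / (1 - u)
    latency_multiplier = 1 / (1 - u)
    return in_system, latency_multiplier

for rps in (50, 80, 95, 99):
    n, mult = mm1(rps, 100)
    print(f"{rps} RPS: {n:.0f} in system, latency {mult:.0f}x")
```

Running the loop reproduces the progression above: 1 request and 2x at 50 RPS, up to 99 requests and 100x at 99 RPS.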
Little's Law
Little's Law is the fundamental equation connecting latency and throughput: L = λ × W where L is concurrent requests in the system, λ (lambda) is throughput (requests per second), and W is average latency (seconds).
If your API handles 500 RPS with 200ms average latency: L = 500 × 0.2 = 100. You need capacity for 100 concurrent requests. Double the throughput without reducing latency, and you need capacity for 200 concurrent requests. This is why connection pools, thread counts, and memory scale with throughput × latency.
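The sizing arithmetic above is a one-line application of Little's Law; a sketch using the numbers from the text:

```python
# Little's Law: L = lambda * W
# L = concurrent requests in the system, lambda = throughput (req/s),
# W = average latency (seconds).
def concurrent_requests(throughput_rps: float, latency_s: float) -> float:
    return throughput_rps * latency_s

print(concurrent_requests(500, 0.200))   # 500 RPS at 200ms -> 100.0
print(concurrent_requests(1000, 0.200))  # doubled throughput -> 200.0
```

The same formula works in reverse: given a fixed pool of 100 connections and 200ms latency, the most throughput you can sustain is 100 / 0.2 = 500 RPS.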
The Utilization Cliff
The relationship between utilization and latency is nonlinear. At low utilization, adding load barely affects latency. Past 70% utilization, latency climbs steeply. Past 90%, small load increases cause massive latency spikes.
This happens because queuing delay follows wait_time = service_time × utilization / (1 − utilization). At 50% utilization, wait equals service time. At 90%, wait is 9x service time. At 99%, wait is 99x service time. This cliff is why production systems target 60 to 70% utilization, not 90%.
Batching: Trading Latency for Throughput
Batching is the clearest example of the trade-off. Instead of processing requests one at a time, you wait to collect a batch. Individual request latency increases by the wait time, but throughput increases because per-request overhead is amortized.
A database insert with 5ms overhead per request handles 200 inserts/second. Batch 100 inserts together with 10ms total overhead, and you handle 10,000 inserts/second. Latency increases by the batch collection time (maybe 50ms), but throughput increases 50x.
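The arithmetic behind that 50x can be sketched directly; all figures come from the example above (the 50ms collection window is the text's illustrative number, not derived):

```python
# Batching trade-off: amortize fixed per-request overhead across a batch.
per_request_overhead_s = 0.005  # 5 ms overhead per individual insert
batch_overhead_s = 0.010        # 10 ms total overhead for one batch
batch_size = 100

unbatched_rps = 1 / per_request_overhead_s       # 200 inserts/second
batched_rps = batch_size / batch_overhead_s      # 10,000 inserts/second
speedup = batched_rps / unbatched_rps            # 50x throughput gain

print(unbatched_rps, batched_rps, speedup)
```

The cost side of the trade is not in this calculation: each individual insert now also waits up to the batch collection time before its batch is flushed, so per-request latency rises even as aggregate throughput climbs.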