
How Latency and Throughput Interact Through Queuing and Utilization

Latency and throughput interact fundamentally through queuing behavior. For a single-server queue with service rate μ (requests per second the system can handle) and arrival rate λ (incoming requests per second), average wait time grows as 1/(μ − λ). This relationship creates a knee in the latency curve: as utilization (λ/μ) approaches 100%, queue wait times explode nonlinearly. A system running at 80% utilization might have 10 ms of queuing delay, but at 95% utilization that same system could see 200 ms. This is why production systems that run hot show highly volatile tail latency, and why keeping utilization below 70% is a common rule of thumb for services with strict latency SLOs.

Techniques that increase throughput often add latency, and techniques that reduce latency can reduce throughput. Batching is the classic example: collecting requests into batches amortizes overhead (fewer system calls, better device utilization, improved compression ratios) and increases throughput significantly, but it adds queuing delay equal to the batch window. If you batch in 1-second windows, each item can wait up to 1 second before processing begins. Compression trades CPU time for reduced data transfer: on a 100 Mbps link, compressing a 1 MB payload to 500 KB and spending 5 ms of CPU reduces total latency from 80 ms to 45 ms (a win). On a 1 Gbps link, the same payload takes 8 ms to send uncompressed, so compression yields 4 ms of send time plus 5 ms of CPU, 9 ms total (a loss). The design task is choosing where to sit on the latency/throughput curve for each workload.

Fanout architectures amplify tail latency through probability. If a user request fans out to 100 backend services and each service has a per-call p99 of 100 ms, the probability that all 100 return under 100 ms is 0.99^100 ≈ 0.366, or about 37%. That means 63% of user requests will exceed 100 ms solely due to tail amplification, even though each individual service looks healthy in isolation with 99% of calls under 100 ms. This motivates hedged requests (issuing a duplicate request after a delay threshold), aggressive timeouts, and reducing fanout width and depth. Google Search pipelines minimize fanout and use request reordering to mitigate this effect.
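The queuing knee can be made concrete with a few lines of arithmetic. Below is a minimal sketch, assuming an M/M/1 model (Poisson arrivals, exponential service, single server) and an illustrative service rate of 100 requests/second; the `mm1_queue_wait` helper is hypothetical and simply evaluates Wq = ρ/(μ − λ), so the exact milliseconds will differ from any real service, but the blow-up near saturation is the point:

```python
# Minimal sketch of the queuing knee, assuming an M/M/1 queue (Poisson
# arrivals, exponential service, single server). Real services are not M/M/1,
# but the 1/(mu - lambda) blow-up near saturation has the same shape.

MU = 100.0  # illustrative service rate: 100 requests/second (10 ms mean service time)

def mm1_queue_wait(arrival_rate: float, service_rate: float = MU) -> float:
    """Mean time a request spends waiting in queue, in seconds: Wq = rho / (mu - lambda)."""
    if arrival_rate >= service_rate:
        return float("inf")  # past saturation the queue grows without bound
    rho = arrival_rate / service_rate        # utilization
    return rho / (service_rate - arrival_rate)

for utilization in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
    wait_ms = mm1_queue_wait(utilization * MU) * 1000
    print(f"utilization {utilization:4.0%}: mean queue wait ~ {wait_ms:6.1f} ms")
```

With a 10 ms service time this model's mean queue wait roughly quadruples between 70% and 90% utilization and approaches a full second near 99%, the same qualitative cliff described above.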
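The compression trade is likewise easy to script. This sketch uses the illustrative figures from the text (1 MB payload, 500 KB compressed, 5 ms CPU); `transfer_ms` and `total_latency_ms` are hypothetical helpers, and decimal megabytes and megabits are used so the numbers line up with the 80 ms figure above:

```python
# Minimal sketch of the compression trade: transfer time vs CPU cost on two
# link speeds. Payload sizes and the 5 ms compression cost are the illustrative
# figures from the text, not measurements.

def transfer_ms(payload_bytes: float, link_mbps: float) -> float:
    """Time to push the payload over the link, in milliseconds."""
    bits = payload_bytes * 8
    return bits / (link_mbps * 1_000_000) * 1000

def total_latency_ms(payload_bytes, link_mbps, compressed_bytes=None, cpu_ms=0.0):
    """Send time, plus compression CPU time if we compress first."""
    if compressed_bytes is None:
        return transfer_ms(payload_bytes, link_mbps)
    return cpu_ms + transfer_ms(compressed_bytes, link_mbps)

ONE_MB = 1_000_000  # decimal MB/KB, matching the 80 ms figure in the text

for link in (100, 1000):  # Mbps
    plain = total_latency_ms(ONE_MB, link)
    packed = total_latency_ms(ONE_MB, link, compressed_bytes=500_000, cpu_ms=5.0)
    verdict = "win" if packed < plain else "loss"
    print(f"{link:>4} Mbps: {plain:5.1f} ms uncompressed vs {packed:5.1f} ms compressed ({verdict})")
```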
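The fanout math, plus a rough view of why hedging helps, can be sketched the same way. Note the simplification: this treats a hedged call as a second, fully independent copy of the request, which overstates the benefit of the delayed hedges described above; it is only meant to show the direction of the effect, assuming independent call latencies.

```python
# Minimal sketch of tail amplification under fanout, plus the effect of a
# hedged (duplicated) request. Assumes call latencies are independent, which
# real backends only approximate.

def p_all_fast(p_single_fast: float, fanout: int) -> float:
    """Probability that every one of `fanout` independent calls beats the threshold."""
    return p_single_fast ** fanout

def p_fast_with_hedge(p_single_fast: float) -> float:
    """Probability that at least one of two independent copies beats the threshold."""
    p_slow = 1 - p_single_fast
    return 1 - p_slow * p_slow

p99 = 0.99
print(f"fanout 100, no hedging:   {p_all_fast(p99, 100):.3f} of requests fully under 100 ms")
# Hedging each call: per-call success rises from 0.99 to 1 - 0.01^2 = 0.9999,
# so the fanout product recovers to roughly 0.99 under this simplified model.
print(f"fanout 100, hedged calls: {p_all_fast(p_fast_with_hedge(p99), 100):.3f}")
```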
💡 Key Takeaways
Average wait time in a queue grows as 1/(μ − λ), where μ is the service rate and λ is the arrival rate; as utilization approaches 100%, latency explodes nonlinearly
Systems should operate at 50% to 70% utilization for stable p99 latency; at 95% utilization, small traffic spikes cause multi-second queuing delays without warning
Batching increases throughput by amortizing overhead but adds up to one batch window of latency per item; a 1-second batching window can delay each item by up to 1 second
Compression trades CPU for bandwidth: helps latency on slow links (100 Mbps link, 1 MB payload: 80 ms uncompressed vs 45 ms with compression) but hurts on fast links (1 Gbps link: 8 ms uncompressed vs 9 ms with compression)
Fanout amplifies tail latency: 100 services each with p99 of 100 ms means only 37% of user requests finish under 100 ms due to probability compounding (0.99^100 = 0.366)
Little's Law provides capacity planning guidance: serving 15,000 RPS at 70 ms latency requires roughly 1,050 requests in flight across all tiers (15,000 × 0.07 = 1,050); see the sketch just below
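A minimal sketch of the Little's Law arithmetic in the last takeaway (L = λ × W); the `requests_in_flight` helper is hypothetical:

```python
# Little's Law: requests in flight = arrival rate * time in system (L = lambda * W).

def requests_in_flight(arrival_rate_rps: float, latency_seconds: float) -> float:
    return arrival_rate_rps * latency_seconds

print(requests_in_flight(15_000, 0.070))  # -> 1050.0 concurrent requests across all tiers
```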
📌 Examples
Production capacity planning: if median service time is 5 ms per request (μ = 200 requests/second per core), 16 cores yield 3,200 requests/second; target peak arrival at 2,200 requests/second (69% utilization), or autoscale earlier to maintain tail latency SLOs (worked as a short sketch after this list)
Google Spanner: to provide external consistency, Spanner introduces commit wait equal to TrueTime uncertainty (typically ≤7 ms); this adds directly to write latency but enables strong consistency while scaling throughput horizontally
Compression trade calculus: 100 Mbps link with 1 MB payload takes 80 ms to send; compression to 500 KB with 5 ms CPU cost yields 40 ms send plus 5 ms CPU equals 45 ms total (win); on 1 Gbps link, 8 ms send becomes 4 ms send plus 5 ms CPU equals 9 ms (loss)
Bufferbloat edge case: large network buffers keep links busy and show high throughput metrics but add hundreds of milliseconds queuing latency under load, destroying interactive application performance despite high measured throughput
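A minimal sketch of the capacity-planning arithmetic from the first example above; the figures (5 ms median service time, 16 cores, 2,200 RPS planned peak) are that example's illustrative numbers, not measurements:

```python
# Capacity-planning arithmetic: per-core service rate, fleet capacity, and the
# utilization implied by the planned peak arrival rate.

median_service_time_s = 0.005               # 5 ms per request
per_core_rate = 1 / median_service_time_s   # mu = 200 requests/second per core
cores = 16
fleet_capacity = per_core_rate * cores      # 3,200 requests/second

target_peak_rps = 2_200
utilization = target_peak_rps / fleet_capacity
print(f"capacity {fleet_capacity:.0f} rps, planned peak {target_peak_rps} rps "
      f"-> {utilization:.0%} utilization")  # ~69%, just inside the 50-70% band
```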