Implementation Patterns: Latency Budgets, Hedging, and BDP-Aware Tuning
Implementing latency and throughput optimization requires concrete techniques grounded in measurement and budgeting. Start by setting end-to-end latency SLOs by percentile, not average: for example, p99 ≤ 200 ms for user-facing requests. Decompose this budget across all hops in the request path with explicit per-hop allocations. A 200 ms budget might allocate 20 ms for network RTT to an edge POP, 5 to 10 ms for CDN or cache lookup, 40 to 60 ms for origin compute, and headroom for up to 2 parallel backend calls with p99 of 30 ms each plus hedging after 40 ms. Every component must honor its budget via timeouts and deadlines propagated through the call stack. Collect high-resolution latency histograms (not averages) using tools like HdrHistogram and track tail behavior under varying load. Use Little's Law to size concurrency: if you need 15,000 RPS at 70 ms p95 latency, you need roughly 1,050 requests in flight (15,000 × 0.07 s), so size thread pools, connection pools, and async event loops accordingly.
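To make deadline propagation concrete, here is a minimal Go sketch. The budget constants mirror the allocation above, and lookupCache/computeAtOrigin are hypothetical stand-ins for real hops: each one receives a context whose deadline can never exceed what remains of the end-to-end budget, so overruns fail fast instead of timing out deep in the stack.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Hypothetical per-hop budgets carved out of the 200 ms end-to-end SLO.
const (
	totalBudget  = 200 * time.Millisecond
	cacheBudget  = 10 * time.Millisecond
	originBudget = 60 * time.Millisecond
)

// Little's Law sizing: 15,000 RPS × 0.07 s p95 ≈ 1,050 requests in flight,
// so pools and event loops should be provisioned for that concurrency.

func lookupCache(ctx context.Context) error {
	select {
	case <-time.After(5 * time.Millisecond): // simulated lookup time
		return errors.New("cache miss") // force the origin path in this demo
	case <-ctx.Done():
		return ctx.Err()
	}
}

func computeAtOrigin(ctx context.Context) error {
	select {
	case <-time.After(40 * time.Millisecond): // simulated origin compute
		return nil
	case <-ctx.Done():
		return ctx.Err() // budget exceeded: fail fast instead of queuing
	}
}

// handleRequest gives each hop a sub-deadline that can never exceed what
// remains of the end-to-end budget.
func handleRequest(parent context.Context) error {
	ctx, cancel := context.WithTimeout(parent, totalBudget)
	defer cancel()

	cacheCtx, cancelCache := context.WithTimeout(ctx, cacheBudget)
	defer cancelCache()
	if err := lookupCache(cacheCtx); err == nil {
		return nil // cache hit: done well under budget
	}

	originCtx, cancelOrigin := context.WithTimeout(ctx, originBudget)
	defer cancelOrigin()
	return computeAtOrigin(originCtx)
}

func main() {
	start := time.Now()
	err := handleRequest(context.Background())
	fmt.Printf("finished in %v, err=%v\n", time.Since(start), err)
}
```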
Controlling queuing is essential for maintaining tail latency under load. Keep utilization below 70% to 80% in steady state; beyond that point, tail latency spikes nonlinearly. Implement admission control that rejects or queues requests early when approaching saturation rather than letting them enter the system and time out deep in the stack. Prefer multiple small independent queues over a single large FIFO queue to avoid head-of-line blocking, where one slow job stalls everything behind it. Bound queue lengths explicitly to cap worst-case latency: if your SLO allows 200 ms end to end and one hop has 50 ms service time, queue length should not exceed 3 to 4 items, since three waiting items plus your own 50 ms of service already consume the full budget. Use adaptive batching that batches when queues are non-empty (amortizing overhead) but dispatches immediately when queues are empty, combined with a latency guard that flushes every X milliseconds to bound per-item latency.
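A sketch of bounded admission plus adaptive batching in Go follows; maxBatch, flushEvery, and queueCap are illustrative values, not recommendations. submit sheds load at the door when the bounded queue is full, while batcher drains greedily while the queue is non-empty but dispatches the moment it drains, with the timer as a backstop.

```go
package main

import (
	"fmt"
	"time"
)

const (
	maxBatch   = 32                    // cap batch size to bound work per dispatch
	flushEvery = 10 * time.Millisecond // latency guard: max time an item waits
	queueCap   = 4                     // bounded queue caps worst-case queuing delay
)

// submit is the admission-control point: reject when the bounded queue is
// full instead of letting work time out deep in the stack.
func submit(queue chan<- int, item int) bool {
	select {
	case queue <- item:
		return true
	default:
		return false // shed load at the door
	}
}

// batcher batches while items are waiting, dispatches the moment the queue
// drains, and never holds a batch longer than flushEvery.
func batcher(queue <-chan int, dispatch func([]int)) {
	for first := range queue {
		batch := []int{first}
		guard := time.NewTimer(flushEvery)
	fill:
		for len(batch) < maxBatch {
			select {
			case item, ok := <-queue:
				if !ok {
					break fill // producer closed the queue
				}
				batch = append(batch, item)
			case <-guard.C:
				break fill // guard fired: flush even if items keep arriving
			default:
				break fill // queue empty: dispatch immediately
			}
		}
		guard.Stop()
		dispatch(batch)
	}
}

func main() {
	queue := make(chan int, queueCap)
	done := make(chan struct{})
	go func() {
		batcher(queue, func(b []int) { fmt.Println("dispatch:", b) })
		close(done)
	}()
	for i := 0; i < 10; i++ {
		if !submit(queue, i) {
			fmt.Println("shed:", i)
		}
	}
	close(queue)
	<-done
}
```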
Network throughput optimization requires BDP-aware tuning. The bandwidth-delay product (BDP = bandwidth × RTT) determines how much data must be in flight to fully utilize a link. On a 5 Gbps path with 60 ms RTT, BDP is 5 Gbps ÷ 8 × 0.06 s ≈ 37.5 MB. If TCP window sizes, socket buffers, or application-level flow-control windows are smaller than the BDP, throughput is capped below link capacity. Modern Linux systems auto-tune receive buffers, but send buffers and application logic may need explicit tuning for high-throughput transfers over high-latency paths. Hedged requests improve tail latency by issuing a duplicate request once the original exceeds a high-percentile threshold (typically the p95 or p99 latency observed under normal conditions). Cap concurrent hedges (for example, at most 1 backup per original) and add per-client jitter to avoid synchronized bursts. Ensure idempotency and implement server-side duplicate suppression using request IDs. Hedging trades a modest increase in load (10% to 20% more requests under normal conditions, higher under degradation) for significant tail latency improvements, often cutting p99 by 2× to 5×.
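A hedged-request sketch in Go, assuming an idempotent backend call; baseHedgeDelay stands in for the measured p95, and the 10 ms jitter bound is illustrative. At most one backup is launched, and whichever attempt answers first wins while the loser is cancelled.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

const baseHedgeDelay = 40 * time.Millisecond // stand-in for the backend's measured p95

// hedged runs call once, launches at most one backup if the original is
// still outstanding after a jittered hedge delay, and returns the first
// result. call must be idempotent; real servers should also deduplicate
// by request ID.
func hedged(ctx context.Context, call func(context.Context) (string, error)) (string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancel the losing attempt once a winner returns

	type result struct {
		val string
		err error
	}
	results := make(chan result, 2) // buffered so the loser never blocks

	launch := func() {
		val, err := call(ctx)
		results <- result{val, err}
	}
	go launch()

	// Per-client jitter spreads hedges out and avoids synchronized bursts.
	delay := baseHedgeDelay + time.Duration(rand.Int63n(int64(10*time.Millisecond)))
	hedgeTimer := time.NewTimer(delay)
	defer hedgeTimer.Stop()

	select {
	case r := <-results:
		return r.val, r.err // original answered before the hedge threshold
	case <-hedgeTimer.C:
		go launch() // cap: exactly one backup per original
	case <-ctx.Done():
		return "", ctx.Err()
	}

	select {
	case r := <-results:
		return r.val, r.err // first of the two attempts wins
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func main() {
	// Hypothetical backend whose latency varies, sometimes past the hedge point.
	backend := func(ctx context.Context) (string, error) {
		d := time.Duration(rand.Int63n(int64(120 * time.Millisecond)))
		select {
		case <-time.After(d):
			return fmt.Sprintf("ok after %v", d), nil
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
	val, err := hedged(context.Background(), backend)
	fmt.Println(val, err)
}
```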
💡 Key Takeaways
•Set percentile-based SLOs (p99 ≤ 200 ms) and decompose into per-hop budgets: 20 ms network RTT, 5 to 10 ms cache, 40 to 60 ms compute, with headroom for fanout; enforce via timeouts and propagated deadlines
•Use Little's Law for capacity planning: 15,000 RPS at 70 ms p95 latency requires roughly 1,050 requests in flight (15,000 × 0.07); size thread pools, connection pools, and async slots to match concurrency needs
•Keep steady-state utilization below 70% to 80% for stable tail latency; implement admission control that rejects requests early when approaching saturation; prefer multiple independent queues over a single FIFO to avoid head-of-line blocking
•Bound queue lengths to cap latency: if one hop has 50 ms service time and the SLO allows 200 ms total, the queue should not exceed 3 to 4 items or the budget is violated
•BDP-aware tuning: on a 5 Gbps path with 60 ms RTT, BDP is 37.5 MB; socket buffers and flow-control windows smaller than the BDP cap throughput below link capacity regardless of available bandwidth (see the socket-buffer sketch after this list)
•Hedged requests: issue duplicate after p95 or p99 threshold, cap at 1 backup per original, use jitter to avoid synchronized bursts, ensure idempotency; trades 10% to 20% more load for 2× to 5× tail latency improvement
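A BDP-aware buffer-sizing sketch in Go; the endpoint is hypothetical and the buffer size derives from the 5 Gbps × 60 ms example above. The kernel may clamp requested sizes to its configured maximums (net.core.rmem_max / wmem_max on Linux), so those sysctls may also need raising on high-BDP paths.

```go
package main

import (
	"fmt"
	"net"
)

// BDP = bandwidth × RTT. For the 5 Gbps / 60 ms example:
// 5e9 bits/s ÷ 8 × 0.060 s = 37,500,000 bytes ≈ 37.5 MB.
const (
	linkBitsPerSec = 5_000_000_000
	rttMillis      = 60
	bdpBytes       = linkBitsPerSec / 8 * rttMillis / 1000
)

func main() {
	fmt.Printf("BDP = %d bytes (%.1f MB)\n", bdpBytes, float64(bdpBytes)/1e6)

	// Hypothetical endpoint; any high-bandwidth, high-latency peer applies.
	conn, err := net.Dial("tcp", "example.com:443")
	if err != nil {
		fmt.Println("dial:", err)
		return
	}
	defer conn.Close()

	tcp := conn.(*net.TCPConn)
	// Request BDP-sized kernel buffers; sizes below BDP cap throughput
	// below link capacity no matter how much bandwidth is available.
	if err := tcp.SetReadBuffer(bdpBytes); err != nil {
		fmt.Println("set read buffer:", err)
	}
	if err := tcp.SetWriteBuffer(bdpBytes); err != nil {
		fmt.Println("set write buffer:", err)
	}
}
```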
📌 Examples
End-to-end budget: a 200 ms user-facing SLO allocates 20 ms edge RTT + 10 ms cache + 60 ms compute + 2 parallel calls at 30 ms p99 each with hedging after 40 ms; compress payloads only when the size savings justify the added serialization latency
Capacity planning: median service time of 5 ms means μ = 200 requests/second per core; 16 cores yield 3,200 requests/second; target peak at 2,200 requests/second (69% utilization) or autoscale earlier to maintain p99 SLO
Adaptive batching: batch when the queue is non-empty to amortize overhead; dispatch immediately when the queue is empty; flush every 10 ms regardless, bounding per-item queuing latency at 10 ms
Load shedding: protect low-latency interactive flows by prioritizing them over background throughput-heavy batch jobs; use separate queues and resource pools; drop non-critical work first under overload, as in the sketch below
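A minimal Go sketch of class-based shedding, assuming two in-process work classes; queue capacities and job names are illustrative. Interactive work always drains first, and batch work is what gets rejected when its bounded queue fills.

```go
package main

import (
	"fmt"
	"time"
)

// Separate bounded queues per traffic class: interactive work can never be
// stuck behind batch work, and batch is what gets dropped under overload.
var (
	interactive = make(chan string, 64)
	batch       = make(chan string, 1024)
)

func handle(job string) {
	time.Sleep(time.Millisecond) // simulated work
	fmt.Println("done:", job)
}

// worker drains interactive work first; batch jobs run only when no
// interactive work is pending.
func worker() {
	for {
		select {
		case job := <-interactive:
			handle(job)
		default:
			select {
			case job := <-interactive:
				handle(job)
			case job := <-batch:
				handle(job)
			}
		}
	}
}

// submitBatch sheds non-critical work first: when the batch queue is full,
// the job is dropped (or persisted for later) rather than competing with
// interactive traffic.
func submitBatch(job string) bool {
	select {
	case batch <- job:
		return true
	default:
		return false
	}
}

func main() {
	go worker()
	interactive <- "user-request"
	if !submitBatch("report-job") {
		fmt.Println("dropped: report-job")
	}
	time.Sleep(20 * time.Millisecond) // demo only: let the worker drain
}
```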