
Performance, Capacity, and Little's Law

Performance and capacity are coupled through queueing effects, and Little's Law gives the mathematical relationship: concurrency equals throughput multiplied by latency. If your service processes 10,000 requests per second (RPS) with 50 milliseconds average latency, you have 500 concurrent requests in flight at any moment. If latency rises to 100 milliseconds at the same throughput, concurrency doubles to 1,000. This is why even small latency increases drive concurrency upward and, at high request rates, can push a system into meltdown.

Tail latencies dominate user experience and compound across microservices. If you have five microservices in a call chain and each has a p99 latency of 200 milliseconds independent of the others, the overall p99 approaches 1 second due to tail amplification. A practical goal is to keep p99 latency under five times p50, and to provision 20 to 40 percent headroom above expected peak load for diurnal traffic patterns and failover scenarios. Without this buffer, even small spikes push you into the non-linear region where latency explodes.

Network physics imposes hard limits that no amount of optimization can overcome. Round Trip Time (RTT) within a single cloud region is typically 0.3 to 2 milliseconds; cross-country in the United States is 40 to 80 milliseconds; intercontinental links are 80 to 150 milliseconds. Strong consistency protocols that require synchronous cross-region coordination must budget for these latencies. Google Spanner adds approximately 5 to 10 milliseconds of commit-wait time on top of Wide Area Network (WAN) round trips, so multi-continent transactions routinely see 100+ milliseconds of total commit latency.
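To make the first two points concrete, here is a minimal Python sketch using only the numbers quoted above. It applies Little's Law directly, then shows why independent per-hop p99 budgets behave more like a chain-level p95:

```python
def littles_law_concurrency(throughput_rps: float, latency_s: float) -> float:
    """Little's Law: average concurrency = throughput x average latency."""
    return throughput_rps * latency_s

# 10,000 RPS at 50 ms vs. 100 ms average latency.
print(littles_law_concurrency(10_000, 0.050))  # -> 500.0 requests in flight
print(littles_law_concurrency(10_000, 0.100))  # -> 1000.0 requests in flight

# Tail amplification: five sequential hops, each independently meeting
# a p99 of 200 ms.
hops, per_hop_p99_ms = 5, 200

p_all_fast = 0.99 ** hops  # probability every hop stays under its own p99
print(f"P(all hops under their p99): {p_all_fast:.3f}")     # ~0.951
print(f"P(at least one slow hop):    {1 - p_all_fast:.3f}")  # ~0.049

# To guarantee a chain-level p99, budget the sum of per-hop p99s:
print(f"chain p99 budget: {hops * per_hop_p99_ms} ms")  # 1000 ms
```

Roughly 5 percent of requests hit at least one hop's tail, so a per-hop p99 budget delivers only about p95 end to end; guaranteeing a chain-level p99 means budgeting for the sum of per-hop p99s, which is where the 1-second figure comes from.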
💡 Key Takeaways
Little's Law (Concurrency = Throughput × Latency) explains cascading failures: at 10,000 RPS, latency growing from 50 ms to 500 ms increases concurrency from 500 to 5,000 in-flight requests, exhausting connection pools and memory
Tail latencies amplify across service hops: five services, each with an independent p99 of 200 ms, combine to nearly 1 second overall p99, making tail latency budgets critical in microservice architectures
Provision 20 to 40 percent headroom above peak traffic to stay in the linear performance region; without this buffer, queueing effects cause latency to grow non-linearly and systems enter unstable overload (see the sketch after this list)
Network RTT sets absolute lower bounds: intra-region 0.3 to 2 ms, cross-country 40 to 80 ms, intercontinental 80 to 150 ms; synchronous cross-region consistency adds another 5 to 10 ms of commit wait, making global strong consistency inherently slow
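The headroom guidance follows directly from queueing behavior. Here is a minimal sketch assuming a single-server M/M/1 queue with a hypothetical service rate of 1,000 RPS (an illustrative model, not a description of any particular system), showing why latency is flat in the linear region and explodes near saturation:

```python
# M/M/1 queue: average time in system W = 1 / (mu - lambda), where
# mu is the service rate and lambda the arrival rate (both in req/s).
# Utilization rho = lambda / mu; W blows up as rho approaches 1.
service_rate = 1_000.0  # assumed capacity of one server, req/s

for utilization in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99):
    arrival_rate = utilization * service_rate
    latency_ms = 1_000.0 / (service_rate - arrival_rate)
    print(f"rho = {utilization:.2f} -> avg latency {latency_ms:7.1f} ms")

# rho=0.50 -> 2 ms, rho=0.80 -> 5 ms, rho=0.95 -> 20 ms, rho=0.99 -> 100 ms:
# running at 60-80% utilization (20-40% headroom) keeps you on the flat
# part of the curve; a small spike from 0.95 to 0.99 quintuples latency.
```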
📌 Examples
Google Spanner uses TrueTime with approximately 7 ms uncertainty windows and commit wait, causing multi-continent transactions to routinely exceed 100 ms total latency due to WAN RTT plus synchronization overhead (see the budget sketch below)
WhatsApp handles over 100 billion messages per day (roughly 1.16 million messages per second on average) using an asynchronous, partitioned architecture with backpressure to avoid synchronous cross-region coordination and keep latency low
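As a back-of-the-envelope check on the Spanner figure, here is a hedged sketch of the latency floor for a synchronous multi-continent commit; the ranges are the illustrative RTT and commit-wait numbers from this section, and a real commit may need more than one round trip:

```python
# Floor on synchronous cross-continent commit latency: at least one WAN
# round trip plus commit wait for clock uncertainty. Ranges below are
# the illustrative figures from this section, not measurements.
intercontinental_rtt_ms = (80, 150)
commit_wait_ms = (5, 10)

low = intercontinental_rtt_ms[0] + commit_wait_ms[0]
high = intercontinental_rtt_ms[1] + commit_wait_ms[1]
print(f"commit latency floor: {low}-{high} ms")  # 85-160 ms
# No application-level optimization gets under this network floor, which
# is why multi-continent transactions routinely see 100+ ms commits.
```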