Design Fundamentals • Latency vs Throughput
What Are Latency and Throughput? Core Definitions and Measurement
Latency is the time to complete a single operation end to end: the delay from when a request is sent until its response is received. It should always be measured in percentiles (p50, p95, p99) rather than averages because user experience and system fragility are driven by tail behavior. A system with 10 ms average latency but 500 ms p99 latency will feel broken to 1% of users. Latency has irreducible physical components like speed of light limits (fiber propagation is roughly 200,000 km/s, so New York to London takes 27 to 30 ms one way) and device access times (SSD random reads are 100 to 200 microseconds, spinning disk seeks are 5 to 10 ms). Variable components include queuing delays, lock contention, garbage collection pauses, and network retransmissions.
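To see why averages hide tail behavior, here is a minimal sketch with a hypothetical workload (the `percentile` helper and the 98%/2% split are illustrative, not from the source):

```python
# Sketch: why percentiles, not averages, describe latency.
# Hypothetical workload: 98% of requests take ~10 ms, 2% hit a 500 ms slow path.
samples_ms = [10.0] * 9_800 + [500.0] * 200

def percentile(data, p):
    """Nearest-rank percentile: value at rank round(p% of n) in sorted order."""
    s = sorted(data)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

avg = sum(samples_ms) / len(samples_ms)
p50 = percentile(samples_ms, 50)   # 10.0 ms: the median looks healthy
p95 = percentile(samples_ms, 95)   # 10.0 ms: still hides the tail
p99 = percentile(samples_ms, 99)   # 500.0 ms: what the slowest users actually feel
# avg is ~19.8 ms, which describes almost nobody's experience
```

The average sits between the two modes and describes neither; only the p99 exposes the slow path that makes the system feel broken to a slice of users.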
Throughput is the amount of work done per unit time: requests per second, bytes per second, or transactions per second. It reflects system capacity, parallelism, and efficiency. These two metrics interact through fundamental relationships. Little's Law states that concurrency equals throughput times latency. If you want to serve 10,000 requests per second at 50 ms latency, you need roughly 500 requests in flight somewhere in the system (10,000 × 0.05 = 500). The bandwidth delay product (BDP) governs network throughput: BDP equals bandwidth times round trip time. On a 10 Gbps path with 80 ms RTT, BDP is 100 MB. If your TCP window or application buffer is only 16 MB, you will cap out at 1.6 Gbps regardless of link capacity because you cannot keep enough data in flight to saturate the pipe.
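Both relationships reduce to one-line arithmetic. A minimal sketch of the numbers from the paragraph above (function names are illustrative):

```python
# Sketch: Little's Law and bandwidth-delay product as plain arithmetic.

def littles_law_concurrency(throughput_rps: float, latency_s: float) -> float:
    """Little's Law: average requests in flight = throughput x latency."""
    return throughput_rps * latency_s

def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the pipe."""
    return bandwidth_bps / 8 * rtt_s

def window_capped_bps(window_bytes: float, rtt_s: float) -> float:
    """Throughput ceiling when at most window_bytes can be unacknowledged per RTT."""
    return window_bytes * 8 / rtt_s

in_flight = littles_law_concurrency(10_000, 0.050)  # 500 concurrent requests
bdp = bdp_bytes(10e9, 0.080)                        # 1e8 bytes = 100 MB
capped = window_capped_bps(16e6, 0.080)             # ~1.6e9 bps = 1.6 Gbps
```

Note that the window cap and the BDP are the same formula read in opposite directions: BDP tells you how much data must be in flight to fill the link, and the window cap tells you what throughput a smaller in-flight limit buys you.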
Amazon has reported that adding 100 ms of latency costs around 1% of sales, directly linking p95 and p99 latency to revenue. Google experiments showed that an artificial 500 ms delay reduced user traffic by roughly 20%. These real world numbers demonstrate why latency percentiles matter more than averages and why companies set strict per hop latency budgets. Round trip time multipliers are critical: any protocol requiring multiple handshakes (a full TLS 1.2 handshake takes 2 RTTs before data flows; TLS 1.3 cuts this to 1) multiplies RTT into user visible latency. On a New York to London path with 60 ms RTT, three round trips add 180 ms before useful work begins.
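The round trip multiplier is simple enough to compute directly. A sketch under the assumptions above (1 RTT for the TCP handshake plus 2 for a full TLS 1.2 handshake; the edge POP RTT is illustrative):

```python
# Sketch: handshake round trips multiply directly into startup latency.

def startup_latency_ms(rtt_ms: float, round_trips: int) -> float:
    """Wall-clock handshake time before the first byte of useful data."""
    return rtt_ms * round_trips

# NYC-to-London path (~60 ms RTT): TCP (1 RTT) + full TLS 1.2 (2 RTTs)
transatlantic = startup_latency_ms(60, 3)   # 180 ms before data flows
edge_pop = startup_latency_ms(15, 3)        # same protocol from a nearby edge POP: 45 ms
```

This is why shaving round trips (TLS 1.3, connection reuse, session resumption) and shaving RTT (edge POPs, co-location) multiply together: either lever alone cuts startup cost linearly.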
💡 Key Takeaways
• Latency is measured in milliseconds using percentiles (p50, p95, p99) because tail behavior drives user experience; Amazon loses 1% of sales per 100 ms added latency
• Physical latency floors include speed of light propagation (NYC to London fiber is 27 to 30 ms one way) and device access times (SSD random reads 100 to 200 microseconds, spinning disk seeks 5 to 10 ms)
• Throughput measures capacity in requests per second, bytes per second, or transactions per second and reflects parallelism and efficiency
• Little's Law relates concurrency to throughput and latency: concurrency = throughput × latency; serving 10,000 RPS at 50 ms latency requires roughly 500 requests in flight
• Bandwidth delay product (BDP = bandwidth × RTT) determines network throughput capacity; on a 10 Gbps link with 80 ms RTT, BDP is 100 MB, so buffers smaller than 100 MB will cap throughput below link capacity
• Round trip time multipliers compound quickly: protocols requiring 3 RTTs on a 60 ms path add 180 ms startup cost before data flows; reducing RTT by co-locating services or using edge POPs has multiplicative impact
📌 Examples
TCP window cap: With a 64 KB flow control window and 100 ms RTT, max throughput is 64 KB / 0.1 s = 640 KB/s or roughly 5.1 Mbps, regardless of available bandwidth
Google Web Search: artificial 500 ms delay reduced user traffic by 20%, motivating aggressive fanout reduction and hedged requests to protect p99 latency
Cross region RTT: NYC to London fiber has 55 to 65 ms RTT; pushing content to a metro edge POP with 10 to 20 ms RTT reduces startup cost by 3× for protocols with multiple round trips
Netflix HD streaming: each stream needs sustained 3 to 7 Mbps with startup latency targets around 1 to 2 seconds; deploying caches inside ISP networks reduces user to cache RTT to single digit milliseconds in metro areas
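The TCP window cap in the examples above follows from the same window-over-RTT arithmetic. A sketch, using decimal kilobytes as the example does:

```python
# Sketch: checking the 64 KB window / 100 ms RTT example.
window_bytes = 64 * 1000          # 64 KB flow-control window (decimal KB)
rtt_s = 0.100                     # 100 ms round trip
throughput_Bps = window_bytes / rtt_s        # 640,000 bytes per second
throughput_mbps = throughput_Bps * 8 / 1e6   # ~5.1 Mbps, regardless of link speed
```

This is exactly why TCP window scaling (and application-level buffer tuning) exists: without it, a long fat pipe sits mostly idle waiting for acknowledgments.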