What Are Latency and Throughput? Core Definitions and Measurement
Why These Metrics Matter
Think of a highway: latency is how long your car takes to travel from point A to point B, while throughput is how many cars pass a checkpoint per hour. A highway with high throughput (six lanes) can still have high latency (traffic jam). Optimizing one often hurts the other.
Users feel latency directly. A 100ms response feels instant, 300ms feels sluggish, and 1000ms feels broken. Meanwhile, throughput determines whether your system survives traffic spikes. A system that can sustain only 1,000 requests per second will fall over under a 10,000 RPS spike, no matter how fast each individual request completes.
Measuring Latency Correctly
Average latency lies. If 99 requests take 10ms and 1 request takes 1000ms, the average is 19.9ms, yet 1% of users wait a full second. Use percentiles instead: p50 (median), p95, and p99.
p50: Half of requests are faster than this. Shows typical experience.
p95: 95% of requests are faster. Shows what most users experience.
p99: 99% of requests are faster. Shows worst case for nearly everyone.
A healthy API might show: p50=15ms, p95=45ms, p99=120ms. If your p99 is 10x your p50, you have tail latency problems that compound in distributed systems: a request that fans out to many backend services hits at least one service's tail far more often than 1% of the time.
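The percentile idea above can be sketched in a few lines. This is a minimal nearest-rank implementation against a hypothetical sample (the 1,000-request dataset and its 2% slow path are illustrative, not from the original):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-indexed rank
    return ordered[rank - 1]

# Hypothetical sample: 1,000 requests, 2% of which hit a slow path.
latencies_ms = [10] * 980 + [1000] * 20

avg = sum(latencies_ms) / len(latencies_ms)
print(f"average = {avg:.1f}ms")                    # 29.8ms -- looks fine
print(f"p50 = {percentile(latencies_ms, 50)}ms")   # 10ms
print(f"p95 = {percentile(latencies_ms, 95)}ms")   # 10ms
print(f"p99 = {percentile(latencies_ms, 99)}ms")   # 1000ms -- the tail the average hides
```

The average (29.8ms) looks healthy while p99 exposes the full one-second tail, which is exactly why dashboards should plot percentiles, not means.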
Measuring Throughput
Throughput is measured in operations per second: RPS (requests per second), QPS (queries per second), or TPS (transactions per second). The maximum sustainable throughput is your system's capacity.
Peak throughput is misleading. A system might handle 10,000 RPS briefly but only sustain 5,000 RPS before queues grow unbounded and latency explodes. Always measure sustained throughput under steady-state conditions, not burst capacity.
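The "queues grow unbounded" failure mode is easy to see in a toy discrete-time model. This sketch assumes illustrative numbers (a server that sustains 5,000 RPS receiving a steady 6,000 RPS); the backlog grows by the difference every second, and queueing delay grows with it:

```python
service_rate = 5_000   # sustainable capacity (RPS, assumed)
arrival_rate = 6_000   # offered load (RPS, assumed) -- 20% over capacity

queue = 0
for second in range(1, 6):
    queue += arrival_rate - service_rate   # backlog grows by 1,000 requests/s
    wait_s = queue / service_rate          # time for a new arrival to drain the backlog
    print(f"t={second}s  queue={queue:>5}  queueing delay ~ {wait_s:.1f}s")
```

After five seconds of a mere 20% overload, a newly arriving request already waits a full second in queue before being served, and the wait keeps climbing linearly. This is why sustained throughput, not burst capacity, defines a system's real limit.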