
Trade-Offs Between Latency and Throughput in System Design Decisions

Every major system design decision involves explicit trade-offs between latency and throughput. Consistency models provide a clear example: linearizable writes require coordination latency through quorum protocols or commit-wait mechanisms. Google Spanner adds roughly 7 ms of commit wait to every write transaction to satisfy TrueTime uncertainty bounds and provide external consistency. This latency cost buys strong consistency guarantees, while Spanner scales throughput horizontally through partitioning and locality-aware placement. In contrast, eventually consistent systems like Cassandra favor high write throughput using LSM trees that batch writes to disk, but this creates read amplification (reads must check multiple SSTables) and sometimes higher read latency. The choice depends on application requirements: financial transactions typically require strong consistency and accept the latency cost, while social media feeds can tolerate eventual consistency to maximize write throughput.

Data locality and replication present another fundamental trade-off. Geographic replication improves local read latency and throughput by serving users from nearby replicas: a user in Tokyo reading from a Tokyo replica experiences 5 to 10 ms latency instead of 150+ ms to a US data center. However, synchronous cross-region replication for writes inflates latency dramatically. A synchronous write quorum across US East, US West, and Europe adds 50 to 150 ms per write due to cross-continental round trips. Asynchronous replication boosts write throughput and reduces write latency but introduces replication lag (staleness windows of seconds to minutes) and potential data loss on failover. Netflix uses asynchronous replication for most data, accepting eventual consistency to serve high-throughput streaming workloads, while banking systems use synchronous replication within a region and careful failover procedures to protect against data loss.

Concurrency and parallelism provide throughput gains but create queuing and contention that hurt tail latency. Raising thread pool sizes or connection counts increases throughput by allowing more work in parallel, but beyond a knee point (typically 70% to 80% utilization) you hit contention on shared resources (locks, memory bandwidth, network queues) and tail latency spikes. The optimal operating point requires headroom: run at 50% to 70% of peak capacity in steady state and use autoscaling or load shedding to handle bursts.

Protocol chattiness trades payload efficiency for latency: sending many small requests keeps each individual request fast but increases per-request overhead and can underutilize links. Coalescing multiple requests into one round trip improves throughput but adds batching delay. HTTP/2 multiplexing reduces head-of-line blocking compared to HTTP/1.1 but introduces new tail latency risks if streams are not prioritized correctly.
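To make the round-trip cost concrete, here is a small back-of-envelope sketch (illustrative only, not a measurement from this article) that estimates startup latency as sequential round trips multiplied by path RTT; the function name and RTT classes are assumptions chosen to match the figures discussed here.

```python
# Back-of-envelope sketch: startup latency from sequential round trips.
# Illustrative only -- real protocols overlap work, retransmit, and add
# server-side processing time on top of pure network RTT.

def startup_latency_ms(round_trips: int, rtt_ms: float) -> float:
    """Network latency contributed by sequential round trips (ignores server time)."""
    return round_trips * rtt_ms

if __name__ == "__main__":
    # Rough RTT classes: same rack, same region, cross-region, cross-continent.
    for label, rtt_ms in [("same rack", 0.5), ("same region", 10),
                          ("cross-region", 60), ("cross-continent", 150)]:
        chatty = startup_latency_ms(3, rtt_ms)     # e.g. three dependent small requests
        coalesced = startup_latency_ms(1, rtt_ms)  # the same work coalesced into one round trip
        print(f"{label:>15}: 3 RTTs = {chatty:6.1f} ms, 1 RTT = {coalesced:6.1f} ms")
```

On a 60 ms cross-region path the chatty version already spends 180 ms in the network before any payload arrives, which is why coalescing and pipelining pay off most on high-RTT links.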
💡 Key Takeaways
Consistency models: linearizable writes add coordination latency (Google Spanner commit wait ~7 ms) but provide strong guarantees; eventual consistency maximizes write throughput (Cassandra LSM trees) at the cost of read amplification and staleness
Geographic replication: local replicas cut read latency from 150+ ms cross-continental to 5 to 10 ms local, but synchronous cross-region writes add 50 to 150 ms; asynchronous replication boosts throughput but risks data loss and replication lag
Concurrency trade-off: raising parallelism increases throughput until 70% to 80% utilization, then contention causes tail latency spikes; operate at 50% to 70% capacity for stable p99 latency and autoscale or shed load beyond that (see the queueing sketch after this list)
Batching vs interactivity: batching increases throughput through amortized overhead and better device utilization but adds up to one batch window of per-item latency; streaming records one at a time reduces latency but can underutilize resources
Caching: edge and memory caches reduce read latency dramatically (cache hit in microseconds vs backend round trip in milliseconds) and offload backend throughput, but cause inconsistency windows, cache stampedes, and uneven load without careful invalidation
Protocol chattiness: reducing round trips lowers latency disproportionately on high-RTT paths (3 RTTs on a 60 ms path = 180 ms startup); coalescing and pipelining increase throughput but can add head-of-line blocking if not designed carefully
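The concurrency takeaway above points to a knee near 70% to 80% utilization. A minimal M/M/1 queueing sketch (a deliberate simplification, assumed here for illustration rather than taken from this article) shows why latency grows nonlinearly as utilization approaches saturation.

```python
# Minimal M/M/1 queueing sketch (a simplification, assumed for illustration):
# mean time in system = service_time / (1 - utilization), so latency grows
# nonlinearly as utilization approaches 100%.

def mean_latency_ms(service_time_ms: float, utilization: float) -> float:
    """Mean time in system for an M/M/1 queue at a given utilization in [0, 1)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1.0 - utilization)

if __name__ == "__main__":
    service_time_ms = 5.0  # hypothetical per-request service time
    for rho in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
        print(f"utilization {rho:4.0%}: mean latency ~ {mean_latency_ms(service_time_ms, rho):6.1f} ms")
```

At 50% utilization mean latency is about twice the bare service time; past 80% it climbs steeply, and real tail (p99) latencies degrade even faster than this mean-value model suggests, which is why the takeaway recommends operating with headroom.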
📌 Examples
Financial transactions: require linearizable consistency and accept 7 to 20 ms coordination latency cost; social media feeds use eventual consistency to maximize write throughput and serve millions of posts per second
Netflix streaming: uses asynchronous replication and eventual consistency to serve high throughput (each 4K stream needs 15 to 25 Mbps sustained); banking systems use synchronous replication within region to prevent data loss despite latency cost
Cache hit vs miss: memory cache hit serves in 1 to 10 microseconds; cache miss requires backend round trip of 10 to 100+ milliseconds depending on distance and load, showing 1000× to 10000× latency difference
TCP coalescing trade-off: Nagle's algorithm coalesces small writes, improving throughput on bulk transfers but adding tens of milliseconds to chatty RPCs; disabling it with TCP_NODELAY reduces interactive latency at the cost of more packets and CPU overhead (see the socket sketch below)
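As a concrete illustration of that last trade-off, here is a minimal sketch of how an application might disable Nagle's algorithm on a client socket; the function name is illustrative and any endpoint passed to it is a placeholder, not something from this article.

```python
import socket

# Minimal sketch: disable Nagle's algorithm (TCP_NODELAY) for chatty,
# latency-sensitive RPC traffic.

def connect_low_latency(host: str, port: int) -> socket.socket:
    """Open a TCP connection tuned for small, interactive writes."""
    sock = socket.create_connection((host, port), timeout=5.0)
    # Send each small write immediately instead of coalescing it with later
    # writes: lower interactive latency, more packets and per-packet overhead.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

Bulk-transfer paths would typically leave Nagle enabled (the default), since coalescing small writes into fewer, fuller packets favors throughput over per-write latency.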