
Failure Modes in Distributed SQL Systems

The most fundamental failure mode is loss of quorum in a consensus group. If a data range has three replicas and two of the regions hosting them fail or lose connectivity, that range becomes completely unavailable for reads and writes even though one replica is still alive. This is an intentional trade-off to preserve consistency: serving from a minority could violate serializability. Google mitigates this for critical datasets by using five- or seven-replica configurations spread across many zones, accepting higher replication costs to reduce the probability of quorum loss (the quorum arithmetic is sketched below).

Network partitions and tail latency spikes create cascading problems. During an incident, if cross-region latency suddenly jumps from 50 milliseconds to 500 milliseconds, all global write transactions slow proportionally because consensus requires cross-region round trips. Transaction timeouts start firing, applications retry, and the retry storm can amplify the outage. Uber has documented cascading failures where client retries overwhelmed healthy regions after a partial outage. Careful timeout configuration, exponential backoff, and circuit breakers in application code are essential defenses (a minimal backoff-and-breaker sketch appears below).

Clock issues represent a correctness risk. In Spanner, if TrueTime uncertainty suddenly grows because of GPS signal loss or atomic clock failures, the system must increase commit wait times to maintain external consistency, so transactions slow down in proportion to the uncertainty (a toy model of this proportionality appears below). In the extreme case where the uncertainty becomes unacceptable, writes may be rejected entirely. In CockroachDB, if a node's physical clock drifts beyond the configured maximum offset (typically 500 milliseconds), that node shuts itself down rather than risk violating consistency. This protects correctness but can turn into an availability problem if clock synchronization infrastructure fails broadly.

Hotspots cause performance degradation. If many clients write to the same key range, for example by incrementing a global counter or updating a single popular user's record, one consensus group becomes saturated. Write throughput to that range is limited by the leader replica's CPU and disk I/O: even though the cluster has thousands of nodes, the hotspot can only handle what one node can process. Systems mitigate this with automatic range splitting and load-based rebalancing, but extremely concentrated workloads may still bottleneck (a sharded-counter workaround is sketched below).

Long-running transactions and large schema changes create operational challenges. A transaction that runs for minutes keeps old MVCC versions alive, bloating storage, slowing garbage collection, and increasing read latencies as queries must scan through more versions. Schema migrations on large tables must coordinate across thousands of nodes and can take hours, during which strict availability targets become harder to maintain. Both require careful application design and operational planning to avoid impacting production traffic; batching bulk mutations, as sketched below, is one common pattern.
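As a rough illustration of the quorum arithmetic above, the sketch below computes the majority size and failure tolerance for a given replication factor; the numbers match the 3- versus 5/7-replica trade-off described in the text. The helper names are illustrative, not taken from any particular system.

```go
package main

import "fmt"

// quorum returns the majority size needed for a consensus group of n replicas.
func quorum(n int) int { return n/2 + 1 }

// tolerated returns how many replica failures a group of n replicas can
// survive while still forming a majority.
func tolerated(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("replicas=%d  quorum=%d  tolerated failures=%d\n",
			n, quorum(n), tolerated(n))
	}
	// replicas=3  quorum=2  tolerated failures=1
	// replicas=5  quorum=3  tolerated failures=2
	// replicas=7  quorum=4  tolerated failures=3
}
```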
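A minimal sketch of the client-side defenses against retry storms: exponential backoff with full jitter plus a crude circuit breaker that fails fast after repeated errors instead of piling retries onto a struggling region. The breaker thresholds and the doQuery placeholder are assumptions for illustration, not the API of any real driver.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// breaker is a tiny circuit breaker: after maxFailures consecutive failures
// it rejects calls outright for a cooldown period.
type breaker struct {
	failures    int
	maxFailures int
	openUntil   time.Time
	cooldown    time.Duration
}

func (b *breaker) call(op func() error) error {
	if time.Now().Before(b.openUntil) {
		return errors.New("circuit open: failing fast")
	}
	if err := op(); err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.openUntil = time.Now().Add(b.cooldown)
			b.failures = 0
		}
		return err
	}
	b.failures = 0
	return nil
}

// retry wraps an operation with exponential backoff and full jitter, so a
// partial outage does not turn into a synchronized retry storm.
func retry(attempts int, base time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		sleep := time.Duration(rand.Int63n(int64(base << i))) // full jitter
		time.Sleep(sleep)
	}
	return err
}

func main() {
	b := &breaker{maxFailures: 5, cooldown: 10 * time.Second}
	doQuery := func() error { return errors.New("timeout") } // placeholder for a real query
	err := retry(4, 100*time.Millisecond, func() error { return b.call(doQuery) })
	fmt.Println("final result:", err)
}
```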
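The clock behavior can be illustrated with a toy model: a Spanner-style commit wait scales with the clock uncertainty ε, and a CockroachDB-style node refuses to keep running once measured offset exceeds its configured maximum. The 2ε factor and the function names are simplifications for illustration, not the systems' exact rules.

```go
package main

import (
	"fmt"
	"time"
)

// commitWait is a simplified model of Spanner-style commit wait: before a
// write is acknowledged, the leader waits long enough that the commit
// timestamp is guaranteed to be in the past on every clock. The wait grows
// with the clock uncertainty epsilon. (Simplified; not Spanner's exact rule.)
func commitWait(epsilon time.Duration) time.Duration {
	return 2 * epsilon
}

// maxOffsetExceeded loosely mirrors CockroachDB's self-protection: if a
// node's measured clock offset exceeds the configured maximum, the node
// shuts itself down rather than risk serving inconsistent results.
func maxOffsetExceeded(measured, maxOffset time.Duration) bool {
	return measured > maxOffset
}

func main() {
	for _, eps := range []time.Duration{5 * time.Millisecond, 50 * time.Millisecond} {
		fmt.Printf("uncertainty=%v  approx commit wait=%v\n", eps, commitWait(eps))
	}
	// uncertainty=5ms  approx commit wait=10ms
	// uncertainty=50ms approx commit wait=100ms

	fmt.Println("shut down node:",
		maxOffsetExceeded(600*time.Millisecond, 500*time.Millisecond)) // true
}
```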
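A standard mitigation for the single-hot-key problem is to shard a logical counter across many rows so writes spread over multiple ranges and leader replicas. The counter_shards table and column names below are a hypothetical schema (assumed to have a unique key on counter_id and shard); in practice these functions would be wired to a *sql.DB opened with a Postgres-compatible driver.

```go
package counter

import (
	"database/sql"
	"math/rand"
)

const numShards = 64 // more shards -> load spread across more ranges/leaders

// Increment bumps one randomly chosen shard of a logical counter instead of
// a single hot row, so writes land on many consensus groups.
func Increment(db *sql.DB, counterID string) error {
	shard := rand.Intn(numShards)
	_, err := db.Exec(
		`INSERT INTO counter_shards (counter_id, shard, value)
		 VALUES ($1, $2, 1)
		 ON CONFLICT (counter_id, shard)
		 DO UPDATE SET value = counter_shards.value + 1`,
		counterID, shard)
	return err
}

// Read sums all shards to recover the logical counter value.
func Read(db *sql.DB, counterID string) (int64, error) {
	var total int64
	err := db.QueryRow(
		`SELECT COALESCE(SUM(value), 0)::INT8
		   FROM counter_shards WHERE counter_id = $1`,
		counterID).Scan(&total)
	return total, err
}
```

The trade-off is that reads must aggregate numShards rows; the pattern fits write-heavy counters where an approximate or slightly slower read is acceptable.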
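To avoid the long-running-transaction problem, bulk mutations are commonly broken into small batches that each commit separately, so old MVCC versions can be garbage collected as the job progresses. The events table, its expires_at column, and the batch size below are illustrative assumptions.

```go
package cleanup

import (
	"context"
	"database/sql"
)

// DeleteExpiredInBatches removes expired rows in small, separately committed
// batches instead of one huge transaction, keeping MVCC garbage bounded.
// Each statement runs as its own implicit transaction.
func DeleteExpiredInBatches(ctx context.Context, db *sql.DB, batchSize int) (int64, error) {
	var total int64
	for {
		res, err := db.ExecContext(ctx,
			`DELETE FROM events
			 WHERE id IN (
			     SELECT id FROM events WHERE expires_at < now() LIMIT $1
			 )`, batchSize)
		if err != nil {
			return total, err
		}
		n, err := res.RowsAffected()
		if err != nil {
			return total, err
		}
		total += n
		if n < int64(batchSize) {
			return total, nil // fewer rows than a full batch: nothing left to delete
		}
	}
}
```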
💡 Key Takeaways
Quorum loss (losing 2 of 3 replicas) makes a range completely unavailable for reads and writes to preserve consistency; Google uses 5 to 7 replicas for critical data
Latency spikes cascade: 50ms to 500ms cross-region latency slows all global writes proportionally, triggering timeouts and retry storms that amplify outages
Clock failures force correctness-preserving degradation: increased TrueTime uncertainty slows Spanner commits, while excessive clock drift triggers CockroachDB node shutdowns
Hotspots limit throughput to single leader capacity: writes to one popular key saturate one consensus group regardless of total cluster size
Long-running transactions bloat MVCC storage and slow reads; large schema changes require hours of coordination across thousands of nodes
📌 Examples
Uber cascading failure: Partial outage triggers client retries, retry storm overwhelms healthy regions, entire system degrades
Spanner GPS failure: TrueTime uncertainty grows from 5ms to 50ms, commit wait increases proportionally, write latency jumps 10x
CockroachDB clock drift: Node clock drifts by 600ms, exceeding the 500ms max offset; node automatically shuts down to protect consistency
Global counter hotspot: 1000 node cluster, all clients increment counter key, throughput limited to single leader's 10K writes/sec