
Failure Modes in Distributed SQL Systems

Quorum Loss

The most fundamental failure mode is loss of quorum (the minimum number of replicas required to make decisions) in a consensus group. Distributed SQL typically replicates each range across 3 nodes, requiring 2 of 3 to agree on any write. If two replicas fail or lose network connectivity, that range becomes completely unavailable for reads and writes even though one replica remains alive. This is an intentional design choice to preserve consistency: serving from a minority could violate serializability (the guarantee that transactions appear to execute in a serial order) if the minority is stale.

For critical data, operators configure 5 or 7 replicas spread across multiple availability zones, accepting higher replication costs to reduce the probability of quorum loss. With 5 replicas, the system tolerates 2 failures; with 7, it tolerates 3. The trade-off is clear: more replicas mean more network traffic, higher storage costs, and slightly higher write latency, but a dramatically lower probability that any single range becomes unavailable.
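The quorum arithmetic above can be sketched in a few lines (a minimal illustration of the majority rule, not tied to any particular database's implementation):

```python
def quorum(replicas: int) -> int:
    """Minimum number of replicas that must agree for a write to commit."""
    return replicas // 2 + 1

def failures_tolerated(replicas: int) -> int:
    """Number of replica failures the range survives while staying available."""
    return replicas - quorum(replicas)

for n in (3, 5, 7):
    print(f"{n} replicas: quorum={quorum(n)}, tolerates {failures_tolerated(n)} failure(s)")
```

Note that even replica counts add cost without adding fault tolerance (4 replicas still tolerate only 1 failure), which is why 3, 5, and 7 are the standard choices.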

Network Partitions and Cascading Failures

Network partitions (when network failures split nodes into groups that cannot communicate) and tail latency spikes create cascading problems. During an incident, if cross-region latency suddenly jumps from 50ms to 500ms, all global write transactions slow proportionally because consensus requires cross-region round-trips. Transaction timeouts start firing, applications retry failed requests, and the retry storm amplifies the original problem by adding more load to already struggling infrastructure.

Cascading failures compound rapidly. Client retries overwhelm healthy regions that now must handle both normal traffic and retried requests. Exponential backoff (progressively increasing delay between retries) and circuit breakers (stopping requests when failure rate exceeds threshold) in application code are essential defenses. Without them, a partial outage in one region can cascade into complete system unavailability across all regions.
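Both defenses can be sketched together; the parameters here (100 ms base delay, breaker opening after 10 consecutive failures) are illustrative defaults, not values prescribed by any particular framework:

```python
import random
import time

class CircuitBreaker:
    """Opens after max_failures consecutive failures, blocking further calls."""
    def __init__(self, max_failures: int = 10):
        self.max_failures = max_failures
        self.consecutive_failures = 0

    @property
    def open(self) -> bool:
        return self.consecutive_failures >= self.max_failures

    def record(self, success: bool) -> None:
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1

def call_with_backoff(op, breaker: CircuitBreaker,
                      base_delay: float = 0.1, max_retries: int = 5):
    """Retry op with exponential backoff and jitter; fail fast if the breaker is open."""
    for attempt in range(max_retries):
        if breaker.open:
            raise RuntimeError("circuit open: failing fast instead of retrying")
        try:
            result = op()
            breaker.record(success=True)
            return result
        except Exception:
            breaker.record(success=False)
            # 100ms, 200ms, 400ms, ... with jitter so clients don't retry in lockstep
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))
    raise RuntimeError("retries exhausted")
```

The jitter factor matters in practice: without it, thousands of clients that failed at the same moment retry at the same moment, recreating the spike the backoff was meant to prevent.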

Clock Issues

Clock issues represent a correctness risk unique to distributed SQL. Systems using TrueTime depend on atomic clocks and GPS receivers to keep uncertainty low (typically 1 to 7 milliseconds). If GPS signal degrades or atomic clocks fail, uncertainty grows. When uncertainty reaches 50 milliseconds, commit wait times increase proportionally, slowing transactions 10x. In extreme cases where uncertainty becomes unacceptably high, writes may be rejected entirely to protect consistency.
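The commit-wait behavior can be modeled roughly as follows; the rejection threshold here is an assumption of this sketch, not Spanner's actual limit:

```python
def commit_wait_ms(epsilon_ms: float, max_epsilon_ms: float = 100.0) -> float:
    """TrueTime-style rule of thumb: after choosing a commit timestamp, the
    leader waits out the full clock-uncertainty interval (epsilon) so the
    timestamp is guaranteed to be in the past on every node. If uncertainty
    is too high, the write is rejected to protect consistency (the 100 ms
    cutoff is illustrative)."""
    if epsilon_ms > max_epsilon_ms:
        raise RuntimeError("clock uncertainty too high; rejecting write")
    return epsilon_ms

# Healthy clocks: ~5 ms of commit wait added to every write.
# Degraded clocks at 50 ms uncertainty: commit wait grows ~10x.
```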

Systems using Hybrid Logical Clocks enforce maximum clock drift between nodes, typically 500 milliseconds. If a node detects its clock has drifted beyond this threshold relative to other nodes, it immediately shuts itself down rather than risk violating consistency. This protects correctness but can trigger availability issues if NTP (Network Time Protocol, which synchronizes clocks across machines) fails broadly.
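A simplified model of that self-fencing check (the 500 ms limit matches the typical maximum offset mentioned above; the majority rule and offset-list interface are assumptions of this sketch):

```python
MAX_OFFSET_MS = 500.0  # typical configured maximum clock offset between nodes

def should_self_fence(peer_offsets_ms: list[float]) -> bool:
    """Return True if this node's measured clock offset exceeds the maximum
    relative to a majority of its peers, in which case it shuts itself down
    rather than risk serving inconsistent results (simplified model)."""
    if not peer_offsets_ms:
        return False
    over_limit = sum(1 for off in peer_offsets_ms if abs(off) > MAX_OFFSET_MS)
    return over_limit > len(peer_offsets_ms) / 2
```

Comparing against a majority of peers rather than a single peer prevents one node with a broken clock from causing healthy nodes to fence themselves.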

Hotspots and Long Transactions

Hotspots cause performance degradation when many clients write to the same key range. A global counter, a single popular user record, or an auto-incrementing primary key that always writes to the latest range can saturate one consensus group while thousands of other nodes sit idle. Write throughput to that range is limited by what one leader replica can process, regardless of total cluster size.
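The standard mitigation is to stop using monotonically increasing keys. A sketch of the key-design choices (the key formats and shard count here are hypothetical):

```python
import hashlib
import uuid

def sequential_key(n: int) -> str:
    """Anti-pattern: monotonically increasing keys always land in the newest
    range, so one leader replica absorbs all write traffic."""
    return f"order/{n:012d}"

def uuid_key() -> str:
    """Random UUIDs spread writes evenly across ranges, at the cost of
    losing cheap ordered scans."""
    return f"order/{uuid.uuid4()}"

def hash_prefixed_key(n: int, shards: int = 16) -> str:
    """Hash-prefixing spreads write load across `shards` ranges while
    keeping ordered scans possible within each shard."""
    shard = int(hashlib.sha256(str(n).encode()).hexdigest(), 16) % shards
    return f"order/{shard:02d}/{n:012d}"
```

Hash-prefixed keys are a middle ground: reads that need ordering fan out one scan per shard, while writes no longer pile onto a single consensus group.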

Long-running transactions create storage and performance problems. MVCC (Multi-Version Concurrency Control) keeps old versions alive while a transaction runs. A transaction running for minutes or hours prevents garbage collection of versions it might read, bloating storage and slowing queries that must scan through accumulated versions. Schema migrations on large tables must coordinate across thousands of nodes and can take hours, requiring careful planning to avoid impacting production traffic.
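A toy model of why a long-running transaction blocks version cleanup (the GC rule here is a deliberate simplification of real MVCC implementations):

```python
from dataclasses import dataclass, field

@dataclass
class MVCCStore:
    """Toy MVCC store: each key maps to a list of (timestamp, value) versions."""
    versions: dict = field(default_factory=dict)

    def put(self, key: str, ts: int, value: str) -> None:
        self.versions.setdefault(key, []).append((ts, value))

    def gc(self, oldest_active_read_ts: int) -> int:
        """Drop versions that no transaction reading at or after
        oldest_active_read_ts could still see; return the count removed.
        A long-running transaction pins this timestamp in the past,
        forcing every newer version to be retained."""
        removed = 0
        for key, vers in self.versions.items():
            vers.sort()
            keep_from = 0
            for i, (ts, _) in enumerate(vers):
                if ts <= oldest_active_read_ts:
                    keep_from = i  # newest version still visible at that timestamp
            removed += keep_from
            self.versions[key] = vers[keep_from:]
        return removed
```

With a transaction pinned at timestamp 2, versions written at 2 and later survive GC; only once that transaction commits can the oldest active timestamp advance and the backlog be collected.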

💡 Key Takeaways
- Quorum loss (losing 2 of 3 replicas) makes a range unavailable for reads and writes; using 5 or 7 replicas reduces the probability but increases replication costs
- Network partitions and latency spikes cascade: retry storms from client timeouts amplify partial outages into complete system unavailability
- Exponential backoff and circuit breakers in application code are essential defenses against cascading failures from retries overwhelming healthy nodes
- Clock failures force correctness-preserving degradation: TrueTime uncertainty growth slows commits, and excessive Hybrid Logical Clock drift triggers node shutdowns
- Hotspots limit throughput to single-leader capacity: writes to one popular key saturate one consensus group regardless of total cluster size
- Long-running transactions prevent MVCC garbage collection, bloating storage and slowing reads that must scan accumulated versions
📌 Interview Tips
1. Explain quorum math: 3 replicas need 2 for writes and tolerate 1 failure; 5 replicas need 3 and tolerate 2; 7 need 4 and tolerate 3
2. Discuss cascading defenses: exponential backoff starts at 100ms and doubles each retry (200ms, 400ms, 800ms); a circuit breaker opens after 10 consecutive failures
3. Mention hotspot mitigation: avoid auto-incrementing keys that always write to the latest range; use UUIDs or customer-prefixed keys for even distribution