Failure Modes in Distributed SQL Systems
Quorum Loss
The most fundamental failure mode is loss of quorum (the minimum number of replicas required to make decisions) in a consensus group. Distributed SQL systems typically replicate each range across three nodes, requiring two of the three to agree on any write. If two replicas fail or lose network connectivity, that range becomes completely unavailable for reads and writes, even though one replica remains alive. This is an intentional design choice to preserve consistency: serving from a minority could violate serializability (the guarantee that transactions appear to execute in a serial order) if the minority is stale.
For critical data, operators configure 5 or 7 replicas spread across multiple availability zones, accepting higher replication costs to reduce the probability of quorum loss. With 5 replicas, the system tolerates 2 failures; with 7, it tolerates 3. The trade-off is clear: more replicas mean more network traffic, higher storage costs, and slightly higher write latency, but a dramatically lower probability that any single range becomes unavailable.
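The replica counts above follow directly from majority-quorum arithmetic. A minimal sketch, assuming simple majority quorums as used by Raft/Paxos-style consensus (function names are illustrative):

```python
# Quorum size and fault tolerance as a function of replication factor,
# assuming simple majority quorums.

def quorum_size(replicas: int) -> int:
    """Minimum number of replicas that must agree on a write."""
    return replicas // 2 + 1

def failures_tolerated(replicas: int) -> int:
    """Replicas that can fail while the range stays available."""
    return replicas - quorum_size(replicas)

for n in (3, 5, 7):
    print(f"{n} replicas: quorum {quorum_size(n)}, "
          f"tolerates {failures_tolerated(n)} failures")
```

Note that even replica counts buy little: 4 replicas need a quorum of 3 and still tolerate only 1 failure, which is why odd factors are standard.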
Network Partitions and Cascading Failures
Network partitions (when network failures split nodes into groups that cannot communicate) and tail latency spikes create cascading problems. During an incident, if cross-region latency suddenly jumps from 50ms to 500ms, all global write transactions slow proportionally because consensus requires cross-region round-trips. Transaction timeouts start firing, applications retry failed requests, and the retry storm amplifies the original problem by adding more load to already struggling infrastructure.
Cascading failures compound rapidly. Client retries overwhelm healthy regions that now must handle both normal traffic and retried requests. Exponential backoff (progressively increasing delay between retries) and circuit breakers (stopping requests when failure rate exceeds threshold) in application code are essential defenses. Without them, a partial outage in one region can cascade into complete system unavailability across all regions.
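The two defenses named above can be sketched briefly. This is a minimal illustration, not a production client library; the constants, class name, and threshold/cooldown behavior are assumptions chosen for clarity:

```python
# Sketch: exponential backoff with full jitter, plus a minimal circuit
# breaker that opens after consecutive failures and half-opens after a
# cooldown. All names and defaults are illustrative.
import random
import time

BASE_DELAY = 0.1   # seconds
MAX_DELAY = 10.0

def backoff_delay(attempt: int) -> float:
    """Exponential backoff with full jitter to de-synchronize retries."""
    return random.uniform(0.0, min(MAX_DELAY, BASE_DELAY * 2 ** attempt))

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; allows a probe
    request (half-open) once `cooldown` seconds have elapsed."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when breaker opened

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit one probe after the cooldown elapses.
        return time.monotonic() - self.opened_at >= self.cooldown

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
```

The jitter matters as much as the exponent: if every client backs off by the same deterministic schedule, retries arrive in synchronized waves and the retry storm persists.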
Clock Issues
Clock issues represent a correctness risk unique to distributed SQL. Systems using TrueTime depend on atomic clocks and GPS receivers to keep uncertainty low (typically 1 to 7 milliseconds). If the GPS signal degrades or the atomic clocks fail, uncertainty grows. When uncertainty reaches 50 milliseconds, commit wait times increase proportionally, slowing commit latency roughly 10x relative to the normal few-millisecond case. In extreme cases where uncertainty becomes unacceptably high, writes may be rejected entirely to protect consistency.
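The proportional relationship between uncertainty and commit latency can be made concrete. A minimal sketch, assuming a Spanner-style design in which a committing transaction waits out roughly twice the instantaneous uncertainty bound (epsilon) before acknowledging; the function and numbers are illustrative:

```python
# Sketch: TrueTime-style commit wait. A commit is not acknowledged until
# the clock-uncertainty interval has safely passed, so commit latency
# grows linearly with uncertainty (epsilon). Illustrative only.

def commit_wait_ms(epsilon_ms: float) -> float:
    """Approximate commit wait: on the order of 2 * epsilon."""
    return 2.0 * epsilon_ms

print(commit_wait_ms(5.0))   # normal uncertainty: ~10 ms of commit wait
print(commit_wait_ms(50.0))  # degraded uncertainty: ~100 ms, a 10x slowdown
```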
Systems using Hybrid Logical Clocks enforce maximum clock drift between nodes, typically 500 milliseconds. If a node detects its clock has drifted beyond this threshold relative to other nodes, it immediately shuts itself down rather than risk violating consistency. This protects correctness but can trigger availability issues if NTP (Network Time Protocol, which synchronizes clocks across machines) fails broadly.
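The fail-fast behavior described above amounts to a simple bounds check on observed clock offsets. A minimal sketch, using the 500-millisecond bound from the text; the function names and the decision to merely print (rather than actually terminate) are illustrative:

```python
# Sketch: HLC-style drift check. A node compares a remote node's wall
# clock against its own; if the offset exceeds the configured bound, the
# node shuts itself down rather than risk a consistency violation.

MAX_OFFSET_MS = 500.0  # maximum tolerated clock drift between nodes

def drift_exceeded(remote_wall_ms: float, local_wall_ms: float,
                   max_offset_ms: float = MAX_OFFSET_MS) -> bool:
    """True if the two clocks disagree by more than the allowed bound."""
    return abs(remote_wall_ms - local_wall_ms) > max_offset_ms

# A real node would terminate the process here; the check is the point.
if drift_exceeded(remote_wall_ms=1700.0, local_wall_ms=1000.0):
    print("clock offset exceeds bound; node would self-terminate")
```

The failure amplification described in the text follows from this check: if NTP breaks cluster-wide, many nodes cross the bound at once and shut down together.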
Hotspots and Long Transactions
Hotspots cause performance degradation when many clients write to the same key range. A global counter, a single popular user record, or an auto-incrementing primary key that always writes to the latest range can saturate one consensus group while thousands of other nodes sit idle. Write throughput to that range is limited by what one leader replica can process, regardless of total cluster size.
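A standard mitigation for the hot-counter case is to shard the counter: spread increments across N keys so no single range leader is the bottleneck, and sum the shards on read. A minimal sketch against an in-memory stand-in for the database; the shard count and key layout are illustrative:

```python
# Sketch: sharded counter. Increments fan out across NUM_SHARDS keys,
# which the database can place in different ranges; reads aggregate
# all shards. Key naming scheme is illustrative.
import random

NUM_SHARDS = 16

def shard_key(counter_name: str) -> str:
    """Pick a random shard so increments spread across ranges."""
    return f"{counter_name}/shard/{random.randrange(NUM_SHARDS)}"

def read_total(store: dict, counter_name: str) -> int:
    """Reads must sum every shard to recover the true count."""
    return sum(store.get(f"{counter_name}/shard/{i}", 0)
               for i in range(NUM_SHARDS))

# Usage against an in-memory dict standing in for the key-value store:
store = {}
for _ in range(1000):
    k = shard_key("page_views")
    store[k] = store.get(k, 0) + 1
print(read_total(store, "page_views"))  # 1000
```

The trade is write scalability for read cost: increments parallelize across shards, but every read now touches N keys. The same idea motivates random or hash-prefixed primary keys in place of auto-incrementing ones.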
Long-running transactions create storage and performance problems. MVCC (Multi-Version Concurrency Control) keeps old versions alive while a transaction runs. A transaction running for minutes or hours prevents garbage collection of versions it might read, bloating storage and slowing queries that must scan through accumulated versions. Schema migrations on large tables must coordinate across thousands of nodes and can take hours, requiring careful planning to avoid impacting production traffic.
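The interaction between long transactions and MVCC garbage collection comes down to a watermark: versions newer than the oldest active transaction's start timestamp cannot be reclaimed. A minimal sketch of that rule, assuming a simple TTL-based GC policy; the function and structures are illustrative:

```python
# Sketch: MVCC garbage-collection watermark. GC may reclaim versions
# older than the watermark, which is capped by the oldest active
# transaction's start timestamp. One stale transaction pins the
# watermark for the whole system. Illustrative only.

def gc_watermark(active_txn_start_ts: list, now: float, ttl: float) -> float:
    """Timestamp below which old versions are safe to garbage collect."""
    candidate = now - ttl
    if active_txn_start_ts:
        candidate = min(candidate, min(active_txn_start_ts))
    return candidate

# With no long-running transactions, GC reclaims up to now - ttl:
print(gc_watermark([], now=1000.0, ttl=100.0))               # 900.0
# A transaction that started at t=200 pins the watermark there:
print(gc_watermark([200.0, 950.0], now=1000.0, ttl=100.0))   # 200.0
```

This is why a single forgotten interactive session or hours-long analytics query can bloat storage across unrelated tables: every version written since its start timestamp must be retained.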