
Failure Modes: Cascading Failures, Split Brain, and Correlated Outages

Cascading failures occur when a downstream dependency slows or fails, causing upstream clients to time out and retry synchronously, amplifying load on an already unhealthy system. For example, at 100,000 requests per second with a 1% error rate and clients configured to retry twice, the failing dependency receives an additional 2,000 RPS surge, pushing error rates higher and triggering more timeouts in a vicious cycle. Without exponential backoff, jitter, and bounded retry budgets, this retry storm can collapse the entire system within seconds.

Circuit breakers detect elevated error or latency rates and open, stopping requests entirely and giving the downstream service time to recover. Bulkheads isolate resources (thread pools, connection pools) per dependency so that one slow service cannot exhaust resources needed by healthy services.

Split-brain scenarios arise in CP systems when a network partition allows multiple nodes to believe they are the leader, corrupting state through concurrent conflicting writes. Quorum protocols like Raft and Paxos prevent this by requiring majority agreement before any write commits, ensuring only one leader exists per term. In AP systems, split brain manifests as conflicting writes that must be reconciled later via conflict-resolution strategies such as last-write-wins or CRDTs.

Correlated failures invalidate independence assumptions: replicas sharing a rack, power feed, network switch, or software version fail simultaneously. A bad deployment pushed to all replicas at once can drop availability from 99.99% to zero instantly. Defenses include progressive rollouts (canary to 1% of traffic, then 10%, then 50%), independent deployment waves per AZ, and chaos engineering to surface hidden correlations.

Control plane outages can make the data plane unavailable even when instances are healthy, such as when service discovery, configuration services, or certificate issuance systems fail. Prefer data plane autonomy by caching configuration locally and pre-provisioning certificates.
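The backoff-and-jitter defense described above can be sketched in a few lines. This is a minimal illustration, not a production client; the function names and parameters are illustrative, and real systems would also track a global retry budget across callers.

```python
import random
import time

def call_with_backoff(op, max_retries=2, base=0.1, cap=2.0):
    """Retry `op` with exponential backoff and full jitter.

    `max_retries` bounds the retry budget, so a failing dependency
    sees at most (1 + max_retries) times the original request load
    instead of an unbounded retry storm.
    """
    for attempt in range(max_retries + 1):
        try:
            return op()
        except Exception:
            if attempt == max_retries:
                raise  # budget exhausted: surface the failure upstream
            # Full jitter: sleep a random duration up to the (capped)
            # exponential delay, de-synchronizing competing clients.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The jitter matters as much as the exponent: without it, all clients that timed out together retry together, re-creating the original load spike at each backoff interval.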
💡 Key Takeaways
Cascading failures amplify load on unhealthy dependencies. At 100k RPS with 1% errors and two retries, the failing service receives an extra 2k RPS, worsening the outage.
Circuit breakers open when error or latency thresholds are breached, stopping requests to allow downstream recovery. Bulkheads isolate resource pools per dependency to prevent resource exhaustion.
Split brain in CP systems allows multiple leaders to corrupt state. Quorum protocols (Raft, Paxos) require majority agreement to prevent this, ensuring single leader per term.
Correlated failures (shared rack, power, software version, deployment wave) cause simultaneous failures across replicas, invalidating parallel redundancy and collapsing availability to zero.
Progressive rollouts (canary to 1%, then 10%, then 50%) and independent deployment waves per AZ reduce blast radius of bad deployments and surface issues before full rollout.
Control plane outages (service discovery, config, certs) can make healthy data plane instances unavailable. Cache configuration locally and pre-provision certificates for data plane autonomy.
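The circuit-breaker behavior from the takeaways can be made concrete with a small state machine. This is a hedged sketch (class and parameter names are illustrative, not from any specific library): it trips after a number of consecutive failures and fails fast for a cooldown period before allowing a trial request.

```python
import time

class CircuitBreaker:
    """Minimal consecutive-failure circuit breaker (illustrative).

    Trips open after `failure_threshold` consecutive failures, then
    rejects calls with a fallback for `reset_timeout` seconds before
    letting one trial request through (the "half-open" state).
    """
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, op, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()   # open: fail fast, shed load
            self.opened_at = None   # half-open: allow one trial request
        try:
            result = op()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0           # success closes the circuit
        return result
```

Production breakers (e.g. Hystrix, as in the example below) typically trip on error *rates* or latency percentiles over a sliding window rather than a simple consecutive-failure count, but the open/half-open/closed cycle is the same.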
📌 Examples
A payments API at 100k RPS experiences 1% errors. Clients retry twice without backoff, adding 2k RPS to the failing backend, which increases errors to 5%, triggering 10k retry RPS and total collapse.
Netflix Hystrix circuit breakers open after detecting elevated latency (p99 > threshold for N seconds), returning fallback responses and giving the failing service time to recover.
A Raft cluster network partition causes two nodes to both believe they are leader. Quorum requirement prevents writes from committing on the minority side, avoiding split brain corruption.
A Kubernetes cluster deploys a buggy version to all pods simultaneously. Health checks pass initially but memory leaks cause all pods to OOM within 10 minutes, dropping availability to zero.
A service depends on an external certificate authority for TLS. A CA outage prevents new connections even though service instances are healthy. Pre-provisioning certificates avoids this failure mode.
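The retry-amplification numbers in the payments example above follow from simple arithmetic, sketched here (the function name is illustrative):

```python
def retry_amplification(rps, error_rate, retries):
    """Extra requests per second sent to a failing dependency when
    every failed call is retried synchronously `retries` times."""
    return rps * error_rate * retries

# The payments example: 100k RPS, 1% errors, two retries
extra = retry_amplification(100_000, 0.01, 2)   # ~2,000 extra RPS
# Once errors climb to 5%, the retry load quintuples
worse = retry_amplification(100_000, 0.05, 2)   # ~10,000 extra RPS
```

The feedback loop is the dangerous part: the extra retry load raises the error rate, which raises the retry load again, which is why bounded retry budgets matter more than any single retry setting.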