Design Fundamentals • Availability & Reliability • Medium • ⏱️ ~2 min
Building High Availability: Redundancy, Failover, and Blast Radius Control
High availability is achieved through redundancy across failure domains (racks, availability zones, regions) combined with rapid failure detection and automated recovery. Active-active deployments serve traffic from multiple independent replicas, while active-passive keeps standby replicas ready to assume load. The math matters: two independent instances, each at 99% availability, in parallel yield an availability of approximately 1 minus 0.01 squared, or 99.99%. However, correlated failures (shared power, network switch, OS version, or deployment wave) invalidate the independence assumption, collapsing the entire tier simultaneously. Google Cloud Spanner offers 99.99% availability for regional configurations and 99.999% for multi-region, achieved by spreading replicas across geographically separated zones and maintaining a majority quorum even during zone failures.
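The parallel-availability math can be checked with a short sketch; the function name and the instance counts are illustrative:

```python
def parallel_availability(per_instance: float, n: int) -> float:
    """Availability of n independent instances in parallel:
    the tier is down only if every instance is down at once."""
    failure = 1.0 - per_instance
    return 1.0 - failure ** n

# Two independent 99% instances -> ~99.99% for the tier.
print(round(parallel_availability(0.99, 2), 6))  # 0.9999
```

Note that this gain evaporates under correlated failure: if both instances share a rack or a deployment wave, the effective failure probability is closer to 0.01 than to 0.0001.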
Capacity headroom is essential: maintain 30 to 50% spare capacity in each failure domain so that when one domain fails, the remaining domains absorb its load without saturating. Netflix operates active-active across multiple AWS regions, regularly running chaos experiments such as region evacuations to validate failover paths. Their circuit breakers and bulkhead patterns (popularized via Hystrix) isolate failures to prevent cascading collapse. Detection and recovery targets should be explicit: detect instance failure within 10 seconds, AZ failure within 30 seconds, and execute region failover within 5 minutes. Health checks must be multi-signal, probing user-visible endpoints as well as dependency liveness and queue depth, so that unhealthy instances are not marked healthy. Cell-based architectures shard users or tenants into isolated cells with independent control planes, reducing blast radius at the cost of capacity fragmentation and operational complexity. When one cell fails or requires maintenance, the other cells remain unaffected.
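The headroom guidance follows directly from the failure-domain count: with N domains, surviving the loss of one means the remaining N − 1 must absorb all traffic, which caps steady-state utilization at (N − 1)/N. A minimal sketch (domain counts are illustrative):

```python
def max_safe_utilization(domains: int) -> float:
    """Highest steady-state utilization per domain such that the
    surviving domains can absorb the load when one domain fails."""
    if domains < 2:
        raise ValueError("need at least two failure domains")
    return (domains - 1) / domains

for n in (2, 3, 4):
    u = max_safe_utilization(n)
    print(f"{n} domains: run at <= {u:.0%}, keep >= {1 - u:.0%} spare")
```

For two or three domains this yields the 50% and ~33% spare-capacity figures that bracket the 30 to 50% rule of thumb in the text.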
💡 Key Takeaways
• Parallel redundancy boosts availability: two 99% instances yield 99.99% if failures are independent, calculated as 1 minus the product of failure probabilities (1 minus 0.01 squared).
• Correlated failures (shared rack, power, software version, deployment) eliminate independence, causing simultaneous failures across all replicas and invalidating parallel availability gains.
• Maintain 30 to 50% spare capacity per failure domain so remaining domains can absorb traffic during failover without saturating and degrading latency or triggering overload failures.
• Define detection and recovery targets explicitly: instance failure detected within 10 seconds, AZ failure within 30 seconds, and region failover completed within 5 minutes.
• Multi-signal health checks probe user-visible endpoints, dependency liveness, and queue-backlog thresholds to avoid false positives that mark unhealthy instances as ready.
• Cell or bulkhead architectures isolate users into independent cells with separate control planes, reducing blast radius but increasing capacity fragmentation and operational overhead.
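The multi-signal health check described above can be sketched as a simple aggregator; the `HealthReport` shape, signal names, and threshold are hypothetical, not any particular load balancer's API:

```python
from dataclasses import dataclass

@dataclass
class HealthReport:
    endpoint_ok: bool    # hypothetical signal: user-visible endpoint responded in time
    dependency_ok: bool  # hypothetical signal: critical downstream dependency is live
    queue_depth: int     # hypothetical signal: current work backlog

def is_healthy(report: HealthReport, max_queue_depth: int = 1000) -> bool:
    """Combine several signals; any failing signal marks the instance
    unhealthy so the load balancer stops routing traffic to it."""
    return (report.endpoint_ok
            and report.dependency_ok
            and report.queue_depth <= max_queue_depth)

# A fast '200 OK' alone is not enough: a saturated queue fails the check.
print(is_healthy(HealthReport(True, True, 5000)))  # False
```

The point is that a single shallow probe (such as a TCP connect or a static ping endpoint) can report "ready" while the instance is drowning in backlog or cut off from a critical dependency.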
📌 Examples
A stateless API behind two independent instances at 99% each, fronted by an HA load balancer, can achieve approximately 99.99% for the instance tier if the LB and shared components are not single points of failure.
Netflix runs active-active across AWS regions and regularly performs region-evacuation drills (chaos experiments) to ensure its systems remain available during full-region outages.
Google Cloud Spanner regional configurations offer a 99.99% availability SLA, while multi-region configurations raise that to 99.999%, tolerating full zone or region failures via quorum replication.
Amazon DynamoDB replicates data across three availability zones within a region, allowing the system to tolerate one AZ failure while maintaining single-digit-millisecond p50 latencies.