Load Balancing • Health Checks & Failure Detection • Hard • ⏱️ ~3 min
Tuning Detection Speed vs Stability Trade-offs
Failure detection speed fundamentally trades off against stability. Aggressive detection with short intervals and low thresholds reduces Mean Time To Recovery (MTTR) but increases false positives during network jitter or garbage collection pauses, causing flapping that can be worse than the original problem.
Consider the math of detection time: detection time ≈ (probe interval × unhealthy threshold) + probe timeout + propagation delay. AWS load balancers at 30 second intervals with a failure threshold of 2 and a 5 second timeout yield roughly 65 to 70 seconds before marking an instance unhealthy (30s + 30s + a 5s timeout on the second attempt + a few seconds of propagation). Tuning to 5 second intervals with 2 failures reduces this to approximately 10 to 15 seconds, but now you are probing 6 times more often, and any transient issue has 6 times more chances to trigger a false removal.
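As a rough sanity check, that formula can be wired into a small helper. This is only a sketch: the exact moment the clock starts and how the per-probe timeout overlaps the interval differ between load balancers, and the function name and propagation default are illustrative, not any vendor's documented behavior.

```python
def worst_case_detection_seconds(
    interval_s: float,
    unhealthy_threshold: int,
    timeout_s: float,
    propagation_s: float = 5.0,
) -> float:
    """Rough worst-case time from failure to removal from rotation.

    Model: the instance fails just after a successful probe, so we wait
    `unhealthy_threshold` full intervals, the final probe burns its timeout,
    and the verdict then takes `propagation_s` to reach the data plane.
    """
    return interval_s * unhealthy_threshold + timeout_s + propagation_s


# AWS-style defaults: 30s interval, threshold 2, 5s timeout -> the ~65-70s case
print(worst_case_detection_seconds(30, 2, 5, propagation_s=5))  # 70.0
# Aggressive tuning: 5s interval, threshold 2 -> the ~10-15s case, at 6x probe volume
print(worst_case_detection_seconds(5, 2, 5, propagation_s=0))   # 15.0
```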
The stability cost of aggressive settings appears as flapping, where instances repeatedly join and leave the load balancer pool. Each transition causes connection churn, retry storms, and wasted work. Netflix explicitly chose 30 second heartbeat intervals with 90+ second eviction in Eureka to favor availability over fast removal, then layered fast client-side circuit breakers on top to get sub-second reaction without registry churn. This two-tier approach separates slow, stable membership from fast, local load shedding.
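A minimal sketch of the fast local tier, assuming a per-target, client-side circuit breaker that trips after a few consecutive failures and retries after a short cool-off. The class name, thresholds, and cool-off value are illustrative; this is not Eureka's or Hystrix's actual API.

```python
import time


class LocalCircuitBreaker:
    """Fast, client-local load shedding: trips after a few consecutive
    failures and retries the target after a short cool-off, without
    touching the slow, stable service registry."""

    def __init__(self, failure_threshold: int = 3, cooloff_s: float = 0.5):
        self.failure_threshold = failure_threshold
        self.cooloff_s = cooloff_s
        self.consecutive_failures = 0
        self.opened_at: float | None = None  # None means the breaker is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: let a request through once the cool-off has elapsed.
        return time.monotonic() - self.opened_at >= self.cooloff_s

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

The slow tier (the registry) may keep the instance listed for 90+ seconds, while this fast tier stops sending it traffic within a few hundred milliseconds of consecutive errors.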
At scale, you must account for variance in your system. If the 99th percentile network round trip time is 100ms but you set 50ms timeouts, you'll get constant false positives. If garbage collection pauses reach 2 seconds at p99, heartbeat timeouts under 5 seconds will fire spuriously. Google production teams tune thresholds based on observed tail latency and use phi accrual detectors that adapt to arrival-pattern variance rather than fixed timeouts. Cassandra's phi threshold of 8 with 1 second heartbeats tolerates occasional multi-second pauses while still detecting real failures in 8 to 12 seconds.
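A minimal phi accrual sketch in the spirit of the Hayashibara et al. detector, assuming heartbeat inter-arrival times are roughly normally distributed. Production implementations such as Cassandra's add a bounded sample window, minimum deviations, and other guards, so treat the constants and class name here as illustrative.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Suspicion level (phi) instead of a binary alive/dead timeout.

    phi = -log10(probability that a heartbeat this late would still arrive),
    estimated from the observed distribution of inter-arrival times.
    """

    def __init__(self, window: int = 200, min_std_s: float = 0.1):
        self.intervals: deque[float] = deque(maxlen=window)
        self.min_std_s = min_std_s  # floor on std-dev to avoid divide-by-tiny
        self.last_heartbeat: float | None = None

    def heartbeat(self, now: float) -> None:
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        if self.last_heartbeat is None or len(self.intervals) < 2:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), self.min_std_s)
        elapsed = now - self.last_heartbeat
        # P(heartbeat arrives later than `elapsed`) under the normal model.
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-300))

    def is_suspect(self, now: float, threshold: float = 8.0) -> bool:
        return self.phi(now) > threshold


# Usage idea: with ~1s heartbeats, a multi-second GC pause raises phi, but if
# the window has already seen similar variance, phi can stay under 8 and the
# node is not falsely removed.
```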
💡 Key Takeaways
• Detection time equals (interval × unhealthy threshold) + timeout + propagation. AWS defaults of a 30s interval and 2 failures yield 60 to 70s detection. Reducing the interval to 5s achieves 10 to 15s but increases probe load 6x and false positive risk 6x.
• Flapping at capacity boundaries creates retry storms worse than slow failover. Add hysteresis with different thresholds for marking unhealthy versus healthy (e.g., 2 failures to remove, 5 successes to restore); see the sketch after this list. This takes 50 to 60 seconds to restore at 10 second intervals but prevents oscillation.
• Phi accrual failure detectors used by Cassandra treat failure as a suspicion level from 0 to infinity rather than a binary timeout. A phi threshold of 8 corresponds to roughly a 1 in 10^8 false positive rate under the detector's model. With 1 second heartbeats, typical detection is 8 to 12 seconds, adapting to variance from garbage collection.
• Multi-vantage-point checking from different availability zones or regions prevents false positives during network partitions. Google requires M of N probe locations to fail before global removal, otherwise only demoting the instance from the affected zone. Single-location probing causes split brain, where one zone incorrectly removes healthy peers.
• Two-tier architectures separate slow, stable membership (30 to 90 second updates) from fast local decisions (sub-second circuit breakers). Netflix Eureka's registry updates slowly, but clients react in under 1 second using passive request metrics, getting both stability and speed.
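The hysteresis mentioned in the second takeaway fits in a few lines. The state machine below is an illustrative sketch, with the 2-failure / 5-success thresholds taken from that bullet; names are hypothetical, not a specific load balancer's configuration.

```python
class HysteresisHealthTracker:
    """Asymmetric thresholds: quick to remove, slow to restore."""

    def __init__(self, unhealthy_after: int = 2, healthy_after: int = 5):
        self.unhealthy_after = unhealthy_after
        self.healthy_after = healthy_after
        self.healthy = True
        self._failures = 0
        self._successes = 0

    def record_probe(self, ok: bool) -> bool:
        """Feed one probe result; return the current health state."""
        if ok:
            self._successes += 1
            self._failures = 0
            if not self.healthy and self._successes >= self.healthy_after:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0
            if self.healthy and self._failures >= self.unhealthy_after:
                self.healthy = False
        return self.healthy
```

At a 10 second probe interval, restoration needs 5 consecutive successes, which is where the 50 to 60 second figure comes from, while removal still happens within roughly 2 intervals.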
📌 Examples
A team tuning an AWS Application Load Balancer from a 30s interval and a healthy threshold of 10 (300 second restoration time) to a 5s interval and a healthy threshold of 3 reduced restoration to 15 seconds but caused flapping during deployments. They added 10 seconds of connection draining and a healthy threshold of 5 (25 second restoration) to stabilize.
Cassandra with a phi accrual threshold of 8 and 1 second heartbeats detects node failures in roughly 8 to 12 seconds, depending on network variance. During a 3 second Java garbage collection pause, the suspicion level rises but stays below the threshold of 8, avoiding a false positive. A truly dead node crosses the threshold within about 10 seconds.
Google Site Reliability Engineering (SRE) teams probe from 3 different regions at 5 second intervals. They require 2 of 3 regions to report failure before global removal. During an availability zone network partition, only 1 region fails its checks, so the instance stays in global rotation but is removed from local zone routing.
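A sketch of the M-of-N decision described in that example. The function, region names, and 2-of-3 quorum are illustrative assumptions, not any provider's actual API.

```python
def route_decision(region_failures: dict[str, bool], quorum: int = 2) -> dict:
    """Multi-vantage-point decision: remove a target globally only when at
    least `quorum` probe regions agree it is failing; otherwise demote it
    only in the regions that observe failures."""
    failing = {region for region, failed in region_failures.items() if failed}
    remove_globally = len(failing) >= quorum
    return {
        "remove_globally": remove_globally,
        "demote_in": set() if remove_globally else failing,
    }


# AZ partition: only one region's probes fail -> stay in global rotation,
# drop out of that region's local routing only.
print(route_decision({"us-east": True, "us-west": False, "eu-west": False}))
# {'remove_globally': False, 'demote_in': {'us-east'}}

# Real failure: 2 of 3 regions agree -> global removal.
print(route_decision({"us-east": True, "us-west": True, "eu-west": False}))
# {'remove_globally': True, 'demote_in': set()}
```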