Tuning Detection Speed vs Stability Trade-offs
The Fundamental Trade-off
Failure detection speed fundamentally trades off against stability. Aggressive detection with short intervals and low thresholds reduces MTTR (Mean Time To Recovery, the average time to restore service after failure) but increases false positives during network jitter or garbage collection pauses. False positives cause flapping, where instances repeatedly join and leave the load balancer pool, creating connection churn and retry storms that can be worse than the original problem.
Detection Time Formula
Detection time is approximately interval × unhealthy_threshold + timeout + propagation_delay. With a 30-second interval, an unhealthy threshold of 2, and a 5-second timeout, detection takes roughly 65-70 seconds: 60 seconds for two intervals, 5 seconds for the second probe to time out, plus propagation. Dropping the interval to 5 seconds with the same threshold and timeout cuts this to about 15 seconds plus propagation, but you now probe 6x more often, so any transient issue has 6x more chances to trigger a false removal. The probe load on backends increases by the same factor.
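The arithmetic above can be checked with a short sketch. The function names here (detection_time, probes_per_minute) are illustrative and not taken from any particular load balancer; propagation delay defaults to zero because the text gives no specific value for it.

```python
def detection_time(interval_s, unhealthy_threshold, timeout_s, propagation_s=0.0):
    """Approximate seconds from failure to removal, per the formula above."""
    return interval_s * unhealthy_threshold + timeout_s + propagation_s


def probes_per_minute(interval_s):
    """Probe load a single checker places on one backend."""
    return 60.0 / interval_s


# Conservative settings: 30 s interval, 2 failures, 5 s timeout.
slow = detection_time(30, 2, 5)
# Aggressive settings: 5 s interval, same threshold and timeout.
fast = detection_time(5, 2, 5)
# How much more often the aggressive configuration probes.
load_ratio = probes_per_minute(5) / probes_per_minute(30)

print(f"slow={slow}s fast={fast}s probe load ratio={load_ratio}x")
```

Both results exclude propagation delay, which adds to either figure equally and so does not change the comparison.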
Hysteresis to Prevent Flapping
Hysteresis uses different thresholds for marking an instance unhealthy versus healthy, for example 2 consecutive failures to remove it but 5 consecutive successes to restore it. This asymmetry prevents oscillation at capacity boundaries. Without it, an instance under heavy load might fail 2 checks, get removed, recover immediately once the load is gone, pass 1 check, rejoin, get overloaded again, fail 2 checks, and repeat. With a success threshold of 5, the instance must demonstrate sustained stability before rejoining. At 10-second intervals, restoration takes 50-60 seconds, but this delay prevents the destructive flapping cycle.
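A minimal sketch of the asymmetric thresholds described above, assuming thresholds count consecutive probe results (a common convention, though not every health checker works this way). The class name HysteresisTracker and its fields are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class HysteresisTracker:
    unhealthy_threshold: int = 2   # consecutive failures needed to remove
    healthy_threshold: int = 5     # consecutive successes needed to restore
    healthy: bool = True
    # Consecutive probe results that contradict the current state.
    _streak: int = field(default=0, init=False, repr=False)

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if probe_ok == self.healthy:
            # Result agrees with the current state: reset the streak.
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self._streak >= needed:
            self.healthy = probe_ok
            self._streak = 0
        return self.healthy


tracker = HysteresisTracker()
for result in [False, False, True, True, True, True, True]:
    print(result, "->", tracker.observe(result))
```

In this run the instance is removed after the second failure and only restored after the fifth consecutive success, so a single passing probe right after removal is not enough to rejoin the pool.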
Phi Accrual Failure Detectors
Phi accrual failure detectors treat failure as a continuous suspicion level rather than a binary timeout. Instead of marking a node failed after N seconds of silence, they compute a value phi from the observed heartbeat history: phi = -log10(P), where P is the estimated probability that a heartbeat from a live node would arrive later than the current silence. Each unit of phi therefore cuts the chance of a wrong suspicion by a factor of 10: a threshold of 2 corresponds to roughly a 1 in 100 false positive rate, and a threshold of 8 to roughly 1 in 100 million. With 1-second heartbeats and a phi threshold of 8, typical detection takes 8-12 seconds. The key advantage is that phi detectors adapt to variance. If garbage collection pauses occasionally cause 2-second gaps, the detector learns this pattern and tolerates it without false positives, while still detecting genuine failures when heartbeats stop entirely.
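The following is a simplified sketch of a phi accrual detector, not a production implementation. It assumes heartbeat inter-arrival times are roughly normally distributed and computes phi as -log10 of the tail probability of the current silence. The class name, the window size, and the min_std_s floor are all assumptions made for illustration; real systems often use different distribution models, so their detection times will differ from this sketch.

```python
import math
from collections import deque
from statistics import fmean, pstdev


class PhiAccrualDetector:
    def __init__(self, threshold=8.0, window=100, min_std_s=0.1):
        self.threshold = threshold
        self.intervals = deque(maxlen=window)  # recent inter-arrival times
        self.min_std_s = min_std_s             # floor so perfectly regular heartbeats do not give std = 0
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Suspicion level for the silence observed so far."""
        if len(self.intervals) < 2:
            return 0.0  # not enough history to judge
        mean = fmean(self.intervals)
        std = max(pstdev(self.intervals), self.min_std_s)
        z = (now - self.last_heartbeat - mean) / std
        # Probability that a heartbeat would arrive even later than this.
        p_later = 0.5 * math.erfc(z / math.sqrt(2))
        return math.inf if p_later <= 0.0 else -math.log10(p_later)

    def suspected(self, now):
        return self.phi(now) >= self.threshold


# Steady 1 s heartbeats versus heartbeats with occasional 2 s gaps.
steady, jittery = PhiAccrualDetector(), PhiAccrualDetector()
for t in [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]:
    steady.heartbeat(t)
for t in [0.0, 1.0, 3.0, 4.0, 6.0, 7.0, 9.0, 10.0, 12.0]:
    jittery.heartbeat(t)

print("steady, 2 s silence:", steady.suspected(8.0 + 2.0))
print("jittery, 2 s silence:", jittery.suspected(12.0 + 2.0))
print("jittery, 10 s silence:", jittery.suspected(12.0 + 10.0))
```

The same 2-second silence is suspicious for the node with a steady history but tolerated for the node whose history already contains 2-second gaps, which is the adaptive behavior described above.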
Multi-Vantage Point Checking
Single-location probing can produce a split-brain view during network partitions: a prober in one zone incorrectly removes healthy peers simply because it cannot reach them. Multi-vantage point checking requires M of N probe locations to report failure before an instance is removed globally; if fewer fail, the instance is only demoted in the affected zone. For example, you might probe from 3 availability zones and require 2 of the 3 to fail before removal. This prevents false positives during localized network issues while maintaining fast detection of genuine failures, which are visible from all vantage points.
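The M of N rule can be sketched as a small decision function. The function name, the action labels, and the zone names below are hypothetical, chosen only to illustrate the 2-of-3 example from the text.

```python
def placement_decision(zone_reports, quorum=2):
    """Decide what to do with one instance.

    zone_reports maps a probe location to True (probe succeeded) or
    False (probe failed). quorum is M in the M-of-N rule.
    """
    failing = sorted(zone for zone, ok in zone_reports.items() if not ok)
    if len(failing) >= quorum:
        # Enough independent vantage points agree: remove everywhere.
        return {"action": "remove_globally", "zones": failing}
    if failing:
        # Likely a localized network issue: demote only where unreachable.
        return {"action": "demote_in_zones", "zones": failing}
    return {"action": "keep", "zones": []}


# One zone cannot reach the instance, the other two can.
print(placement_decision({"az-a": False, "az-b": True, "az-c": True}))
# Two of three zones see failures.
print(placement_decision({"az-a": False, "az-b": False, "az-c": True}))
```

Raising the quorum makes global removal more conservative at the cost of slower reaction to failures that only some vantage points can see.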