Load Balancing • Health Checks & Failure Detection
Active Probing vs Passive Failure Detection
Load balancers and orchestrators detect failures through two complementary approaches: active probing and passive observation. Each trades off detection speed, accuracy, and operational overhead in different ways.
Active checks send periodic probes, such as HTTP requests, TCP connection attempts, or agent queries, from the load balancer to backend instances. AWS load balancers commonly probe every 5 to 30 seconds with configurable timeouts of 2 to 5 seconds and thresholds of 2 to 3 consecutive failures before marking an instance unhealthy. At 10 second intervals with a 2 failure threshold, detection takes roughly 20 to 25 seconds. The advantage is bounded, predictable load and the ability to detect total hangs even when no traffic is flowing. The downside is slower reaction to real user impact and potential false positives during network partitions.
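The probe arithmetic above can be sketched as a back-of-the-envelope calculation. This is a simplifying model (function name and the fixed-schedule, probe-hangs-to-timeout assumptions are mine, not from any load balancer's documentation): detection is fastest when the failure begins just before a probe fires, slowest when it begins just after one.

```python
def detection_time(interval_s: float, timeout_s: float, threshold: int) -> tuple[float, float]:
    """Best/worst-case time to mark a backend unhealthy via active probes.

    Assumes the backend hangs (each probe runs to its full timeout),
    probes fire on a fixed schedule, and timeout_s < interval_s.
    """
    # Best case: failure begins just before a probe, so the first failing
    # probe fires almost immediately.
    best = (threshold - 1) * interval_s + timeout_s
    # Worst case: failure begins just after a successful probe, wasting
    # almost a full interval before the first failing probe.
    worst = threshold * interval_s + timeout_s
    return best, worst
```

With the numbers from the text (10 second interval, 5 second timeout, threshold of 2), `detection_time(10, 5, 2)` gives a 15 to 25 second range, consistent with the "roughly 20 to 25 seconds" typical case.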
Passive checks observe actual request errors and latency to infer endpoint health. Envoy and HAProxy implement outlier ejection, which removes hosts after detecting consecutive 5xx responses (typically 3 to 5) or abnormal latency relative to cluster peers. Netflix pairs slow Eureka registry updates (90+ second eviction windows) with client-side passive metrics that trip circuit breakers in under 1 second when error rates exceed 50% over a 10 second rolling window with a minimum volume of 20 requests. This reacts to actual user impact much faster than synthetic probes.
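A minimal sketch of that rolling-window breaker logic, using the thresholds quoted above (10 second window, 50% error rate, 20-request minimum volume). The class name and structure are illustrative, not Netflix's actual implementation:

```python
import time
from collections import deque

class CircuitBreaker:
    """Rolling-window breaker: opens when the error rate over the last
    window_s seconds exceeds error_threshold, but only after at least
    min_volume requests have been seen (so a handful of early failures
    cannot trip it)."""

    def __init__(self, window_s: float = 10.0, error_threshold: float = 0.5,
                 min_volume: int = 20):
        self.window_s = window_s
        self.error_threshold = error_threshold
        self.min_volume = min_volume
        self.events = deque()  # (timestamp, is_error) pairs

    def record(self, is_error: bool, now: float = None) -> None:
        now = time.monotonic() if now is None else now
        self.events.append((now, is_error))

    def is_open(self, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        # Drop events that have aged out of the rolling window.
        while self.events and self.events[0][0] < now - self.window_s:
            self.events.popleft()
        if len(self.events) < self.min_volume:
            return False
        errors = sum(1 for _, err in self.events if err)
        return errors / len(self.events) > self.error_threshold
```

Because the breaker keys off real request outcomes, it can open within a single window rather than waiting for the next synthetic probe cycle.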
The best production systems combine both approaches. Google uses active probes at 1 to 10 second cadence with 2 to 3 failure thresholds from multiple regions, achieving 5 to 30 second detection under tuned configs, while also implementing passive request tracking to catch gray failures where health endpoints return 200 but actual requests experience high tail latency due to issues like CPU soft lockups or network interface problems.
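The combined decision can be expressed as a simple conjunction of signals. This is an illustrative sketch (function name and threshold values are mine): a host must pass the synthetic probe and look healthy to real traffic, which is exactly what catches the gray failures described above.

```python
def endpoint_healthy(probe_ok: bool,
                     request_error_rate: float,
                     p99_latency_ms: float,
                     max_error_rate: float = 0.5,
                     max_p99_ms: float = 500.0) -> bool:
    """Combine an active probe result with passive request-level signals.

    Thresholds are illustrative placeholders, not from any specific system.
    """
    if not probe_ok:
        return False  # hard failure: probe timed out or returned non-200
    if request_error_rate > max_error_rate:
        return False  # passive signal: real requests are failing
    if p99_latency_ms > max_p99_ms:
        return False  # gray failure: health endpoint says 200, but tail latency is bad
    return True
```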
💡 Key Takeaways
• Active probes at AWS defaults of 30 second intervals with a 2 failure threshold yield 60 to 70 second detection times. Tuning to 5 to 10 second intervals with thresholds of 2 to 3 reduces this to 10 to 30 seconds for SLO compliance.
• Passive detection can react in under 1 second to actual traffic failures. Netflix circuit breakers trip at over 50% errors in 10 second windows, catching problems before the 90 second Eureka registry eviction completes.
• Active checks risk false positives during garbage collection pauses or network jitter. Phi accrual failure detectors (used by Cassandra with 1 second heartbeats and a phi threshold around 8) treat failure as a suspicion level rather than a binary timeout, achieving 8 to 12 second typical detection while tolerating Java stop-the-world pauses.
• Passive observation can conflate client bugs with server health. A buggy client sending malformed requests causes 4xx errors that look like backend problems. Separate 5xx (server fault) from 4xx (client fault) in outlier detection logic.
• Combining both approaches catches different failure modes. Active probes detect complete hangs without traffic, while passive metrics catch gray failures where health endpoints lie but real requests fail with high tail latency or partial errors.
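The phi accrual idea from the takeaways can be sketched as follows. This is a simplified model of the approach used by Cassandra and Akka (class name and the normal-distribution assumption are simplifications of the published design, not Cassandra's exact code): instead of a binary timeout, phi expresses suspicion as -log10 of the probability that a heartbeat is merely late, so a GC pause that widens the observed variance automatically raises the bar.

```python
import math
from collections import deque

class PhiAccrualDetector:
    """Phi accrual sketch: models heartbeat inter-arrival times as roughly
    normal and reports phi(t) = -log10(P(next heartbeat arrives later than t))."""

    def __init__(self, threshold: float = 8.0, max_samples: int = 100):
        self.threshold = threshold
        self.intervals = deque(maxlen=max_samples)
        self.last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        if len(self.intervals) < 2:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), 1e-3)  # floor std so perfectly regular beats don't divide by zero
        t = now - self.last_heartbeat
        # P(interval > t) under the normal model, via the complementary CDF.
        p_later = 0.5 * math.erfc((t - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-30))

    def suspect(self, now: float) -> bool:
        return self.phi(now) > self.threshold
```

With 1 second heartbeats and a threshold of 8, phi stays near zero while heartbeats arrive on schedule and climbs past the threshold only once the silence is statistically implausible.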
📌 Examples
Envoy outlier ejection removes hosts after 5 consecutive 5xx responses with a base ejection time of 30 seconds, then gradually reinstates via half open probes. This catches real traffic failures faster than periodic health checks.
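As a sketch, the corresponding Envoy cluster configuration might look like the fragment below. The field names follow Envoy's `outlier_detection` API; the surrounding cluster boilerplate is elided and the values simply mirror the numbers in this example:

```yaml
clusters:
- name: backend
  # ... connect_timeout, load_assignment, etc. ...
  outlier_detection:
    consecutive_5xx: 5            # eject after 5 straight 5xx responses
    interval: 10s                 # how often the ejection sweep runs
    base_ejection_time: 30s       # ejection duration scales with repeat ejections
    max_ejection_percent: 50      # never eject more than half the cluster at once
```

Capping `max_ejection_percent` is the usual safeguard against a correlated failure (or a bad detection rule) ejecting the entire cluster at once.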
Cassandra uses phi accrual failure detector with 1 second heartbeats and phi threshold of 8. This yields roughly 8 to 12 second detection depending on variance, balancing sensitivity against tolerance for Java garbage collection pauses that can freeze a process for seconds.
Google production runs black box probers from multiple regions at 1 to 10 second cadence. They require failures from multiple vantage points before global removal to distinguish local network issues from true service outages, avoiding split brain scenarios during availability zone partitions.
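The multi-vantage-point rule reduces to a simple quorum check. A minimal sketch (function name, region names, and the quorum value are illustrative, not Google's actual mechanism):

```python
def should_evict(probe_failures: dict, quorum: int = 2) -> bool:
    """Require independent failure reports from at least `quorum` regions
    before globally removing a backend. A single failing vantage point is
    more likely a local network issue than a true service outage.

    probe_failures maps region name -> whether that region's probe failed.
    """
    failing = sum(1 for failed in probe_failures.values() if failed)
    return failing >= quorum
```

One region reporting failure leaves the backend in rotation; agreement from a second, independent vantage point is what triggers global removal.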