Active Probing vs Passive Failure Detection
Two Detection Approaches
Load balancers detect failures through two complementary approaches: active probing and passive observation. Active probing sends synthetic health checks from the load balancer to backends. Passive observation monitors actual request traffic for errors and latency. Each trades off detection speed, accuracy, and operational overhead differently. Production systems typically combine both to catch different failure modes.
Active Health Checks
Active checks send periodic probes, such as HTTP requests, TCP connection attempts, or agent queries, from the load balancer to backend instances. Common configurations probe every 5-30 seconds with timeouts of 2-5 seconds and mark a backend unhealthy after 2-3 consecutive failures. At a 10-second interval with a failure threshold of 2, worst-case detection takes roughly 20-25 seconds: up to two full intervals plus the timeout of the final probe. The advantages are bounded, predictable probe load and the ability to detect total hangs even when no traffic is flowing. The downsides are slower reaction to real user impact and potential false positives during network partitions or garbage collection pauses, when a probe can time out even though the backend is otherwise able to serve requests.
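The detection-time arithmetic and the consecutive-failure threshold can be sketched as follows. This is a minimal illustration, not any particular load balancer's implementation: the names ActiveCheckConfig, ProbeTracker, and healthy_threshold are hypothetical, and the timing model assumes probes start on a fixed schedule and that a failing probe consumes its full timeout.

```python
from dataclasses import dataclass


@dataclass
class ActiveCheckConfig:
    interval_s: float = 10.0       # time between probe starts
    timeout_s: float = 5.0         # how long a probe waits before failing
    unhealthy_threshold: int = 2   # consecutive failures before marking unhealthy


def worst_case_detection_s(cfg: ActiveCheckConfig) -> float:
    # Backend hangs just after a probe succeeded: wait one full interval
    # for the first failing probe, (threshold - 1) more intervals for the
    # rest, then the final probe's timeout.
    return cfg.interval_s * cfg.unhealthy_threshold + cfg.timeout_s


def best_case_detection_s(cfg: ActiveCheckConfig) -> float:
    # Backend hangs just before a probe starts.
    return cfg.interval_s * (cfg.unhealthy_threshold - 1) + cfg.timeout_s


class ProbeTracker:
    """Flips health state only after enough consecutive contrary probes."""

    def __init__(self, unhealthy_threshold: int = 2, healthy_threshold: int = 2):
        # healthy_threshold (probes needed to recover) is an assumption of
        # this sketch; the text only specifies the failure threshold.
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self.streak = 0  # consecutive results that contradict current state

    def record(self, probe_ok: bool) -> bool:
        if probe_ok == self.healthy:
            self.streak = 0  # result agrees with current state
        else:
            self.streak += 1
            limit = self.healthy_threshold if probe_ok else self.unhealthy_threshold
            if self.streak >= limit:
                self.healthy = probe_ok
                self.streak = 0
        return self.healthy
```

Under these assumptions, a 10-second interval, 5-second timeout, and threshold of 2 give a worst case of 25 seconds and a best case of 15 seconds. Shortening the interval reduces both figures at the cost of more probe traffic.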
Passive Failure Detection
Passive checks observe actual request errors and latency to infer endpoint health. Outlier ejection removes a host after it returns a run of consecutive 5xx responses (typically 3-5) or shows abnormal latency compared to its cluster peers. Client-side circuit breakers can trip in under 1 second when the error rate exceeds 50% over a 10-second rolling window with a minimum volume of 20 requests. This reacts to actual user impact much faster than synthetic probes. The downside is that passive detection can conflate client bugs with server health: a buggy client sending malformed requests causes 4xx errors that look like backend problems if every error counts against the host. Outlier detection logic should therefore count 5xx responses (server fault) separately from 4xx responses (client fault).
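A rough sketch of the two passive mechanisms described above, using the thresholds from the text as defaults. The class names are hypothetical and the logic is simplified: timestamps are passed in explicitly so the behavior is deterministic, and only 5xx responses count as server faults. Resetting the streak on any non-5xx response is a design choice of this sketch.

```python
from collections import deque


def is_server_fault(status: int) -> bool:
    # Only 5xx counts against the backend; 4xx is treated as client fault.
    return 500 <= status <= 599


class ConsecutiveErrorEjector:
    """Ejects a host after `threshold` consecutive 5xx responses."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.consecutive_5xx = 0
        self.ejected = False

    def observe(self, status: int) -> bool:
        if is_server_fault(status):
            self.consecutive_5xx += 1
            if self.consecutive_5xx >= self.threshold:
                self.ejected = True
        else:
            self.consecutive_5xx = 0  # any non-5xx response breaks the streak
        return self.ejected


class RollingWindowBreaker:
    """Opens when the 5xx ratio exceeds `error_ratio` over the window."""

    def __init__(self, window_s: float = 10.0, min_volume: int = 20,
                 error_ratio: float = 0.5):
        self.window_s = window_s
        self.min_volume = min_volume
        self.error_ratio = error_ratio
        self.events = deque()  # (timestamp, is_server_fault) pairs

    def _evict(self, now: float) -> None:
        # Drop events that have aged out of the rolling window.
        while self.events and self.events[0][0] <= now - self.window_s:
            self.events.popleft()

    def record(self, status: int, now: float) -> None:
        self.events.append((now, is_server_fault(status)))
        self._evict(now)

    def is_open(self, now: float) -> bool:
        self._evict(now)
        total = len(self.events)
        if total < self.min_volume:
            return False  # too little traffic to judge
        errors = sum(1 for _, fault in self.events if fault)
        return errors / total > self.error_ratio
```

With these defaults, a burst of 400 responses from a malformed client never opens the breaker, while a backend that starts returning 503 opens it once more than half of at least 20 recent requests have failed.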
Combining Both Approaches
Strong production setups run both. Active probes at a 1-10 second cadence from multiple vantage points achieve 5-30 second detection for total failures. Passive request tracking catches gray failures in under a second; these are cases where health endpoints return 200 but real requests suffer high tail latency, for example because of CPU soft lockups or network interface problems. Active probes detect complete hangs without traffic; passive metrics catch degradation that synthetic checks miss.
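One way to picture the combination is as a small decision table over the two signals. The sketch below is illustrative only: Verdict, classify, and routable are hypothetical names, and each signal is reduced to a single boolean that the active and passive detectors would supply.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Verdict(Enum):
    HEALTHY = "healthy"
    HARD_FAILURE = "hard_failure"   # active probes fail (e.g. total hang)
    GRAY_FAILURE = "gray_failure"   # probes pass, real traffic degraded


def classify(active_ok: bool, passive_ok: bool) -> Verdict:
    # A failing active probe is treated as a hard failure regardless of
    # passive data, since an idle backend has no traffic to observe.
    if not active_ok:
        return Verdict.HARD_FAILURE
    # Health endpoint answers, but request-level signals look bad.
    if not passive_ok:
        return Verdict.GRAY_FAILURE
    return Verdict.HEALTHY


def routable(backends: Dict[str, Tuple[bool, bool]]) -> List[str]:
    # backends maps a name to (active_ok, passive_ok); only backends that
    # pass both checks receive traffic in this sketch.
    return sorted(
        name for name, (active_ok, passive_ok) in backends.items()
        if classify(active_ok, passive_ok) is Verdict.HEALTHY
    )
```

For example, a backend with passing probes but an ejected passive state is classified as a gray failure and is left out of the routable set, which is the case a probe-only setup would miss.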