
Active Probing vs Passive Failure Detection

Two Detection Approaches

Load balancers detect failures through two complementary approaches: active probing and passive observation. Active probing sends synthetic health checks from the load balancer to backends. Passive observation monitors actual request traffic for errors and latency. Each trades off detection speed, accuracy, and operational overhead differently. Production systems typically combine both to catch different failure modes.

Active Health Checks

Active checks send periodic probes (HTTP requests, TCP connection attempts, or agent queries) from the load balancer to backend instances. Common configurations probe every 5-30 seconds with timeouts of 2-5 seconds and mark a backend unhealthy after 2-3 consecutive failures. At 10-second intervals with a threshold of 2 failures, worst-case detection takes roughly 20-25 seconds: up to two full intervals plus the timeout of the final probe. The advantages are bounded, predictable load and the ability to detect total hangs even without traffic. The downsides are slower reaction to real user impact and potential false positives during network partitions or garbage collection pauses.
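The detection-time arithmetic and the consecutive-failure rule can be sketched as follows. This is a minimal illustration, not any particular load balancer's implementation: `ActiveCheckConfig`, `detection_window`, and `ConsecutiveFailureTracker` are hypothetical names, and the `healthy_threshold` used for recovery is an assumption that the text above does not specify. It also assumes probes start on a fixed schedule and that a hung backend makes every probe run to its full timeout.

```python
from dataclasses import dataclass


@dataclass
class ActiveCheckConfig:
    interval_s: float = 10.0      # time between probe starts
    timeout_s: float = 5.0        # wait before a probe counts as failed
    unhealthy_threshold: int = 2  # consecutive failures before ejection


def detection_window(cfg: ActiveCheckConfig) -> tuple[float, float]:
    """Return (best, worst) seconds from a hard hang to the unhealthy mark."""
    # Best case: the backend hangs just before a probe starts.
    best = (cfg.unhealthy_threshold - 1) * cfg.interval_s + cfg.timeout_s
    # Worst case: the backend hangs just after a probe succeeded, so a full
    # interval passes before the first failing probe even begins.
    worst = cfg.unhealthy_threshold * cfg.interval_s + cfg.timeout_s
    return best, worst


class ConsecutiveFailureTracker:
    """Flip state only after enough consecutive probes disagree with it."""

    def __init__(self, unhealthy_threshold: int = 2, healthy_threshold: int = 2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold  # assumption: not in the text
        self.healthy = True
        self._streak = 0  # run of results that contradict the current state

    def record(self, probe_ok: bool) -> bool:
        if probe_ok == self.healthy:
            self._streak = 0  # result agrees with current state; reset the run
        else:
            self._streak += 1
            limit = self.healthy_threshold if probe_ok else self.unhealthy_threshold
            if self._streak >= limit:
                self.healthy = probe_ok
                self._streak = 0
        return self.healthy
```

Under these assumptions, a 10-second interval with a 5-second timeout gives a 15-25 second window, and a 5-second interval gives 10-15 seconds. Requiring consecutive failures is what keeps a single slow probe, such as one caught by a garbage collection pause, from ejecting a healthy backend.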

Passive Failure Detection

Passive checks observe actual request errors and latency to infer endpoint health. Outlier ejection removes hosts after detecting consecutive 5xx responses (typically 3-5) or abnormal latency compared to cluster peers. Client-side circuit breakers can trip in under 1 second when error rates exceed 50% over a 10-second rolling window with minimum volume of 20 requests. This reacts to actual user impact much faster than synthetic probes. The downside: passive detection can conflate client bugs with server health. A buggy client sending malformed requests causes 4xx errors that look like backend problems. Separate 5xx (server fault) from 4xx (client fault) in outlier detection logic.
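A rolling-window circuit breaker with the thresholds mentioned above (more than 50% errors over 10 seconds, minimum 20 requests) might look like the sketch below. `RollingWindowBreaker` is a hypothetical class, not a real library's API, and it deliberately omits the half-open recovery state that production breakers usually include. Only 5xx responses count as errors, so 4xx responses caused by a buggy client do not trip it.

```python
import time
from collections import deque


class RollingWindowBreaker:
    """Open when the 5xx ratio in the rolling window exceeds a limit."""

    def __init__(self, window_s=10.0, error_ratio=0.5, min_requests=20,
                 clock=time.monotonic):
        self.window_s = window_s
        self.error_ratio = error_ratio
        self.min_requests = min_requests
        self._clock = clock      # injectable so the window can be tested
        self._events = deque()   # (timestamp, is_server_error), oldest first

    def _evict(self, now: float) -> None:
        # Drop events that have aged out of the rolling window.
        cutoff = now - self.window_s
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def record(self, status_code: int) -> None:
        now = self._clock()
        # Only 5xx counts against the backend; 4xx is a client fault.
        self._events.append((now, 500 <= status_code <= 599))
        self._evict(now)

    def is_open(self) -> bool:
        self._evict(self._clock())
        total = len(self._events)
        if total < self.min_requests:
            return False  # too little traffic to judge
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / total > self.error_ratio
```

The minimum-volume guard matters at low traffic: without it, a single failed request out of one would read as a 100% error rate and eject the host.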

Combining Both Approaches

Each approach covers the other's blind spot. Active probes at a 1-10 second cadence from multiple vantage points detect total failures in 5-30 seconds, and they detect complete hangs even when a backend receives no traffic. Passive request tracking catches gray failures in under a second. In a gray failure, the health endpoint returns 200 but real requests experience high tail latency, for example because of CPU soft lockups or network interface problems. Synthetic checks miss this kind of degradation.
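One way to combine the two signals is to route only to hosts that pass active probes and are not latency outliers relative to their peers. The sketch below is illustrative only: `latency_outliers` and `routable_hosts` are hypothetical helpers, and the 3x-median cutoff and the three-host minimum are assumed values, not figures from the text.

```python
from statistics import median


def latency_outliers(p99_by_host: dict[str, float], factor: float = 3.0) -> set[str]:
    """Hosts whose p99 latency exceeds factor x the cluster median p99."""
    if len(p99_by_host) < 3:
        return set()  # too few peers for a meaningful comparison
    cluster_median = median(p99_by_host.values())
    return {host for host, p99 in p99_by_host.items()
            if p99 > factor * cluster_median}


def routable_hosts(active_healthy: dict[str, bool],
                   p99_by_host: dict[str, float]) -> list[str]:
    """Keep hosts that pass active probes AND are not passive outliers."""
    outliers = latency_outliers(p99_by_host)
    return sorted(host for host, ok in active_healthy.items()
                  if ok and host not in outliers)


# Host "c" passes its health probe but serves real traffic slowly (a gray
# failure); host "d" fails its probe outright.
active = {"a": True, "b": True, "c": True, "d": False}
p99_ms = {"a": 40.0, "b": 45.0, "c": 900.0, "d": 42.0}
print(routable_hosts(active, p99_ms))
```

Here host `c` is removed by the passive signal alone and host `d` by the active signal alone, which is the complementary coverage the section describes.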

Key Insight: Active and passive detection catch different failure modes. Active probes detect complete hangs even without traffic. Passive metrics catch gray failures where health endpoints lie but real requests fail. Using both provides comprehensive coverage.
💡 Key Takeaways
Active probes at 10s intervals with a 2-failure threshold yield ~20-25s worst-case detection; shortening the interval to 5s achieves 10-15s
Passive detection reacts in under 1 second to actual traffic failures via circuit breakers at 50%+ error rates
Separate 5xx (server fault) from 4xx (client fault) in outlier detection to avoid conflating client bugs with server health
Combine both: active probes detect total hangs without traffic; passive metrics catch gray failures health endpoints miss
📌 Interview Tips
1. Explain active vs passive tradeoffs: active has bounded, predictable load but slower reaction; passive reacts faster but needs traffic
2. Describe circuit breaker triggering: 50%+ errors over a 10-second rolling window with a minimum of 20 requests
3. Mention that gray failures (200 from the health endpoint but degraded requests) require passive monitoring of actual request latency