
Production Failure Modes and Mitigation Strategies

Load balancing algorithms fail in subtle ways under real-world conditions that simple models miss. Understanding these failure modes and their mitigations is critical for production reliability.

Hotspot concentration through hashing is a common trap. When thousands of users behind carrier-grade NAT (CGNAT) share a handful of public IPs, IP-based consistent hashing overloads a few backends. A mobile carrier with 100,000 users might present only 2 to 5 public IP addresses, causing those backends to receive 20,000 to 50,000 RPS while others sit idle. Layer 7 cookie-based affinity, or incorporating the user ID into the hash key, provides better distribution.

Metric staleness causes oscillation with dynamic algorithms. If load metrics propagate every 5 to 10 seconds and routing decisions use stale data, backends flip between appearing idle and overloaded. Multiple proxies see the same "idle" backend simultaneously and flood it, then see it overloaded and abandon it, creating cycles. Power-of-two choices with local per-proxy state mitigates this by making decisions on immediately available in-flight counts rather than propagated metrics. Exponential moving average (EMA) smoothing with a 2 to 5 second half-life further dampens oscillation.

Retry storms amplify under sticky algorithms. When a backend becomes slow (a garbage collection pause, a disk stall), clients retry. With consistent hashing, retries hit the same struggling backend, making the problem worse: the backend's queue grows, causing more timeouts, triggering more retries in a positive feedback loop. Production mitigation requires retry budgets (limit cluster-wide retries to 1.5x to 2x base traffic), jittered exponential backoff (randomize retry timing to avoid a thundering herd), and retry diversification, where retries deliberately choose different backends than the original attempt.

HTTP/2 connection counting breaks least connections at Layer 4. A single HTTP/2 connection can multiplex 100 concurrent requests. A Layer 4 load balancer counting TCP connections sees one connection and considers that backend lightly loaded, while the backend actually handles 100 requests, causing severe imbalance. The fix requires Layer 7 awareness to count concurrent streams or in-flight requests rather than bare connection counts, or setting maximum-streams-per-connection limits (e.g., 10 to 20) and forcing clients to open multiple connections.
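The sketches below illustrate three of these mitigations in turn: power-of-two choices on local state, retry budgets with jittered and diversified retries, and stream-aware counting for HTTP/2. They are minimal Python sketches under assumed names and parameters, not any particular proxy's implementation.

First, a power-of-two-choices picker that uses only per-proxy in-flight counts plus an EMA-smoothed latency as a tie-breaker. The `Backend` class, the ~3 second half-life, and the tie-breaking rule are illustrative assumptions.

```python
import math
import random
import time

class Backend:
    """Purely local, per-proxy view of one backend: no propagated metrics."""
    def __init__(self, name, ema_half_life_s=3.0):
        self.name = name
        self.inflight = 0                      # requests this proxy currently has outstanding
        self.ema_latency_s = 0.0               # latency smoothed with an exponential moving average
        self._decay = math.log(2) / ema_half_life_s  # ln(2) / half-life
        self._last_sample = time.monotonic()

    def record_latency(self, latency_s):
        # Time-weighted EMA: a ~3 s half-life dampens oscillation while still
        # tracking genuine shifts in backend latency.
        now = time.monotonic()
        alpha = 1.0 - math.exp(-self._decay * (now - self._last_sample))
        self.ema_latency_s += alpha * (latency_s - self.ema_latency_s)
        self._last_sample = now

def pick_power_of_two(backends):
    """Sample two backends at random and keep the less loaded one.
    The in-flight count decides; the smoothed latency breaks ties."""
    a, b = random.sample(backends, 2)
    if a.inflight != b.inflight:
        return a if a.inflight < b.inflight else b
    return a if a.ema_latency_s <= b.ema_latency_s else b
```

The caller increments `inflight` before dispatching, decrements it on completion, and feeds the observed latency to `record_latency`; because every number comes from the proxy's own bookkeeping, there is no stale propagated metric to oscillate on.

Next, the retry-storm mitigations. The budget ratio, window length, and full-jitter backoff below are assumed values in line with the 1.5x to 2x guidance above.

```python
import random
import time

class RetryBudget:
    """Sliding-window retry budget: retries are allowed only while they stay below
    `ratio` times the first-attempt traffic seen in the window. State here is
    per-process; a real deployment would aggregate it across proxies."""
    def __init__(self, ratio=1.5, window_s=10.0):
        self.ratio = ratio
        self.window_s = window_s
        self._attempts = []  # list of (timestamp, is_retry)

    def record(self, is_retry):
        self._attempts.append((time.monotonic(), is_retry))

    def can_retry(self):
        cutoff = time.monotonic() - self.window_s
        self._attempts = [(t, r) for t, r in self._attempts if t >= cutoff]
        first_attempts = sum(1 for _, r in self._attempts if not r) or 1
        retries = sum(1 for _, r in self._attempts if r)
        return retries < self.ratio * first_attempts

def backoff_with_jitter(attempt, base_s=0.1, cap_s=5.0):
    """Full-jitter exponential backoff: a random delay in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))

def pick_retry_backend(backends, failed_backend):
    """Retry diversification: deliberately route the retry away from the backend that just failed."""
    alternatives = [b for b in backends if b is not failed_backend]
    return random.choice(alternatives) if alternatives else failed_backend
```

Finally, stream-aware accounting for HTTP/2, where load is measured in active streams rather than TCP connections and a per-connection stream cap keeps Layer 4 counts roughly honest. The class and field names are assumptions for illustration.

```python
class Http2Backend:
    """Layer 7 view of a backend: track multiplexed streams, not just TCP connections."""
    def __init__(self, name, max_streams_per_conn=20):
        self.name = name
        self.connections = 1          # open TCP connections to this backend
        self.active_streams = 0       # in-flight HTTP/2 streams across those connections
        self.max_streams_per_conn = max_streams_per_conn

    def needs_new_connection(self):
        # Capping streams per connection (e.g. 10 to 20) forces extra load to show up
        # as additional connections instead of hiding inside one multiplexed pipe.
        return self.active_streams >= self.connections * self.max_streams_per_conn

def least_loaded(backends):
    """Least-loaded selection by active streams: the Layer 7 analogue of least connections."""
    return min(backends, key=lambda b: b.active_streams)
```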
💡 Key Takeaways
Hotspot via CGNAT: 100,000 users from mobile carrier collapse to 2 to 5 public IPs. IP hash sends 20,000 to 50,000 RPS to those backends while others idle. Mitigation: Layer 7 cookie affinity or user ID in hash
Metric staleness oscillation: 5 to 10 second metric propagation causes proxies to flood backends that appear idle then abandon when they appear overloaded. Fix: Power of two choices with local state and 2 to 5 second EMA smoothing
Retry storm amplification: Slow backend triggers timeouts. Sticky hashing sends retries to same backend, growing queue and causing more failures. Require retry budgets limiting cluster retries to 1.5x to 2x base load and retry diversification to different backends
HTTP/2 connection counting: Single TCP connection multiplexes 100 streams. Layer 4 least connections sees 1 connection (light load) but backend handles 100 requests (heavy load). Fix: Layer 7 stream counting or limit max streams per connection to 10 to 20
Long lived connection imbalance: WebSocket or gRPC connections lasting hours pin to backends via flow hash. Scaling adds backends but existing connections don't rebalance. Mitigation: Connection draining windows of 60 to 300 seconds on scale events or client side periodic reconnect every 30 to 60 minutes (see the sketch after this list)
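For the long-lived connection case, a minimal sketch of client-side periodic reconnect with jitter is shown below. The 30 to 60 minute window comes from the takeaway above; `connect_fn` and the `close()` call are placeholder assumptions for whatever the client actually uses to open and tear down a connection through the load balancer.

```python
import random
import time

def jittered_lifetime_s(min_minutes=30, max_minutes=60):
    """Pick a randomized lifetime for a long-lived connection so a fleet of
    clients does not reconnect in lockstep."""
    return random.uniform(min_minutes * 60, max_minutes * 60)

class ManagedConnection:
    """Client-side wrapper around a long-lived connection (WebSocket/gRPC).
    `connect_fn` is a placeholder that opens a connection through the load
    balancer and returns an object with a close() method (assumed API)."""
    def __init__(self, connect_fn):
        self._connect_fn = connect_fn
        self.conn = connect_fn()
        self._deadline = time.monotonic() + jittered_lifetime_s()

    def maybe_rotate(self):
        # Called periodically or between messages: once the jittered lifetime
        # expires, re-establish the connection so newly added backends can
        # pick up their share of long-lived traffic.
        if time.monotonic() >= self._deadline:
            self.conn.close()
            self.conn = self._connect_fn()
            self._deadline = time.monotonic() + jittered_lifetime_s()
```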
📌 Examples
Netflix retry storm 2019: Backend latency spike caused client retries. Consistent hashing pinned retries to same backends, amplifying overload. Deployed retry budget limiting total retries to 150% of base traffic and retry diversification to random backends, cutting incident duration from 45 minutes to 8 minutes
Microsoft Service Fabric: Metric lag of 10 seconds caused load oscillation with a global least-load algorithm. Switched to power of two choices with per proxy state, reducing p99 latency spikes from 2 seconds to 400ms
E-commerce mobile app: 80,000 concurrent users from carrier CGNAT with 3 public IPs. IP hash sent 26,000 users to 3 of 20 backends (130 RPS vs 50 RPS), causing 5% error rate. Moved to cookie affinity, evening load to 100 RPS per backend and dropping errors to 0.1%