
Failure Modes and Edge Cases in Global Load Balancing

Global load balancing introduces failure modes that rarely appear in single-region systems. The most insidious is DNS caching staleness: even with 20-second TTLs, some resolvers ignore the directive and cache records for minutes or hours. During a regional outage, those users keep hitting the failed region until their resolver refreshes. Enterprise networks are particularly problematic: a company's internal DNS server in New York might cache aggressively, sending every employee to a dead region long after the public Internet has already failed over. EDNS Client Subnet (ECS) improves routing accuracy but is not universally deployed.

Brownout scenarios create another trap. A region accepts TCP connections and returns HTTP 200 responses, yet serves high tail latency or subtly corrupted data. Shallow health checks that merely verify TCP connectivity or probe a lightweight endpoint miss this, so the global load balancer keeps routing traffic while the user experience degrades silently. Production systems require deep health checks that exercise actual dependencies (databases, caches, downstream services) and validate Service Level Indicator (SLI) metrics such as p95 latency and error rate from multiple vantage points. Amazon's health checks probe from globally distributed locations every 10 to 30 seconds and require multiple consecutive failures before marking a region down.

Flapping creates cascading failures through a positive feedback loop. Overly sensitive health thresholds cause traffic to oscillate between regions: region A gets loaded, its health check fails, traffic moves to region B, which then overloads and fails, and traffic swings back to region A. The oscillation amplifies load and can bring down every region. Production systems add hysteresis (requiring multiple consecutive check results before changing state), dampening (exponential backoff on weight changes), and rate limits that cap regional weight changes to 5 to 10% per minute. Netflix's Chaos Kong drills specifically test these boundaries by simulating complete regional failures and measuring whether the shifted traffic overwhelms the surviving regions.

The most dangerous scenario is a thundering herd during failover. When a major region fails, suddenly shifting 100% of its load to the survivors can exceed their capacity; systems that run hot at 70 to 80% utilization in steady state lack headroom for a 50% load increase. The solution requires multiple layers: maintain 20 to 30% spare capacity per region, implement overload protection with queue limits and circuit breakers, and degrade non-critical features automatically. Google's systems use admission control at the edge to shed load before it reaches overloaded backends, preserving core functionality while degrading auxiliary features.
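To make the brownout point concrete, here is a minimal sketch (in Go) of what a deep health check might look like. The endpoint path, the stubbed metrics lookup, and the latency/error thresholds are illustrative assumptions, not a specific vendor's API; the point is that the check exercises a dependency path and validates SLIs instead of only confirming reachability.

```go
// Minimal sketch of a "deep" health check: it exercises a dependency-touching
// endpoint and validates SLI thresholds instead of only confirming TCP/HTTP
// reachability. Endpoint names and thresholds are illustrative assumptions.
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

type SLISnapshot struct {
	P95Latency time.Duration // p95 request latency over the last window
	ErrorRate  float64       // fraction of 5xx responses over the last window
}

// fetchSLIs would normally query the region's metrics backend; stubbed here.
func fetchSLIs(ctx context.Context, region string) (SLISnapshot, error) {
	return SLISnapshot{P95Latency: 180 * time.Millisecond, ErrorRate: 0.002}, nil
}

// deepHealthy returns true only if a dependency-exercising endpoint responds
// AND the region's SLIs are within budget.
func deepHealthy(ctx context.Context, region string) bool {
	// 1. Hit an endpoint that touches the database and cache, not just "/ping".
	//    (hypothetical URL scheme for this sketch)
	url := fmt.Sprintf("https://%s.example.com/healthz/deep", region)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return false
	}
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false
	}

	// 2. Validate SLIs: a region that answers 200 but with terrible tail
	//    latency or elevated errors is a brownout, not a healthy region.
	slis, err := fetchSLIs(ctx, region)
	if err != nil {
		return false
	}
	return slis.P95Latency < 300*time.Millisecond && slis.ErrorRate < 0.01
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	fmt.Println("us-east healthy:", deepHealthy(ctx, "us-east"))
}
```

In production, the equivalent check would run from multiple vantage points and require several consecutive failures before acting, as described above.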
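The flap-damping idea (hysteresis plus rate-limited weight changes) can also be sketched in a few lines. The constants below (three consecutive results, a 5% per-minute cap, a 50% fair-share target) are assumptions chosen to match the ranges discussed above, not values from any particular system.

```go
// Minimal sketch of flap damping for regional traffic weights: a region only
// changes health state after several consecutive identical check results
// (hysteresis), and its weight moves at most a few percentage points per
// interval (rate limiting). Constants are illustrative, not prescriptive.
package main

import "fmt"

const (
	failuresToTrip   = 3    // consecutive failures before marking a region down
	successesToClear = 3    // consecutive successes before marking it up again
	maxStepPerMinute = 0.05 // cap weight changes at 5% per minute
)

type regionState struct {
	weight      float64 // current fraction of traffic, 0.0-1.0
	target      float64 // desired fraction based on health
	consecFails int
	consecOKs   int
	healthy     bool
}

// observe records one health-check result and applies hysteresis.
func (r *regionState) observe(ok bool) {
	if ok {
		r.consecOKs++
		r.consecFails = 0
		if r.consecOKs >= successesToClear {
			r.healthy = true
		}
	} else {
		r.consecFails++
		r.consecOKs = 0
		if r.consecFails >= failuresToTrip {
			r.healthy = false
		}
	}
	if r.healthy {
		r.target = 0.5 // e.g. its fair share in a two-region setup (assumed)
	} else {
		r.target = 0.0
	}
}

// step moves the weight toward the target, never faster than the cap, so a
// flapping health signal cannot slosh traffic back and forth between regions.
func (r *regionState) step() {
	delta := r.target - r.weight
	if delta > maxStepPerMinute {
		delta = maxStepPerMinute
	} else if delta < -maxStepPerMinute {
		delta = -maxStepPerMinute
	}
	r.weight += delta
}

func main() {
	r := regionState{weight: 0.5, target: 0.5, healthy: true}
	// Simulate three failed checks, then per-minute weight updates.
	for i := 0; i < 3; i++ {
		r.observe(false)
	}
	for minute := 1; minute <= 3; minute++ {
		r.step()
		fmt.Printf("minute %d: weight=%.2f healthy=%v\n", minute, r.weight, r.healthy)
	}
}
```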
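Finally, a sketch of edge admission control in the same spirit: when a region runs hot, non-critical request classes are shed before the overload reaches backends. The criticality classes, the 80% shed threshold, and the capacity figure are assumptions for illustration, and the check-then-increment is simplified for readability.

```go
// Minimal sketch of edge admission control: above a utilization threshold,
// degradable request classes are rejected so core traffic keeps flowing and
// queues stay bounded. Thresholds and classes are illustrative assumptions.
package main

import (
	"fmt"
	"sync/atomic"
)

type criticality int

const (
	critical   criticality = iota // e.g. checkout, playback start
	degradable                    // e.g. recommendations, thumbnails
)

type admissionController struct {
	inFlight int64 // current concurrent requests in this region
	capacity int64 // requests the region can serve without overload
}

// admit decides whether to accept a request. Above ~80% utilization it sheds
// degradable traffic; at or above 100% it rejects everything.
// (The load-then-increment here is not atomic; a real implementation would
// guard the decision, but the sketch keeps the logic readable.)
func (a *admissionController) admit(c criticality) bool {
	cur := atomic.LoadInt64(&a.inFlight)
	switch {
	case cur >= a.capacity:
		return false
	case c == degradable && cur >= a.capacity*8/10:
		return false
	default:
		atomic.AddInt64(&a.inFlight, 1)
		return true
	}
}

// done releases capacity when a request finishes.
func (a *admissionController) done() {
	atomic.AddInt64(&a.inFlight, -1)
}

func main() {
	ac := &admissionController{capacity: 100}
	ac.inFlight = 85 // simulate a region already running hot
	fmt.Println("critical admitted:  ", ac.admit(critical))
	fmt.Println("degradable admitted:", ac.admit(degradable))
}
```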
💡 Key Takeaways
DNS caching staleness can delay failover for minutes even with 20-second TTLs because some resolvers, especially corporate networks, ignore low TTL directives
Brownout failures where regions accept connections but serve degraded responses require deep health checks that validate actual SLI metrics like p95 latency and error rates
Flapping occurs when overly sensitive thresholds cause traffic oscillation between regions; production systems cap weight changes to 5 to 10% per minute with hysteresis
Thundering herd during failover can overwhelm survivor regions; Netflix maintains 30% spare capacity per region and Google uses admission control to shed load at edges
Resolver to user location mismatch causes suboptimal routing when corporate or ISP DNS servers are geographically distant from end users, requiring EDNS Client Subnet support
Anycast path anomalies from BGP route leaks can steer users to distant PoPs, increasing latency by hundreds of milliseconds until detected via Real User Monitoring (RUM)
📌 Examples
During a 2021 incident, a provider's DNS servers cached stale records for 15 minutes despite 30-second TTLs, leaving 20% of traffic pointed at the failed region
A streaming service's health check only verified TCP connectivity; the region served 500 errors for 8 minutes before monitors detected high error rates and triggered failover
A Chaos Kong drill showed that shifting 100% of one Netflix region's load overloaded the surviving regions, until Netflix implemented gradual traffic ramps and 30% capacity headroom per region