
Failure Modes and Edge Cases in Global Load Balancing

Global load balancing introduces failure modes that rarely appear in single-region systems. The most insidious is DNS caching staleness: even with 20-second TTLs, some resolvers ignore the directive and cache records for minutes or hours. During a regional outage, those users keep hitting the failed region until their resolver refreshes. Enterprise networks are particularly problematic: a company's internal DNS server in New York might cache aggressively, sending every employee to a dead region long after the public Internet has already failed over. EDNS Client Subnet (ECS) improves routing accuracy but is not universally deployed.

Brownout scenarios create another trap. A region accepts TCP connections and returns HTTP 200 responses, yet serves high tail latency or subtly corrupted data. Shallow health checks that merely verify TCP connectivity or probe a lightweight endpoint miss this, so the global load balancer keeps routing traffic while the user experience degrades silently. Production systems require deep health checks that exercise actual dependencies (databases, caches, downstream services) and validate Service Level Indicator (SLI) metrics such as p95 latency and error rate from multiple vantage points. Amazon's health checks probe from globally distributed locations every 10 to 30 seconds and require multiple consecutive failures before marking a region down.

Flapping creates cascading failures through a positive feedback loop. Overly sensitive health thresholds cause traffic to oscillate between regions: region A gets loaded, its health check fails, traffic moves to region B, which then overloads and fails, and traffic swings back to region A. The oscillation amplifies load and can bring down every region. Production systems add hysteresis (requiring multiple consecutive check results before changing state), dampening (exponential backoff on weight changes), and rate limits that cap regional weight changes to 5 to 10% per minute. Netflix's Chaos Kong drills specifically test these boundaries by simulating complete regional failures and measuring whether the shifted traffic overwhelms the surviving regions.

The most dangerous scenario is a thundering herd during failover. When a major region fails, suddenly shifting 100% of its load to the survivors can exceed their capacity; systems that run hot at 70 to 80% utilization in steady state lack headroom for a 50% load increase. The solution requires multiple layers: maintain 20 to 30% spare capacity per region, implement overload protection with queue limits and circuit breakers, and degrade non-critical features automatically. Google's systems use admission control at the edge to shed load before it reaches overloaded backends, preserving core functionality while degrading auxiliary features.
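To make the brownout point concrete, here is a minimal sketch (in Go) of what a deep health check might look like. The endpoint path, the stubbed metrics lookup, and the latency/error thresholds are illustrative assumptions, not a specific vendor's API; the point is that the check exercises a dependency path and validates SLIs instead of only confirming reachability.

```go
// Minimal sketch of a "deep" health check: it exercises a dependency-touching
// endpoint and validates SLI thresholds instead of only confirming TCP/HTTP
// reachability. Endpoint names and thresholds are illustrative assumptions.
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

type SLISnapshot struct {
	P95Latency time.Duration // p95 request latency over the last window
	ErrorRate  float64       // fraction of 5xx responses over the last window
}

// fetchSLIs would normally query the region's metrics backend; stubbed here.
func fetchSLIs(ctx context.Context, region string) (SLISnapshot, error) {
	return SLISnapshot{P95Latency: 180 * time.Millisecond, ErrorRate: 0.002}, nil
}

// deepHealthy returns true only if a dependency-exercising endpoint responds
// AND the region's SLIs are within budget.
func deepHealthy(ctx context.Context, region string) bool {
	// 1. Hit an endpoint that touches the database and cache, not just "/ping".
	//    (hypothetical URL scheme for this sketch)
	url := fmt.Sprintf("https://%s.example.com/healthz/deep", region)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return false
	}
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false
	}

	// 2. Validate SLIs: a region that answers 200 but with terrible tail
	//    latency or elevated errors is a brownout, not a healthy region.
	slis, err := fetchSLIs(ctx, region)
	if err != nil {
		return false
	}
	return slis.P95Latency < 300*time.Millisecond && slis.ErrorRate < 0.01
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	fmt.Println("us-east healthy:", deepHealthy(ctx, "us-east"))
}
```

In production, the equivalent check would run from multiple vantage points and require several consecutive failures before acting, as described above.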
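The flap-damping idea (hysteresis plus rate-limited weight changes) can also be sketched in a few lines. The constants below (three consecutive results, a 5% per-minute cap, a 50% fair-share target) are assumptions chosen to match the ranges discussed above, not values from any particular system.

```go
// Minimal sketch of flap damping for regional traffic weights: a region only
// changes health state after several consecutive identical check results
// (hysteresis), and its weight moves at most a few percentage points per
// interval (rate limiting). Constants are illustrative, not prescriptive.
package main

import "fmt"

const (
	failuresToTrip   = 3    // consecutive failures before marking a region down
	successesToClear = 3    // consecutive successes before marking it up again
	maxStepPerMinute = 0.05 // cap weight changes at 5% per minute
)

type regionState struct {
	weight      float64 // current fraction of traffic, 0.0-1.0
	target      float64 // desired fraction based on health
	consecFails int
	consecOKs   int
	healthy     bool
}

// observe records one health-check result and applies hysteresis.
func (r *regionState) observe(ok bool) {
	if ok {
		r.consecOKs++
		r.consecFails = 0
		if r.consecOKs >= successesToClear {
			r.healthy = true
		}
	} else {
		r.consecFails++
		r.consecOKs = 0
		if r.consecFails >= failuresToTrip {
			r.healthy = false
		}
	}
	if r.healthy {
		r.target = 0.5 // e.g. its fair share in a two-region setup (assumed)
	} else {
		r.target = 0.0
	}
}

// step moves the weight toward the target, never faster than the cap, so a
// flapping health signal cannot slosh traffic back and forth between regions.
func (r *regionState) step() {
	delta := r.target - r.weight
	if delta > maxStepPerMinute {
		delta = maxStepPerMinute
	} else if delta < -maxStepPerMinute {
		delta = -maxStepPerMinute
	}
	r.weight += delta
}

func main() {
	r := regionState{weight: 0.5, target: 0.5, healthy: true}
	// Simulate three failed checks, then per-minute weight updates.
	for i := 0; i < 3; i++ {
		r.observe(false)
	}
	for minute := 1; minute <= 3; minute++ {
		r.step()
		fmt.Printf("minute %d: weight=%.2f healthy=%v\n", minute, r.weight, r.healthy)
	}
}
```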
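Finally, a sketch of edge admission control in the same spirit: when a region runs hot, non-critical request classes are shed before the overload reaches backends. The criticality classes, the 80% shed threshold, and the capacity figure are assumptions for illustration, and the check-then-increment is simplified for readability.

```go
// Minimal sketch of edge admission control: above a utilization threshold,
// degradable request classes are rejected so core traffic keeps flowing and
// queues stay bounded. Thresholds and classes are illustrative assumptions.
package main

import (
	"fmt"
	"sync/atomic"
)

type criticality int

const (
	critical   criticality = iota // e.g. checkout, playback start
	degradable                    // e.g. recommendations, thumbnails
)

type admissionController struct {
	inFlight int64 // current concurrent requests in this region
	capacity int64 // requests the region can serve without overload
}

// admit decides whether to accept a request. Above ~80% utilization it sheds
// degradable traffic; at or above 100% it rejects everything.
// (The load-then-increment here is not atomic; a real implementation would
// guard the decision, but the sketch keeps the logic readable.)
func (a *admissionController) admit(c criticality) bool {
	cur := atomic.LoadInt64(&a.inFlight)
	switch {
	case cur >= a.capacity:
		return false
	case c == degradable && cur >= a.capacity*8/10:
		return false
	default:
		atomic.AddInt64(&a.inFlight, 1)
		return true
	}
}

// done releases capacity when a request finishes.
func (a *admissionController) done() {
	atomic.AddInt64(&a.inFlight, -1)
}

func main() {
	ac := &admissionController{capacity: 100}
	ac.inFlight = 85 // simulate a region already running hot
	fmt.Println("critical admitted:  ", ac.admit(critical))
	fmt.Println("degradable admitted:", ac.admit(degradable))
}
```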
💡 Key Takeaways
DNS caching staleness can delay failover for minutes even with 20-second TTLs because some resolvers, especially corporate networks, ignore low TTL directives
Brownout failures where regions accept connections but serve degraded responses require deep health checks that validate actual SLI metrics like p95 latency and error rates
Flapping occurs when overly sensitive thresholds cause traffic oscillation between regions; production systems cap weight changes to 5 to 10% per minute with hysteresis
Thundering herd during failover can overwhelm survivor regions; Netflix maintains 30% spare capacity per region and Google uses admission control to shed load at edges
Resolver to user location mismatch causes suboptimal routing when corporate or ISP DNS servers are geographically distant from end users, requiring EDNS Client Subnet support
Anycast path anomalies from BGP route leaks can steer users to distant PoPs, increasing latency by hundreds of milliseconds until detected via Real User Monitoring (RUM)
📌 Examples
During a 2021 incident, a provider's DNS servers cached stale records for 15 minutes despite 30-second TTLs, leaving 20% of traffic pointed at the failed region
A streaming service's health check only verified TCP connectivity; the region served 500 errors for 8 minutes before monitors detected high error rates and triggered failover
A Chaos Kong drill showed that shifting 100% of one Netflix region's load overloaded the surviving regions, until Netflix implemented gradual traffic ramps and 30% capacity headroom per region