Common Failure Modes and Mitigation Strategies in Service Discovery
Service discovery systems fail in predictable ways under real-world conditions. Understanding these failure modes and their mitigations separates production-ready systems from fragile ones.
Stale cache blackholing is the most common issue. A client caches endpoints with a 60-second TTL. An instance crashes, but the client keeps sending traffic to the dead endpoint for up to 60 seconds, causing timeouts. Mitigation requires a layered defense: reduce the cache TTL to 10 to 30 seconds, implement connection-level timeouts of 1 to 3 seconds, use circuit breakers that open after 5 consecutive failures within 10 seconds, and prefer push-based updates for critical paths so failures propagate in under 1 second.
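A minimal sketch of that layered client-side defense in Go (the type names, thresholds, and the caller-supplied resolve function are illustrative, not from any particular library): a short-TTL endpoint cache, a hard request timeout, and a circuit breaker that trips after 5 consecutive failures within 10 seconds.

```go
package discovery

import (
	"net/http"
	"sync"
	"time"
)

// httpClient enforces the 1-to-3-second request budget so a dead
// endpoint fails fast instead of hanging callers.
var httpClient = &http.Client{Timeout: 2 * time.Second}

// endpointCache holds resolved endpoints for a short TTL so a dead
// instance is dropped within seconds, not minutes.
type endpointCache struct {
	mu        sync.Mutex
	endpoints []string
	expires   time.Time
	resolve   func() []string // registry lookup, supplied by the caller
}

func (c *endpointCache) Get() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Now().After(c.expires) {
		c.endpoints = c.resolve()
		c.expires = time.Now().Add(15 * time.Second) // within the 10-30s range
	}
	return c.endpoints
}

// breaker opens after 5 consecutive failures inside a 10-second window
// and rejects calls for a 30-second cooldown.
type breaker struct {
	mu          sync.Mutex
	failures    int
	windowStart time.Time
	openUntil   time.Time
}

func (b *breaker) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return time.Now().After(b.openUntil)
}

func (b *breaker) Record(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	if err == nil {
		b.failures = 0
		return
	}
	// Start a fresh failure window if the previous one has aged out.
	if now.Sub(b.windowStart) > 10*time.Second {
		b.failures = 0
		b.windowStart = now
	}
	b.failures++
	if b.failures >= 5 {
		b.openUntil = now.Add(30 * time.Second)
		b.failures = 0
	}
}
```

A production breaker would also allow a trial request after the cooldown (a half-open state); that is omitted here for brevity.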
Thundering herds occur when many clients refresh simultaneously. If 10,000 clients have synchronized 30-second TTLs, they all query the registry at the same moment, creating a 10,000x traffic spike. The registry overloads, responses slow, clients time out and retry, and the retries amplify the storm. Mitigation uses jittered TTLs: instead of exactly 30 seconds, use 25 to 35 seconds, randomly distributed. Add exponential backoff on failures: retry after 1 second, then 2 seconds, then 4 seconds, not immediately.
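A sketch of both mitigations, using hypothetical helper names: the jitter spreads a 30-second base TTL across roughly 25 to 35 seconds, and the backoff doubles the retry delay with a cap and its own jitter.

```go
package discovery

import (
	"math/rand"
	"time"
)

// jitteredTTL spreads cache expirations so clients do not refresh in
// lockstep: a 30-second base TTL becomes a uniform value in [25s, 35s).
func jitteredTTL(base time.Duration) time.Duration {
	spread := base / 3 // e.g. 10s of spread for a 30s base
	return base - spread/2 + time.Duration(rand.Int63n(int64(spread)))
}

// retryBackoff returns the delay before retry number `attempt`
// (0, 1, 2, ...): 1s, 2s, 4s, 8s, ... capped at 30s, with extra jitter
// so failing clients do not retry in unison either.
func retryBackoff(attempt int) time.Duration {
	if attempt < 0 {
		attempt = 0
	}
	if attempt > 5 {
		attempt = 5 // keep the exponent, and hence the delay, bounded
	}
	d := time.Second << attempt
	if d > 30*time.Second {
		d = 30 * time.Second
	}
	return d + time.Duration(rand.Int63n(int64(d/4)))
}
```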
Graceful shutdown gaps cause connection resets. An instance deregisters from the registry, which pushes updates to clients within 1 second. But the instance has 500 open HTTP/2 streams that will run for another 30 seconds. New requests stop arriving immediately, yet existing connections break when the instance terminates, causing user-facing errors. The solution is a drain window: after deregistering, keep serving existing connections for 30 to 120 seconds (matching the connection lifetime) before terminating. Enforce a maximum connection age on servers so connections naturally rotate every 5 to 10 minutes.
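A sketch of the deregister-then-drain sequence using Go's net/http; deregister() is a placeholder for whatever registry client you use, and the 5-second propagation pause and 60-second drain window are example values.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	// Block until the orchestrator asks this instance to stop.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// 1. Deregister first so new clients stop picking this instance.
	deregister()

	// 2. Give the registry update a moment to propagate to clients.
	time.Sleep(5 * time.Second)

	// 3. Drain: finish in-flight requests for up to 60 seconds
	//    (choose a window that matches your connection lifetime).
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("drain window expired, forcing close: %v", err)
	}
}

// deregister is a stub; in practice, call your registry's deregister API.
func deregister() {}
```

Maximum connection age is usually a separate server setting (gRPC servers, for instance, expose a MaxConnectionAge keepalive parameter) so that long-lived connections rotate on their own rather than piling up until shutdown.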
Zonal failures without locality awareness cascade. Your payment service has 30 instances in zone A and 30 in zone B. Zone A loses power. Without zone-aware routing, all traffic shifts to zone B's 30 instances, doubling their load instantly. They overload, latency spikes from 10 milliseconds to 500 milliseconds, and upstream calls time out and fail. Implement capped spillover: prefer the local zone, allow only 10 to 20% cross-zone traffic normally, and gradually ramp the cross-zone percentage (add 10% every 10 seconds) during failures to avoid shock loading.
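One way to sketch capped spillover with a gradual ramp, again with illustrative names and thresholds: the picker keeps roughly 90% of traffic local, and during a local-zone failure a background call to rampSpillover raises the cross-zone cap by 10 points every 10 seconds.

```go
package discovery

import (
	"math/rand"
	"sync/atomic"
	"time"
)

type endpoint struct {
	addr string
	zone string
}

// zonePicker prefers same-zone endpoints and caps how much traffic may
// spill to other zones, so a zonal failure does not shock-load the
// surviving zone all at once.
type zonePicker struct {
	localZone string
	spillPct  atomic.Int64 // 0-100, defaults to the normal 10% cap
}

func newZonePicker(localZone string) *zonePicker {
	p := &zonePicker{localZone: localZone}
	p.spillPct.Store(10)
	return p
}

// rampSpillover gradually raises the cross-zone cap toward maxPct.
// Run it when the local zone is detected as unhealthy.
func (p *zonePicker) rampSpillover(maxPct int64) {
	for p.spillPct.Load() < maxPct {
		time.Sleep(10 * time.Second)
		p.spillPct.Add(10)
	}
}

// pick routes most requests to the local zone and at most spillPct% of
// them cross-zone.
func (p *zonePicker) pick(endpoints []endpoint) endpoint {
	var local, remote []endpoint
	for _, e := range endpoints {
		if e.zone == p.localZone {
			local = append(local, e)
		} else {
			remote = append(remote, e)
		}
	}
	if len(local) == 0 && len(remote) == 0 {
		return endpoint{} // nothing registered
	}
	useRemote := len(local) == 0 ||
		(len(remote) > 0 && rand.Int63n(100) < p.spillPct.Load())
	if useRemote && len(remote) > 0 {
		return remote[rand.Intn(len(remote))]
	}
	return local[rand.Intn(len(local))]
}
```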
💡 Key Takeaways
• Stale caches cause blackholing for up to 60 seconds; mitigate with a 10 to 30 second TTL, 1 to 3 second connection timeouts, and circuit breakers that open after 5 failures in 10 seconds
• Thundering herds occur when 10,000 clients with synchronized TTLs query simultaneously; use jittered TTLs (25 to 35 seconds, randomized) and exponential backoff (1s, 2s, 4s, 8s)
• Graceful shutdown gaps break 500 open connections despite deregistration; implement 30 to 120 second drain windows matching the connection lifetime and enforce a max connection age of 5 to 10 minutes
• Zonal failures without locality awareness double the load on the surviving zone instantly, spiking latency from 10ms to 500ms; use capped 10 to 20% cross-zone spillover with a gradual ramp (add 10% per 10 seconds)
• Registry overload during deploys creates write storms when 5,000 instances heartbeat simultaneously; stagger deployments to 5 to 10% of the fleet at a time in 30 to 60 second waves (see the sketch after this list)
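A sketch of that staggered rollout, with restart standing in for whatever your deploy tooling does per instance; the fraction and pause are the ones suggested above.

```go
package deploy

import (
	"fmt"
	"time"
)

// rolloutInWaves restarts the fleet in small batches so the registry
// sees a trickle of deregister/register events instead of a write storm.
// A batchFraction of 0.05-0.10 with a 30-60 second pause between waves
// matches the numbers above.
func rolloutInWaves(instances []string, batchFraction float64, pause time.Duration,
	restart func(instance string) error) error {

	batchSize := int(float64(len(instances)) * batchFraction)
	if batchSize < 1 {
		batchSize = 1
	}
	for start := 0; start < len(instances); start += batchSize {
		end := start + batchSize
		if end > len(instances) {
			end = len(instances)
		}
		for _, inst := range instances[start:end] {
			if err := restart(inst); err != nil {
				return fmt.Errorf("restart %s: %w", inst, err)
			}
		}
		// Let re-registrations and health checks settle before the next wave.
		if end < len(instances) {
			time.Sleep(pause)
		}
	}
	return nil
}
```

For example, rolloutInWaves(fleet, 0.05, 45*time.Second, restartFn) restarts 5% of the fleet per wave with a 45-second pause between waves.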
📌 Examples
Netflix uses zone-aware routing that keeps 90%+ of traffic in-zone, spilling over only when local capacity drops below 80%, which prevents cascading zone failures
Kubernetes preStop hooks delay pod termination for 30 seconds after deregistration, giving kube-proxy time to update iptables rules and connections time to drain
Google's Maglev uses consistent hashing to quickly remap traffic when backends fail, but includes slow start for new instances (10% capacity initially, ramping up over 2 minutes) to avoid overload