Common Failure Modes and Mitigation Strategies in Service Discovery
Registry Unavailability
When the registry is unreachable, new instances cannot register and clients cannot discover topology changes. Mitigation: clients cache the last known good data and keep operating on stale information. Services should also start with a bootstrap cache (a list of known instances) so they can function even if the registry is down at startup. Target registry availability of 99.99% (roughly 53 minutes of downtime per year) to keep these windows short.
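The fallback behavior described above can be sketched as a small resolver. This is a minimal illustration, not any particular registry client: the class name CachingResolver, the fetch callable, and the addresses used are all hypothetical, and a real client would also need cache expiry and thread safety.

```python
from typing import Callable, Dict, List


class CachingResolver:
    """Resolve service instances, falling back to cached data when the registry fails.

    `fetch` is a hypothetical callable that queries the registry and raises
    an exception when the registry is unreachable.
    """

    def __init__(self, fetch: Callable[[str], List[str]],
                 bootstrap: Dict[str, List[str]]) -> None:
        self._fetch = fetch
        # Seed the cache with the bootstrap list so lookups work even if the
        # registry is down at startup.
        self._cache: Dict[str, List[str]] = {
            name: list(addrs) for name, addrs in bootstrap.items()
        }

    def resolve(self, service: str) -> List[str]:
        try:
            instances = self._fetch(service)
        except Exception:
            # Registry unreachable: serve last known good (possibly stale) data.
            return list(self._cache.get(service, []))
        # Successful lookup: remember it as the new last known good data.
        self._cache[service] = list(instances)
        return list(instances)
```

A caller would construct the resolver once with its bootstrap list and call resolve("orders") on each lookup; while the registry is down, the result is whatever was last fetched successfully, or the bootstrap entry if nothing has been fetched yet.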
Stale Cache Routing
Clients route to instances that have failed but remain in the cache. During the propagation delay (health check plus notification), traffic goes to dead instances. Mitigation: client-side health checks detect unresponsive instances locally, circuit breakers stop routing to instances that keep failing, and failed requests are retried against a different instance. Accept that some requests will still fail during the propagation window.
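The combination of a circuit breaker and retry against a different instance might look like the sketch below. The names InstanceBreaker and call_with_failover, the thresholds, and the send callable are illustrative assumptions rather than part of any specific library.

```python
import time
from typing import Callable, Dict, List, Optional


class InstanceBreaker:
    """Per-instance circuit breaker: opens after repeated consecutive failures."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._cooldown = cooldown_s
        self._clock = clock
        self._failures: Dict[str, int] = {}
        self._opened_at: Dict[str, float] = {}

    def available(self, instance: str) -> bool:
        opened = self._opened_at.get(instance)
        if opened is None:
            return True
        # Once the cooldown has elapsed, allow a trial request again.
        return self._clock() - opened >= self._cooldown

    def record_success(self, instance: str) -> None:
        self._failures.pop(instance, None)
        self._opened_at.pop(instance, None)

    def record_failure(self, instance: str) -> None:
        count = self._failures.get(instance, 0) + 1
        self._failures[instance] = count
        if count >= self._threshold:
            # Open (or re-open) the breaker and restart the cooldown.
            self._opened_at[instance] = self._clock()


def call_with_failover(instances: List[str], send: Callable[[str], str],
                       breaker: InstanceBreaker) -> str:
    """Try each available instance in order and return the first success."""
    last_error: Optional[Exception] = None
    for instance in instances:
        if not breaker.available(instance):
            continue  # breaker is open: skip this instance
        try:
            result = send(instance)
        except Exception as exc:
            breaker.record_failure(instance)
            last_error = exc
            continue  # retry on the next instance
        breaker.record_success(instance)
        return result
    raise RuntimeError("no healthy instance available") from last_error
```

In this sketch a dead instance still costs a few failed attempts before its breaker opens, which matches the point above that some failures during the propagation window have to be accepted.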
Registration Failures
A service starts but fails to register. It is healthy and accepting connections but invisible to discovery, so callers cannot route to it. Mitigation: retry registration with exponential backoff, and fail service startup if registration still does not succeed (do not run unregistered). Monitor for instances that are running but not registered, and alert on registration failure rates.
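One way to combine the retry and fail-fast advice is to retry a bounded number of times and abort startup if every attempt fails. The function name register_or_fail, the register callable, and the default delays below are assumptions chosen for illustration.

```python
import time
from typing import Callable


def register_or_fail(register: Callable[[], None],
                     max_attempts: int = 5,
                     base_delay_s: float = 0.5,
                     max_delay_s: float = 8.0,
                     sleep: Callable[[float], None] = time.sleep) -> None:
    """Call `register` until it succeeds; raise if all attempts fail.

    `register` is a hypothetical callable that raises on failure.
    """
    for attempt in range(max_attempts):
        try:
            register()
            return
        except Exception as exc:
            if attempt == max_attempts - 1:
                # Out of attempts: abort startup rather than run unregistered.
                raise RuntimeError(
                    "registration failed; refusing to start unregistered"
                ) from exc
            # Exponential backoff: 0.5s, 1s, 2s, ... capped at max_delay_s.
            sleep(min(max_delay_s, base_delay_s * 2 ** attempt))
```

Calling this before the service begins accepting traffic means an instance either becomes visible to discovery or never starts serving, which removes the "healthy but invisible" state.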
Split Brain Scenarios
Network partitions can create a split-brain condition in which registry nodes hold different views of the topology. Clients connected to different nodes see different instance lists, so some clients can reach services that others cannot. Mitigation: use quorum-based registries, implement client-side detection of anomalous routing patterns, and design for eventual consistency.
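The quorum rule itself is simple arithmetic: a node acts only while it can reach a strict majority of the cluster, so at most one side of a partition can make progress. The helper below is a simplified sketch with a hypothetical name; real quorum-based registries implement this inside a full consensus protocol.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True if the reachable nodes form a strict majority.

    `reachable_nodes` counts the local node plus every peer it can reach.
    """
    return reachable_nodes >= cluster_size // 2 + 1


# A 5-node cluster split 3/2: only the 3-node side keeps quorum.
majority_side = has_quorum(3, 5)
minority_side = has_quorum(2, 5)
```

Note that an even-sized cluster split in half leaves neither side with quorum, which is one reason odd cluster sizes are common.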
Thundering Herd on Recovery
When a registry recovers from an outage, all clients may reconnect and fetch updates at the same time, which can overwhelm the registry. Mitigation: jittered reconnection delays spread the load over time. Rate-limit client connections during recovery. Pre-warm registry caches before accepting traffic.
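A jittered reconnection delay can be as small as the sketch below, which picks a random delay under an exponentially growing ceiling. The function name reconnect_delay and the default base and cap values are assumptions for illustration.

```python
import random
from typing import Callable


def reconnect_delay(attempt: int,
                    base_s: float = 1.0,
                    cap_s: float = 60.0,
                    rng: Callable[[], float] = random.random) -> float:
    """Return a randomized delay in seconds before reconnect attempt `attempt`.

    The ceiling doubles with each attempt up to `cap_s`; the actual delay is
    drawn uniformly from [0, ceiling) so clients do not reconnect in lockstep.
    """
    ceiling = min(cap_s, base_s * 2 ** attempt)
    return rng() * ceiling
```

Because each client draws its own random delay, clients that lost their connection at the same moment reconnect at different times instead of arriving as a single burst.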