Common Failure Modes and Mitigation Strategies in Service Discovery
Service discovery systems fail in predictable ways under real-world conditions. Understanding these failure modes and their mitigations separates production-ready systems from fragile ones.
Stale cache blackholing is the most common issue. A client caches endpoints with a 60-second TTL. An instance crashes, but the client keeps sending traffic to the dead endpoint for up to 60 seconds, causing timeouts. Mitigation requires a layered defense: reduce the cache TTL to 10 to 30 seconds, implement connection-level timeouts of 1 to 3 seconds, use circuit breakers that open after 5 consecutive failures within 10 seconds, and prefer push-based updates for critical paths so failures propagate in under 1 second.
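A minimal sketch of that layered client-side defense in Go (the type names, thresholds, and the caller-supplied resolve function are illustrative, not from any particular library): a short-TTL endpoint cache, a hard request timeout, and a circuit breaker that trips after 5 consecutive failures within 10 seconds.

```go
package discovery

import (
	"net/http"
	"sync"
	"time"
)

// httpClient enforces the 1-to-3-second request budget so a dead
// endpoint fails fast instead of hanging callers.
var httpClient = &http.Client{Timeout: 2 * time.Second}

// endpointCache holds resolved endpoints for a short TTL so a dead
// instance is dropped within seconds, not minutes.
type endpointCache struct {
	mu        sync.Mutex
	endpoints []string
	expires   time.Time
	resolve   func() []string // registry lookup, supplied by the caller
}

func (c *endpointCache) Get() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Now().After(c.expires) {
		c.endpoints = c.resolve()
		c.expires = time.Now().Add(15 * time.Second) // within the 10-30s range
	}
	return c.endpoints
}

// breaker opens after 5 consecutive failures inside a 10-second window
// and rejects calls for a 30-second cooldown.
type breaker struct {
	mu          sync.Mutex
	failures    int
	windowStart time.Time
	openUntil   time.Time
}

func (b *breaker) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return time.Now().After(b.openUntil)
}

func (b *breaker) Record(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	if err == nil {
		b.failures = 0
		return
	}
	// Start a fresh failure window if the previous one has aged out.
	if now.Sub(b.windowStart) > 10*time.Second {
		b.failures = 0
		b.windowStart = now
	}
	b.failures++
	if b.failures >= 5 {
		b.openUntil = now.Add(30 * time.Second)
		b.failures = 0
	}
}
```

A production breaker would also allow a trial request after the cooldown (a half-open state); that is omitted here for brevity.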
Thundering herds occur when many clients refresh simultaneously. If 10,000 clients have synchronized 30-second TTLs, they all query the registry at the same moment, creating a 10,000x traffic spike. The registry overloads, responses slow, clients time out and retry, and the retries amplify the storm. Mitigation uses jittered TTLs: instead of exactly 30 seconds, use 25 to 35 seconds, randomly distributed. Add exponential backoff on failures: retry after 1 second, then 2 seconds, then 4 seconds, not immediately.
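A sketch of both mitigations, using hypothetical helper names: the jitter spreads a 30-second base TTL across roughly 25 to 35 seconds, and the backoff doubles the retry delay with a cap and its own jitter.

```go
package discovery

import (
	"math/rand"
	"time"
)

// jitteredTTL spreads cache expirations so clients do not refresh in
// lockstep: a 30-second base TTL becomes a uniform value in [25s, 35s).
func jitteredTTL(base time.Duration) time.Duration {
	spread := base / 3 // e.g. 10s of spread for a 30s base
	return base - spread/2 + time.Duration(rand.Int63n(int64(spread)))
}

// retryBackoff returns the delay before retry number `attempt`
// (0, 1, 2, ...): 1s, 2s, 4s, 8s, ... capped at 30s, with extra jitter
// so failing clients do not retry in unison either.
func retryBackoff(attempt int) time.Duration {
	if attempt < 0 {
		attempt = 0
	}
	if attempt > 5 {
		attempt = 5 // keep the exponent, and hence the delay, bounded
	}
	d := time.Second << attempt
	if d > 30*time.Second {
		d = 30 * time.Second
	}
	return d + time.Duration(rand.Int63n(int64(d/4)))
}
```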
Graceful shutdown gaps cause connection resets. An instance deregisters from the registry, which pushes updates to clients within 1 second. But the instance has 500 open HTTP/2 streams that will run for another 30 seconds. New requests stop arriving immediately, yet existing connections break when the instance terminates, causing user-facing errors. The solution is a drain window: after deregistering, keep serving existing connections for 30 to 120 seconds (matching the connection lifetime) before terminating. Enforce a maximum connection age on servers so connections naturally rotate every 5 to 10 minutes.
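A sketch of the deregister-then-drain sequence using Go's net/http; deregister() is a placeholder for whatever registry client you use, and the 5-second propagation pause and 60-second drain window are example values.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	// Block until the orchestrator asks this instance to stop.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// 1. Deregister first so new clients stop picking this instance.
	deregister()

	// 2. Give the registry update a moment to propagate to clients.
	time.Sleep(5 * time.Second)

	// 3. Drain: finish in-flight requests for up to 60 seconds
	//    (choose a window that matches your connection lifetime).
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("drain window expired, forcing close: %v", err)
	}
}

// deregister is a stub; in practice, call your registry's deregister API.
func deregister() {}
```

Maximum connection age is usually a separate server setting (gRPC servers, for instance, expose a MaxConnectionAge keepalive parameter) so that long-lived connections rotate on their own rather than piling up until shutdown.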
Zonal failures without locality awareness cascade. Your payment service has 30 instances in zone A and 30 in zone B. Zone A loses power. Without zone-aware routing, all traffic shifts to zone B's 30 instances, doubling their load instantly. They overload, latency spikes from 10 milliseconds to 500 milliseconds, and upstream calls time out and fail. Implement capped spillover: prefer the local zone, allow only 10 to 20% cross-zone traffic normally, and gradually ramp the cross-zone percentage (add 10% every 10 seconds) during failures to avoid shock loading.
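One way to sketch capped spillover with a gradual ramp, again with illustrative names and thresholds: the picker keeps roughly 90% of traffic local, and during a local-zone failure a background call to rampSpillover raises the cross-zone cap by 10 points every 10 seconds.

```go
package discovery

import (
	"math/rand"
	"sync/atomic"
	"time"
)

type endpoint struct {
	addr string
	zone string
}

// zonePicker prefers same-zone endpoints and caps how much traffic may
// spill to other zones, so a zonal failure does not shock-load the
// surviving zone all at once.
type zonePicker struct {
	localZone string
	spillPct  atomic.Int64 // 0-100, defaults to the normal 10% cap
}

func newZonePicker(localZone string) *zonePicker {
	p := &zonePicker{localZone: localZone}
	p.spillPct.Store(10)
	return p
}

// rampSpillover gradually raises the cross-zone cap toward maxPct.
// Run it when the local zone is detected as unhealthy.
func (p *zonePicker) rampSpillover(maxPct int64) {
	for p.spillPct.Load() < maxPct {
		time.Sleep(10 * time.Second)
		p.spillPct.Add(10)
	}
}

// pick routes most requests to the local zone and at most spillPct% of
// them cross-zone.
func (p *zonePicker) pick(endpoints []endpoint) endpoint {
	var local, remote []endpoint
	for _, e := range endpoints {
		if e.zone == p.localZone {
			local = append(local, e)
		} else {
			remote = append(remote, e)
		}
	}
	if len(local) == 0 && len(remote) == 0 {
		return endpoint{} // nothing registered
	}
	useRemote := len(local) == 0 ||
		(len(remote) > 0 && rand.Int63n(100) < p.spillPct.Load())
	if useRemote && len(remote) > 0 {
		return remote[rand.Intn(len(remote))]
	}
	return local[rand.Intn(len(local))]
}
```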
💡 Key Takeaways
• Stale caches cause blackholing for up to 60 seconds; mitigate with a 10 to 30 second TTL, 1 to 3 second connection timeouts, and circuit breakers that open after 5 failures in 10 seconds
• Thundering herds occur when 10,000 clients with synchronized TTLs query simultaneously; use jittered TTLs (25 to 35 seconds, randomized) and exponential backoff (1s, 2s, 4s, 8s)
• Graceful shutdown gaps break 500 open connections despite deregistration; implement 30 to 120 second drain windows matching the connection lifetime and enforce a max connection age of 5 to 10 minutes
• Zonal failures without locality awareness double the load on the surviving zone instantly, spiking latency from 10ms to 500ms; use capped 10 to 20% cross-zone spillover with a gradual ramp (add 10% per 10 seconds)
• Registry overload during deploys creates write storms when 5,000 instances heartbeat simultaneously; stagger deployments to 5 to 10% of the fleet at a time in 30 to 60 second waves (see the sketch after this list)
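A sketch of that staggered rollout, with restart standing in for whatever your deploy tooling does per instance; the fraction and pause are the ones suggested above.

```go
package deploy

import (
	"fmt"
	"time"
)

// rolloutInWaves restarts the fleet in small batches so the registry
// sees a trickle of deregister/register events instead of a write storm.
// A batchFraction of 0.05-0.10 with a 30-60 second pause between waves
// matches the numbers above.
func rolloutInWaves(instances []string, batchFraction float64, pause time.Duration,
	restart func(instance string) error) error {

	batchSize := int(float64(len(instances)) * batchFraction)
	if batchSize < 1 {
		batchSize = 1
	}
	for start := 0; start < len(instances); start += batchSize {
		end := start + batchSize
		if end > len(instances) {
			end = len(instances)
		}
		for _, inst := range instances[start:end] {
			if err := restart(inst); err != nil {
				return fmt.Errorf("restart %s: %w", inst, err)
			}
		}
		// Let re-registrations and health checks settle before the next wave.
		if end < len(instances) {
			time.Sleep(pause)
		}
	}
	return nil
}
```

For example, rolloutInWaves(fleet, 0.05, 45*time.Second, restartFn) restarts 5% of the fleet per wave with a 45-second pause between waves.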
📌 Examples
Netflix uses zone-aware routing that keeps 90%+ of traffic in-zone, spilling over only when local capacity drops below 80%, which prevents cascading zone failures
Kubernetes preStop hooks delay pod termination for 30 seconds after deregistration, giving kube-proxy time to update iptables rules and connections time to drain
Google's Maglev uses consistent hashing to quickly remap traffic when backends fail, but includes slow start for new instances (10% capacity initially, ramping up over 2 minutes) to avoid overload