
Service Discovery at Scale: Implementation Details and Capacity Planning

Registry Sizing

Estimate registry capacity as number of services × instances per service × metadata size per instance. A system with 500 services, 10 instances each, and 1KB of metadata per instance stores about 5MB of registry data, which fits comfortably in memory. Query load depends on the discovery pattern: pull-based load scales with the number of clients and how often they poll, while push-based load scales with the rate of registration changes. A registry cluster of 3-5 nodes handles most deployments.
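The sizing arithmetic above can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and it assumes 1KB means 1000 bytes and that every instance carries the same amount of metadata.

```python
def registry_size_bytes(services: int, instances_per_service: int, metadata_bytes: int) -> int:
    """Estimated registry payload: one metadata record per registered instance."""
    return services * instances_per_service * metadata_bytes


# Figures from the text; 1KB is taken as 1000 bytes (an assumption).
size = registry_size_bytes(services=500, instances_per_service=10, metadata_bytes=1000)
print(f"{size / 1_000_000:.1f} MB")  # 5.0 MB
```

Even a tenfold increase in any one factor keeps the data set in the tens of megabytes, which is why memory is rarely the limiting resource for a registry.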

Client Cache Management

Clients maintain local caches to reduce registry load and survive registry outages. The number of cached entries equals services accessed × instances per service. Most clients access 10-50 services, which keeps the cache small. Implement cache expiration so that stale instances age out and the cache does not grow without bound. Refresh caches on a schedule or via push notifications. Monitor cache hit rates to confirm the cache is effective.
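A minimal sketch of such a client cache follows, assuming a time-to-live (TTL) expiry policy and a fallback to stale entries when the registry cannot be reached. The class name `DiscoveryCache`, the `fetch` callback, and the 30-second default TTL are all hypothetical choices for illustration, not the API of any particular registry client.

```python
import time
from typing import Callable, Dict, List, Tuple


class DiscoveryCache:
    """Client-side cache of service name -> instance addresses (illustrative only)."""

    def __init__(self, fetch: Callable[[str], List[str]], ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch        # hypothetical callback that queries the registry
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, List[str]]] = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, service: str) -> List[str]:
        now = self._clock()
        entry = self._entries.get(service)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        try:
            instances = self._fetch(service)
        except Exception:
            if entry is not None:
                return entry[1]    # registry unreachable: serve the stale entry
            raise                  # nothing cached, so the failure must surface
        self._entries[service] = (now, instances)
        return instances

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Serving a stale entry during an outage trades freshness for availability. Whether that trade is acceptable depends on how quickly instances churn in the deployment.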

Health Check Tuning

Health check frequency affects both detection speed and registry load. Checking 5000 instances every 5 seconds generates 1000 checks/second. Balance detection speed against load: check critical services frequently (every 5s) and less critical services less often (every 30s). Consider heartbeat-based health checking, where services push their status instead of the registry polling them.
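The load calculation generalizes to several criticality tiers by summing instances ÷ interval per tier. In the sketch below, the function name and the 1000/4000 split between critical and less critical instances are assumptions chosen only to illustrate the effect of tiering.

```python
from typing import Dict


def checks_per_second(tiers: Dict[int, int]) -> float:
    """tiers maps a check interval in seconds to the number of instances on it."""
    return sum(count / interval for interval, count in tiers.items())


# Figure from the text: every instance checked every 5 seconds.
print(checks_per_second({5: 5000}))  # 1000.0

# Hypothetical split: 1000 critical instances at 5s, 4000 others at 30s.
print(round(checks_per_second({5: 1000, 30: 4000}), 1))  # 333.3
```

Under that assumed split, tiering cuts check traffic to roughly a third while keeping fast detection for the critical services.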

🎯 When To Use: Use a dedicated registry for systems with 50+ services or dynamic scaling. For smaller, static deployments, DNS or configuration files may suffice.

Multi-Region Considerations

Global deployments need regional registries to minimize cross-region latency. Each region runs its own registry cluster, and services register in their local region. Cross-region discovery requires either registry federation (registries sync data with each other) or global load balancing. Prefer locality: route to same-region instances first, and go cross-region only when no local instance is available.
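The locality preference can be sketched as a filter over the discovered instances. The `Instance` type and `pick_candidates` function are hypothetical, and the sketch assumes the client already holds a federated view that includes instances from other regions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instance:
    address: str
    region: str
    healthy: bool


def pick_candidates(instances: List[Instance], local_region: str) -> List[Instance]:
    """Return healthy same-region instances, or all healthy instances if none are local."""
    healthy = [i for i in instances if i.healthy]
    local = [i for i in healthy if i.region == local_region]
    return local if local else healthy
```

A load balancer would then choose among the returned candidates. An empty result means no healthy instance exists in any region.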

Monitoring and Alerting

Key metrics: registration success rate, health check pass rate, query latency, cache hit rate, and propagation delay. Alert on: registration failures exceeding baseline, spikes in health check timeouts, query latency exceeding 100ms, and instances that are running but unregistered. Put the service topology on a dashboard to visualize dependencies.
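Two of these alert conditions are simple enough to sketch directly. The function below is illustrative only: its name and inputs are hypothetical, the set of running instances is assumed to come from an external source such as an orchestrator, and the text does not say which latency statistic (mean, p99, or other) the 100ms limit applies to, so the caller supplies a single value.

```python
from typing import List, Set


def find_alerts(running: Set[str], registered: Set[str],
                query_latency_ms: float, latency_limit_ms: float = 100.0) -> List[str]:
    """Evaluate two alert conditions from the text and return human-readable messages."""
    alerts: List[str] = []
    # Instances that are running but absent from the registry.
    unregistered = sorted(running - registered)
    if unregistered:
        alerts.append("running but unregistered: " + ", ".join(unregistered))
    # Registry query latency above the limit (100 ms in the text).
    if query_latency_ms > latency_limit_ms:
        alerts.append(
            f"query latency {query_latency_ms:.0f} ms exceeds {latency_limit_ms:.0f} ms"
        )
    return alerts
```

The baseline-relative alerts (registration failures, health check timeout spikes) need historical data and are left out of this sketch.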

💡 Key Takeaways
Registry sizing: 500 services × 10 instances × 1KB = 5MB of data; a 3-5 node cluster handles most deployments.
Health check load: 5000 instances at 5-second intervals = 1000 checks/second. Tune frequency by service criticality.
Multi-region: regional registries minimize latency, federation syncs data between them, and routing prefers locality with cross-region fallback.
📌 Interview Tips
1. Calculate registry data size: services × instances × metadata size, to show that memory requirements are modest
2. Calculate health check load: instances ÷ check interval = checks per second, to size registry capacity
3. Mention key metrics: registration success rate, health check pass rate, query latency, propagation delay