
Service Discovery at Scale: Implementation Details and Capacity Planning

Operating service discovery at scale requires careful capacity planning and deliberate implementation choices. Registry write-path throughput determines maximum fleet size. If instances heartbeat every 20 seconds, a registry serving 50,000 instances processes 2,500 heartbeats per second before accounting for retries and jitter. Add 2x headroom for bursts during deployments, when instances register and deregister rapidly, and you need roughly 5,000 writes per second of capacity. With typical registry write latencies of 5 to 20 milliseconds, provision enough CPU and disk I/O to sustain that throughput.

The read path scales differently. Avoid per-request registry lookups at all costs: if 1,000 application servers each make 100 requests per second and every request queries the registry, that is 100,000 reads per second. Instead, use hierarchical caching. Clients maintain in-process caches refreshed every 10 to 30 seconds, or node-local agents fan out updates over push streams. This reduces registry reads from N times M service lookups (millions) to N agents (hundreds). Target sub-millisecond lookup latency from the local cache, accepting 1 to 5 seconds of staleness.

Replication topology affects both availability and consistency. Netflix replicates Eureka within a region (typically 3 nodes) for availability, with asynchronous cross-region replication for disaster recovery. Google shards registry state by service or region to limit blast radius, so a registry failure affects only a subset of services. Kubernetes centralizes registry state in etcd with a 3- or 5-node quorum, trading write availability during partitions for strong consistency.

Observability signals are critical for operating at scale. On the registry, monitor heartbeat rate (it should match expected instance count divided by heartbeat interval), eviction rate (spikes indicate problems), self-preservation state (entering it should be rare), and write p99 latency (target under 50 milliseconds). On the data plane, track endpoint set size per service, update-to-apply latency (the time from a registry change to clients seeing it, target under 5 seconds), the percentage of requests hitting unhealthy endpoints (should stay under 1%), and cross-zone traffic percentage (typically 10 to 20% at steady state).
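To make the write-path and read-path arithmetic concrete, here is a back-of-the-envelope sketch in Go. The fleet size, heartbeat interval, request rate, headroom factor, and refresh interval are the illustrative figures from the text, not recommendations.

```go
package main

import "fmt"

func main() {
	// Illustrative numbers from the text; adjust for your own fleet.
	instances := 50_000          // registered instances
	heartbeatIntervalSec := 20.0 // seconds between heartbeats per instance
	headroom := 2.0              // burst headroom for deploys, retries, jitter

	// Write path: every instance heartbeats once per interval.
	baseWritesPerSec := float64(instances) / heartbeatIntervalSec
	provisionedWritesPerSec := baseWritesPerSec * headroom
	fmt.Printf("write path: %.0f heartbeats/s base, %.0f writes/s provisioned\n",
		baseWritesPerSec, provisionedWritesPerSec)

	// Read path: per-request lookups vs. cached lookups via node-local agents.
	appServers := 1_000
	requestsPerServerPerSec := 100.0
	naiveReadsPerSec := float64(appServers) * requestsPerServerPerSec
	cacheRefreshSec := 15.0 // each agent refreshes its cache every 10-30s
	cachedReadsPerSec := float64(appServers) / cacheRefreshSec
	fmt.Printf("read path: %.0f reads/s per-request vs. %.0f reads/s cached\n",
		naiveReadsPerSec, cachedReadsPerSec)
}
```

Running this prints the 2,500 and 5,000 writes per second figures above, and shows the read path dropping from 100,000 registry reads per second to well under 100 once lookups are served from node-local caches.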
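A minimal sketch of the in-process cache pattern, assuming a caller-supplied fetch function (hypothetical) that queries the registry. Lookups are answered from a memory snapshot under a read lock, and a background goroutine refreshes the snapshot on a fixed interval, accepting the bounded staleness described above.

```go
package discovery

import (
	"sync"
	"time"
)

// EndpointCache serves service endpoints from an in-memory snapshot,
// refreshed in the background instead of on every request.
type EndpointCache struct {
	mu        sync.RWMutex
	endpoints map[string][]string                 // service name -> endpoint addresses
	fetch     func() (map[string][]string, error) // hypothetical registry query
}

// NewEndpointCache loads an initial snapshot and starts a periodic refresh.
func NewEndpointCache(fetch func() (map[string][]string, error), refresh time.Duration) *EndpointCache {
	c := &EndpointCache{endpoints: map[string][]string{}, fetch: fetch}
	c.refreshOnce() // best-effort initial load
	go func() {
		for range time.Tick(refresh) {
			c.refreshOnce()
		}
	}()
	return c
}

// Lookup returns the cached endpoints for a service: sub-millisecond,
// but up to one refresh interval stale.
func (c *EndpointCache) Lookup(service string) []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.endpoints[service]
}

func (c *EndpointCache) refreshOnce() {
	snapshot, err := c.fetch()
	if err != nil {
		return // keep serving the last good snapshot on registry errors
	}
	c.mu.Lock()
	c.endpoints = snapshot
	c.mu.Unlock()
}
```

A caller would construct it as NewEndpointCache(queryRegistry, 15*time.Second) and route requests through Lookup; continuing to serve the last known snapshot when a refresh fails mirrors the availability bias most discovery clients choose.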
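The observability targets above translate naturally into alert checks. The sketch below is hypothetical: the observed values are assumed to come from your metrics system, and the 10% heartbeat-rate band is an illustrative threshold, not a standard.

```go
package main

import "fmt"

// health flags registry-side signals against the targets described above.
func health(observedHeartbeatsPerSec, expectedInstances, heartbeatIntervalSec, writeP99Ms float64) []string {
	var alerts []string

	// Heartbeat rate should roughly match instances / interval.
	expectedRate := expectedInstances / heartbeatIntervalSec
	if observedHeartbeatsPerSec < 0.9*expectedRate || observedHeartbeatsPerSec > 1.1*expectedRate {
		alerts = append(alerts, fmt.Sprintf("heartbeat rate %.0f/s deviates from expected %.0f/s",
			observedHeartbeatsPerSec, expectedRate))
	}

	// Write p99 latency target from the text: under 50 ms.
	if writeP99Ms > 50 {
		alerts = append(alerts, fmt.Sprintf("write p99 %.0fms exceeds 50ms target", writeP99Ms))
	}
	return alerts
}

func main() {
	// Example: 2,100 heartbeats/s against a 50,000-instance fleet on a
	// 20-second interval, with a 62 ms write p99 -- both checks fire.
	for _, a := range health(2100, 50_000, 20, 62) {
		fmt.Println("ALERT:", a)
	}
}
```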
💡 Key Takeaways
Registry write capacity must handle N instances divided by heartbeat interval, with 2x headroom; 50,000 instances at 20-second heartbeats require about 5,000 writes per second including retries
Per-request registry lookups do not scale; 1,000 servers at 100 requests per second create 100,000 reads per second; use node-local agents or in-process caching refreshed every 10 to 30 seconds
Hierarchical caching reduces registry load from N times M lookups (millions) to N agents (hundreds); clients get sub-millisecond lookups from memory while accepting a 1 to 5 second update delay
Sharding registry state by service or region limits blast radius; a failure affects a subset of traffic rather than the entire fleet, as it would with a centralized registry
Key observability: heartbeat rate matching fleet size, write p99 latency under 50ms, update propagation under 5 seconds, unhealthy request percentage under 1%, cross-zone traffic 10 to 20%
📌 Examples
Netflix Eureka handles tens of thousands of instances with 30-second heartbeats and client-side caching, keeping registry reads minimal and lookups sub-millisecond from local memory
Kubernetes stores all cluster state in an etcd cluster (3 or 5 nodes) with watch streams fanning out updates, limiting direct etcd queries to control plane components only
Google shards its registry by service and region with streaming xDS updates, maintaining sub-second propagation for critical services while keeping registry load distributed