
Registry Consistency: AP vs CP Trade-offs in Service Discovery

Service registries face a fundamental distributed-systems choice from the CAP theorem: prioritize availability (AP) or consistency (CP) during network partitions. This decision profoundly affects operational behavior during failures.

AP registries like Netflix Eureka favor availability. During a network partition, registry nodes continue accepting reads and writes even if they cannot communicate with each other. When too many heartbeats fail (indicating a likely partition rather than mass instance failure), Eureka enters self-preservation mode and refuses to evict instances. This prevents cascading failures where a registry partition incorrectly removes half your fleet. The cost is eventual consistency: clients in different partitions may see different endpoint lists for 30 to 90 seconds. Netflix accepts this because misrouting to a stale endpoint is recoverable (the connection fails and the client retries), while losing registry availability during a partition would black-hole all traffic.

CP registries like etcd, or Consul in its strong-consistency mode, require quorum for both writes and reads. You get precise membership: when an instance is evicted, all clients see the update immediately and consistently. However, during a partition that splits the quorum, the registry becomes unavailable. Write operations block or fail, preventing new registrations and health updates; if your registry is down, services cannot discover endpoints.

The operational impact is significant. AP systems require clients to handle stale data gracefully: implement connection timeouts of 1 to 5 seconds, retry with exponential backoff, and use client-side health checks such as circuit breakers. CP systems require robust fallback to cached endpoints with a bounded TTL (typically 30 to 300 seconds) so traffic continues during registry unavailability.
At Netflix scale, with tens of thousands of instances, an AP registry remains available through datacenter partitions, accepting that 10 to 20% of requests might temporarily hit stale endpoints rather than suffering a total outage.
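The AP-side client behavior described above (short connection timeouts plus exponential backoff across a possibly stale endpoint list) can be sketched as follows. This is a minimal illustration, not Eureka's client: the endpoint addresses and the `send_request` callback are hypothetical stand-ins for a real HTTP call.

```python
import random
import time

def call_with_retry(endpoints, send_request, max_attempts=3,
                    connect_timeout=2.0, base_backoff=0.2):
    """Call a service using a (possibly stale) endpoint list.

    A failed connection to a stale endpoint is treated as recoverable:
    back off exponentially with jitter, then retry the next endpoint.
    """
    last_error = None
    for attempt in range(max_attempts):
        # Rotate through the list so a stale entry is not retried forever.
        endpoint = endpoints[attempt % len(endpoints)]
        try:
            return send_request(endpoint, timeout=connect_timeout)
        except ConnectionError as exc:
            last_error = exc
            # Full-jitter exponential backoff: up to 0.2s, 0.4s, 0.8s, ...
            time.sleep(random.uniform(0, base_backoff * (2 ** attempt)))
    raise last_error
```

The deliberately short `connect_timeout` is what makes stale data cheap: a misroute costs a few seconds at worst, after which the client moves on to a live endpoint.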
💡 Key Takeaways
AP registries like Netflix Eureka stay available during partitions but serve stale endpoints for 30 to 90 seconds; clients need 1 to 5 second connection timeouts and retry logic to handle failures
CP registries guarantee consistency but become unavailable during quorum loss; requires fallback to cached endpoints with 30 to 300 second TTL to continue routing traffic
Self-preservation mode in AP systems stops evictions when heartbeat loss exceeds a threshold (typically 15% of the fleet), preventing mass removal when the cause is a network issue rather than real failures
At Netflix scale with tens of thousands of instances, AP choice means accepting 10 to 20% temporary misroutes during partitions versus complete traffic blackout with CP unavailability
Strong consistency adds write latency: CP systems require quorum acknowledgment adding 5 to 50 milliseconds per registration versus AP immediate local writes under 1 millisecond
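The CP-side fallback from the takeaways above — keep serving the last known endpoint list for a bounded TTL when the registry loses quorum — can be sketched like this. The `CachedEndpoints` wrapper is hypothetical; `fetch` stands in for a real etcd or Consul query.

```python
import time

class CachedEndpoints:
    """Bounded-TTL fallback cache in front of a CP registry.

    While the registry is reachable, always return fresh results.
    When it is unreachable (e.g. quorum lost), serve the cached list,
    but only for up to `ttl` seconds after the last successful fetch.
    """

    def __init__(self, fetch, ttl=120.0, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._cached = None
        self._fetched_at = None

    def get(self):
        try:
            self._cached = self._fetch()
            self._fetched_at = self._clock()
            return self._cached
        except ConnectionError:
            # Registry unavailable: fall back to the cache, but never
            # serve endpoints older than the bounded TTL.
            if (self._cached is not None
                    and self._clock() - self._fetched_at <= self._ttl):
                return self._cached
            raise
```

The bounded TTL is the key design choice: without it, a long registry outage would silently route traffic to an arbitrarily stale fleet snapshot.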
📌 Examples
Netflix Eureka uses the AP model with self-preservation: during a datacenter partition it continues serving cached endpoints, accepting temporary inconsistency over downtime
etcd with strong consistency requires 3 or 5 node quorum; losing 2 of 3 nodes makes registry unavailable for writes, clients must use cached endpoints until recovery
Consul can operate in both modes: default requires leader for writes (CP) but can enable stale reads (AP) allowing queries during leader election
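The self-preservation behavior in the Eureka example reduces to a simple threshold check, modeled loosely on Eureka's renewal-percent-threshold (which defaults to 0.85, i.e. evictions stop once more than roughly 15% of expected heartbeats go missing). The function below is a simplified sketch, not Eureka's actual implementation:

```python
def should_self_preserve(expected_renewals, received_renewals,
                         renewal_percent_threshold=0.85):
    """Decide whether the registry should stop evicting instances.

    If the heartbeat renewals received in the last cycle fall below the
    threshold fraction of what the registry expects, the likely cause is
    a network partition rather than mass instance death, so evictions
    are suspended to avoid incorrectly removing a healthy fleet.
    """
    return received_renewals < expected_renewals * renewal_percent_threshold
```

For example, with 100 expected heartbeats per cycle, receiving only 80 trips self-preservation, while receiving 90 does not.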