CAP Theorem Trade-offs: Availability vs. Consistency Under Partitions
The CAP theorem states that when a network partition occurs, a distributed system must choose between availability (every request to a non-failing node receives a response) and consistency (every read reflects the most recent write, as if there were a single copy of the data). AP systems such as Amazon Dynamo and Cassandra prioritize availability: they keep serving reads and writes during a partition by accepting eventual consistency and relying on techniques such as sloppy quorums and hinted handoff. These systems can achieve single-digit-millisecond p50 latencies within a region, but they risk serving stale or conflicting data until replicas converge. CP systems such as Google Spanner, etcd, and ZooKeeper maintain strong consistency through quorum-based consensus protocols (Paxos, Raft, Zab), rejecting operations that cannot reach majority agreement during a partition. This guarantees correctness, but it can add tens of milliseconds to quorum writes that cross AZs or regions, and it makes the system unavailable to clients on the minority side of a partition.
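The difference between the two behaviors can be sketched in a few lines. This is a minimal illustration, not how any of the named systems is implemented; the function names and the 5-node, 3/2 split are assumptions chosen for the example.

```python
def majority(n: int) -> int:
    """Smallest number of nodes that forms a strict majority of n."""
    return n // 2 + 1


def cp_accepts_write(cluster_size: int, reachable: int) -> bool:
    """CP-style rule: accept only if a majority of the full cluster is reachable."""
    return reachable >= majority(cluster_size)


def ap_accepts_write(reachable: int) -> bool:
    """AP-style rule: accept as long as at least one replica can take the write."""
    return reachable >= 1


# Hypothetical 5-node cluster split 3 / 2 by a network partition.
cluster_size = 5
for side, reachable in (("majority side", 3), ("minority side", 2)):
    print(side,
          "| CP accepts:", cp_accepts_write(cluster_size, reachable),
          "| AP accepts:", ap_accepts_write(reachable))
```

Under the CP-style rule only the 3-node side keeps accepting writes, so clients on the 2-node side see errors or timeouts. Under the AP-style rule both sides accept writes, which is exactly why the two sides can diverge and must be reconciled later.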
Choose AP for user-facing features where eventual consistency is acceptable: social media feeds, shopping carts, recommendations, and session state. These workloads benefit from low latency and can tolerate temporary inconsistencies, which are later reconciled through conflict resolution (last-write-wins, vector clocks, CRDTs). Choose CP for invariants that must never break: financial transactions, inventory counts, seat reservations, and compliance audit logs. The cost of choosing CP is higher tail latency and potential unavailability during network splits. Google Spanner, for example, adds roughly 10 to 50 milliseconds to write latency for cross-region quorum and commit wait, which it uses to enforce external consistency via TrueTime. If your p99 latency budget is 200 ms and cross-region writes add 60 ms, nearly a third of the budget is spent on replication alone, so you may need to place write leaders close to their clients or relax consistency. Hybrid approaches exist: perform strongly consistent writes in a primary region and replicate asynchronously to secondary regions. This offers low-latency reads globally while accepting a nonzero RPO (recovery point objective) in disaster recovery scenarios.
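To make the conflict-resolution step concrete, here is a minimal sketch of vector clock comparison, which lets an AP store tell an ordinary overwrite apart from two writes made concurrently on different sides of a partition. The node names "A" and "B" and the helper names `compare` and `merge` are assumptions for illustration only.

```python
from typing import Dict

VectorClock = Dict[str, int]  # node id -> number of updates seen from that node


def compare(a: VectorClock, b: VectorClock) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent' for clock a relative to b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither dominates: a true conflict


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Element-wise maximum, used after the application reconciles a conflict."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


# Same cart updated twice through node A: the second version simply supersedes the first.
print(compare({"A": 1}, {"A": 2}))          # before

# Cart updated via A on one side of a partition and via B on the other.
left, right = {"A": 2}, {"A": 1, "B": 1}
print(compare(left, right))                 # concurrent
print(merge(left, right) == {"A": 2, "B": 1})  # True
```

When the result is "concurrent", the store cannot pick a winner from the clocks alone. It either hands both versions back to the application to merge (for a cart, typically a union of items) or falls back to a policy such as last-write-wins, which is simpler but silently discards one of the updates.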
💡 Key Takeaways
• The CAP theorem forces a choice under partitions: AP systems remain available but accept eventual consistency, while CP systems maintain consistency but may become unavailable.
• AP systems like Dynamo and Cassandra serve requests during partitions with single-digit-millisecond p50 latencies but risk stale or conflicting data until replicas converge.
• CP systems like Spanner and etcd use quorum protocols (Paxos, Raft) to guarantee strong consistency, adding tens of milliseconds for cross-AZ or cross-region writes.
• Choose AP for user-facing ephemeral or mergeable data (feeds, carts, recommendations) where temporary inconsistency is acceptable and low latency is critical.
• Choose CP for financial transfers, inventory, reservations, and audit logs where invariants must never break, accepting higher latency and potential unavailability during partitions.
• Hybrid approaches keep strongly consistent writes in a primary region and replicate asynchronously to other regions, balancing low-latency reads against a nonzero RPO for disaster recovery.
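The nonzero RPO of the hybrid approach can be illustrated with a toy model. This is a sketch under stated assumptions: `Primary`, `Secondary`, and `data_loss_window_s` are hypothetical names, timestamps are plain seconds, and the loss window is defined here as the time between the oldest unreplicated write and the moment the primary fails.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Primary:
    # Ordered log of acknowledged writes: (commit_time_s, value).
    log: List[Tuple[float, str]] = field(default_factory=list)

    def write(self, commit_time_s: float, value: str) -> None:
        self.log.append((commit_time_s, value))


@dataclass
class Secondary:
    applied: int = 0  # number of primary log entries applied so far

    def replicate(self, primary: Primary, up_to: int) -> None:
        """Asynchronously catch up to the first `up_to` entries of the primary log."""
        self.applied = min(up_to, len(primary.log))


def data_loss_window_s(primary: Primary, secondary: Secondary,
                       failure_time_s: float) -> float:
    """Seconds of acknowledged writes lost if the primary fails at failure_time_s."""
    unreplicated = primary.log[secondary.applied:]
    if not unreplicated:
        return 0.0
    return failure_time_s - unreplicated[0][0]


primary, secondary = Primary(), Secondary()
for t, value in ((0.0, "a"), (1.0, "b"), (2.0, "c")):
    primary.write(t, value)
secondary.replicate(primary, up_to=2)  # "c" has not reached the secondary yet

print(data_loss_window_s(primary, secondary, failure_time_s=2.5))  # 0.5
```

The write "c" was acknowledged to the client but never left the primary region, so failing over to the secondary loses it. The RPO you commit to is an upper bound on this window, which in practice means bounding replication lag.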
📌 Examples
Amazon Dynamo (AP) uses sloppy quorum and hinted handoff to remain available during partitions, resolving conflicts via vector clocks or last-write-wins, which makes it suitable for shopping carts.
Google Spanner (CP) enforces external consistency via TrueTime and quorum writes, adding roughly 10 to 50 ms of write latency but ensuring correctness for financial ledgers.
A ride-sharing app might use CP for driver location and trip state (correctness-critical) but AP for user notifications and feed updates (eventual consistency acceptable).
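Cassandra allows tunable consistency per query: use QUORUM for both reads and writes (so that read replicas plus write replicas exceed the replication factor) for stronger, CP-leaning behavior when needed, or ONE or LOCAL_ONE for AP behavior when low latency and availability are the priorities.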
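The rule behind the Cassandra example is that a read is guaranteed to overlap the latest acknowledged write only when R + W > N, where N is the replication factor. The sketch below checks that rule for a few consistency levels. It is plain Python rather than a Cassandra driver call, the function names are assumptions, and it models a single datacenter, so LOCAL_ONE is treated the same as ONE.

```python
def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must respond for a given consistency level."""
    if level in ("ONE", "LOCAL_ONE"):  # single-datacenter simplification
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def read_overlaps_latest_write(read_level: str, write_level: str,
                               replication_factor: int) -> bool:
    """True when R + W > N, so every read set intersects every write set."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor


rf = 3
print(read_overlaps_latest_write("QUORUM", "QUORUM", rf))  # True  (2 + 2 > 3)
print(read_overlaps_latest_write("ONE", "ONE", rf))        # False (1 + 1 <= 3)
print(read_overlaps_latest_write("ONE", "ALL", rf))        # True  (1 + 3 > 3)
```

The trade-off is visible in the numbers: QUORUM on both sides tolerates one unavailable replica out of three, while ONE stays available with a single live replica but may return stale data.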