AP Systems: Choosing Availability Over Consistency

When Uptime Trumps Perfect Consistency:

AP systems continue serving requests during network partitions, accepting that different nodes may return stale data or accept conflicting writes. The system prioritizes staying responsive over maintaining linearizability, then reconciles divergent state after the partition heals.

This approach maximizes uptime, keeps tail latencies low even during failures, and handles traffic spikes gracefully. But it introduces anomalies: stale reads, write conflicts, lost updates under naive merge strategies, and temporary invariant violations.

How Dynamo-Style Systems Work:

Amazon's original Dynamo paper (the ancestor of DynamoDB) described a shopping cart system using N=3 replicas with a sloppy quorum, commonly configured as R=2, W=2. Crucially, if a partition or failure makes some of a key's usual replicas unreachable, the system does not reject the write. It accepts the write on other healthy nodes, which store it with a hint and forward it to the intended replica once that replica is reachable again (hinted handoff).
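The sloppy quorum and hinted handoff described above can be sketched in a few lines. This is a minimal illustration, not Dynamo's actual implementation: the `Node` class, `sloppy_write`, and `hand_off` are hypothetical names, and the `ring` list is assumed to already be ordered as the key's preference list.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)   # key -> value for keys this node owns
    hints: list = field(default_factory=list)  # (intended_owner, key, value) held for others


def sloppy_write(ring, key, value, n=3, w=2):
    """Write to the first n reachable nodes, walking past unreachable ones."""
    intended = ring[:n]                               # the key's usual replicas
    healthy = [node for node in ring if node.up][:n]  # where the write actually lands
    missed = [node for node in intended if not node.up]
    for node in healthy:
        if node in intended:
            node.data[key] = value
        else:
            # Stand-in replica: keep the write with a hint naming the real owner.
            node.hints.append((missed.pop(0).name, key, value))
    return len(healthy) >= w                          # acknowledged once w nodes hold it


def hand_off(ring):
    """Stand-ins deliver hinted writes to intended owners that are reachable again."""
    by_name = {node.name: node for node in ring}
    for node in ring:
        still_pending = []
        for owner, key, value in node.hints:
            if by_name[owner].up:
                by_name[owner].data[key] = value
            else:
                still_pending.append((owner, key, value))
        node.hints = still_pending


# Usage: replica C is unreachable, so node D stands in until C recovers.
ring = [Node("A"), Node("B"), Node("C"), Node("D")]
ring[2].up = False
accepted = sloppy_write(ring, "cart:42", ["book"])
ring[2].up = True
hand_off(ring)
```

After `hand_off` runs, node C holds the cart and node D's hint list is empty. A real system would also need versioning (for example vector clocks) so that a handed-off write does not overwrite newer data on the recovered replica.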

The paper cites service-level targets at the 99.9th percentile, for example a response within 300 ms for 99.9% of requests, and the design aims to hold them even during node or link failures. The cart must always load because availability directly impacts revenue. If a customer adds items to their cart on both sides of a partition, the system merges both carts when the partition heals, accepting that a removed item may occasionally reappear rather than risk losing an addition.

✓ In Practice: DynamoDB Global Tables span multiple regions with asynchronous replication, typically propagating writes across regions in under a second to a few seconds. Concurrent writes to the same item in different regions are resolved with last-writer-wins. Strongly consistent reads are offered within a region, but reads in other regions are eventually consistent to maintain availability.

Conflict Resolution Strategies:

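Last-writer-wins (LWW) is simplest, but it keeps only one of any pair of concurrent updates and silently discards the other; with clock skew, the update it discards may even be the later one. For counters, sets, and maps, Conflict-Free Replicated Data Types (CRDTs) provide merge functions that are commutative, associative, and idempotent, so replicas converge no matter in which order they exchange state. A shopping cart can be modeled as a set CRDT in which removes are recorded as markers (tombstones) rather than deleted outright, allowing deterministic merging.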
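To make the difference concrete, here is a small sketch that contrasts an LWW merge with a set-based cart. The names `lww_merge` and `CartSet` are illustrative assumptions, and the cart is modeled as a two-phase set (one simple set CRDT), which is not necessarily what any particular production system uses.

```python
def lww_merge(a, b):
    """Last-writer-wins: keep the (timestamp, value) pair with the larger timestamp."""
    return a if a[0] >= b[0] else b


class CartSet:
    """Two-phase set: adds go into `added`, removes are recorded in `removed`."""

    def __init__(self, added=(), removed=()):
        self.added = frozenset(added)
        self.removed = frozenset(removed)

    def add(self, item):
        return CartSet(self.added | {item}, self.removed)

    def remove(self, item):
        # The item stays in `added`; the remove is only marked.
        return CartSet(self.added, self.removed | {item})

    def merge(self, other):
        # Set union is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order.
        return CartSet(self.added | other.added, self.removed | other.removed)

    def items(self):
        return self.added - self.removed


# Two replicas diverge from the same starting cart during a partition.
base = CartSet().add("book")
us = base.add("pen")
eu = base.add("lamp").remove("book")

merged = us.merge(eu)
print(sorted(merged.items()))  # ['lamp', 'pen']

# LWW on whole-cart values keeps only the later write, so "pen" is lost.
print(lww_merge((100, ["book", "pen"]), (101, ["lamp"])))  # (101, ['lamp'])
```

A known limitation of the two-phase set is that a removed item can never be re-added; richer designs such as observed-remove sets lift that restriction at the cost of extra metadata.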

Cassandra, inspired by Dynamo, offers tunable consistency per query. With consistency level ONE, you get single-digit millisecond median latency and high availability, but expect stale reads. With consistency level QUORUM for both reads and writes (R=2, W=2 for N=3), read and write quorums overlap (R + W > N), so reads see the latest acknowledged write whenever quorums are reachable. Tail latency grows, however, because each request waits for a majority of replicas rather than only the fastest one. Under partition, a node that cannot reach a quorum can still serve requests at consistency level ONE.
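The reason QUORUM reads observe QUORUM writes is the overlap condition R + W > N. The sketch below checks that rule and confirms it by enumerating every possible read and write quorum; the function names are illustrative and not part of any Cassandra API.

```python
from itertools import combinations


def quorums_overlap(n, r, w):
    """Rule of thumb: read and write quorums always intersect when r + w > n."""
    return r + w > n


def overlap_by_enumeration(n, r, w):
    """Check every read quorum against every write quorum for a shared replica."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


print(quorums_overlap(3, 2, 2))  # True: QUORUM reads with QUORUM writes
print(quorums_overlap(3, 1, 1))  # False: ONE with ONE can miss the latest write
```

With N=3, reading at ONE only overlaps every write quorum if writes go to all three replicas (W=3), which trades write availability for read speed.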

When to Choose AP:

Select AP for feeds, carts, analytics, counters, and caches where occasional anomalies are acceptable and user experience can tolerate "pending" states or compensation logic. Design your data model to make conflicts rare or easily mergeable, and build UX that handles temporary inconsistencies gracefully.

💡 Key Takeaways

• AP systems keep accepting writes during partitions, even when a key's usual replicas are unreachable, using hinted handoff and anti-entropy to reconcile after the partition heals, prioritizing availability over consistency

• Amazon Dynamo (shopping cart) targeted 99.9th percentile latencies of a few hundred milliseconds with N=3, commonly R=2, W=2, and used a sloppy quorum with hinted handoff so that writes were still accepted when some replicas were unreachable, keeping carts always available

• Last-writer-wins conflict resolution is simplest but keeps only one of any pair of concurrent updates; CRDTs (Conflict-Free Replicated Data Types) provide commutative merges for counters, sets, and maps that handle concurrency correctly

• Cassandra offers tunable consistency per query: consistency level ONE delivers single-digit ms median latency with high availability but stale reads, while QUORUM reads and writes see the latest acknowledged write when quorums are reachable, at the cost of higher tail latency

• DynamoDB Global Tables use asynchronous multi-region replication with last-writer-wins, propagating writes across regions in under a second to a few seconds while maintaining availability during region failures

📌 Examples

Shopping cart using AP: a customer adds item A in the US region and item B in the EU region during a transatlantic partition. Both writes succeed locally. After the partition heals, the merged cart contains both items A and B.

Social media timeline: a user posts an update that takes 2 seconds to propagate across regions. Followers in distant regions briefly see a stale timeline, but all eventually see the new post. Availability matters more here than instant consistency.

Cassandra at consistency level ONE: a read hits the nearest replica and returns in 3 ms median, but the result might be stale. The same query at QUORUM waits for 2 of 3 replicas and returns in 8 ms median with the latest acknowledged write.

DynamoDB Global Tables: an application writes to the US East region, and the write replicates to EU West in 800 ms. During a US East failure, EU West continues serving reads and writes, resolving conflicts via last-writer-wins.
