Production Database Selection: Amazon Dynamo vs Google Spanner
Amazon Dynamo and Google Spanner sit at opposite ends of the consistency-versus-availability tradeoff spectrum, each optimized for fundamentally different workload requirements. Dynamo prioritizes availability and low latency through eventual consistency and leaderless replication. Configured with replication factor N=3, read quorum R=2, and write quorum W=2, it achieves sub-10-millisecond p99 latencies for both reads and writes during normal operation. During network partitions, Dynamo remains write-available by accepting writes at any replica, using vector clocks to track causality and resolve conflicts later. Amazon's shopping cart uses this model: temporary divergence, where two replicas briefly show different item counts, is acceptable because the user experience prioritizes fast updates and availability over immediate consistency.
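The vector-clock reconciliation described above can be sketched in a few lines. This is an illustrative Python sketch, not Amazon's implementation: each stored version carries a vector clock, and a read that finds causally concurrent versions returns both "siblings" to the client for reconciliation (the `VersionedValue` type and node names are made up).

```python
from dataclasses import dataclass

@dataclass
class VersionedValue:
    value: str
    clock: dict  # vector clock: replica id -> per-replica counter

def happens_before(a: dict, b: dict) -> bool:
    """True if clock a causally precedes clock b (a <= b componentwise, a != b)."""
    return all(a.get(n, 0) <= b.get(n, 0) for n in a) and a != b

def reconcile(v1: VersionedValue, v2: VersionedValue) -> list:
    """Keep the causally newer version; if neither dominates, the writes
    were concurrent, so surface both siblings for client-side merge."""
    if happens_before(v1.clock, v2.clock):
        return [v2]
    if happens_before(v2.clock, v1.clock):
        return [v1]
    return [v1, v2]  # conflict: e.g. two carts updated during a partition
```

A shopping-cart client receiving two siblings would typically merge them (union of items), which is why deleted items can occasionally reappear in this model.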
Spanner takes the opposite approach, providing externally consistent distributed transactions using TrueTime, Google's globally synchronized clock infrastructure with uncertainty bounds under 7 milliseconds. A committing transaction waits out the commit uncertainty interval and pays Paxos replication round trips, adding 5 to 10 milliseconds within a region and 50 to 100+ milliseconds for cross-region transactions. This cost buys strong guarantees: external consistency means that if transaction T1 commits before T2 starts, T2 sees T1's writes, matching real-time ordering. Google's advertising billing platform uses Spanner because financial correctness is non-negotiable: the system accepts higher write latency to guarantee that no advertiser is double-charged and all billing events are strictly ordered.
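The commit-wait idea can be sketched as follows. This is a hypothetical Python model, not the TrueTime API: the 7-millisecond bound from the text stands in for the real uncertainty measurement, and the coordinator blocks until the chosen commit timestamp is unambiguously in the past everywhere.

```python
import time

EPSILON_MS = 7.0  # assumed clock uncertainty bound (the text's sub-7 ms figure)

def tt_now() -> tuple:
    """TrueTime-style interval [earliest, latest]: true time lies inside it."""
    now_ms = time.time() * 1000.0
    return (now_ms - EPSILON_MS, now_ms + EPSILON_MS)

def commit_wait(commit_ts_ms: float) -> None:
    """Block until commit_ts is guaranteed past on every clock, so any
    transaction that starts afterward gets a strictly larger timestamp."""
    while tt_now()[0] <= commit_ts_ms:
        time.sleep(0.001)
```

Waiting out roughly 2x the uncertainty bound per commit is exactly the 5-to-10-millisecond intra-region cost the paragraph quotes; it is the price of making timestamp order match real-time order.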
The scaling characteristics differ dramatically. Dynamo scales horizontally through consistent hashing with virtual nodes, automatically distributing data as nodes are added; rebalancing happens incrementally with minimal impact, and each node handles both reads and writes independently. Spanner scales through range-based sharding with Paxos-replicated tablets, requiring more coordination but supporting complex SQL queries with joins across shards. Meta's social graph takes a middle-ground approach with TAO (The Associations and Objects), a distributed data store providing an eventually consistent graph API backed by sharded MySQL. TAO's cache tier serves billions of requests per second at sub-millisecond latencies, absorbing 99%+ of read traffic, while MySQL handles cache misses and all writes, with asynchronous cross-region replication introducing hundreds of milliseconds of lag.
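Consistent hashing with virtual nodes can be sketched briefly. This is an illustrative Python sketch (the 64-virtual-node count and node names are arbitrary choices, not values from any real system): each server claims many small arcs of the hash ring, so adding a server takes over only its proportional share of keys while the rest keep their old owner.

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, vnodes: int = 64):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, node) virtual-node points

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        # Each physical node owns `vnodes` points scattered around the ring.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def get_node(self, key: str) -> str:
        # A key belongs to the first virtual node clockwise from its hash.
        idx = bisect.bisect(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]
```

Going from three nodes to four moves roughly a quarter of the keys, which is the "incremental rebalancing with minimal impact" property the paragraph describes.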
Choosing between these patterns depends on concrete requirements. If your p99 latency budget is under 20 milliseconds, traffic exceeds 500K QPS, and read-your-writes consistency within a user session suffices (no global ordering), Dynamo-style systems win. If you need multi-record transactions, foreign-key constraints, and strong consistency across geographic regions despite 50-to-100-millisecond write latencies, Spanner-style systems are appropriate. For social graphs or content feeds where relationships dominate and 99%+ of reads hit cache, TAO's cache-focused eventual-consistency model provides the best cost efficiency.
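These criteria can be folded into a small triage function. This is illustrative only: the thresholds are the ones quoted above, the function name and parameters are made up, and real selection involves many more dimensions (cost, team expertise, existing infrastructure).

```python
def choose_datastore(p99_budget_ms: float, qps: int,
                     needs_multi_record_txns: bool,
                     needs_global_ordering: bool,
                     read_cache_hit_ratio: float = 0.0) -> str:
    """Map the section's quantitative criteria onto the three patterns."""
    # Correctness requirements dominate: transactions or ordering -> Spanner-style.
    if needs_multi_record_txns or needs_global_ordering:
        return "Spanner-style"
    # Read-dominated graph workloads where almost everything hits cache -> TAO-style.
    if read_cache_hit_ratio >= 0.99:
        return "TAO-style"
    # Tight latency budget at high QPS, session-level consistency -> Dynamo-style.
    if p99_budget_ms < 20 and qps > 500_000:
        return "Dynamo-style"
    return "no clear winner: prototype and measure"
```

Note the ordering of the checks: correctness constraints are evaluated first because latency can be bought with hardware, but consistency guarantees cannot be retrofitted onto an eventually consistent store.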
💡 Key Takeaways
• Dynamo achieves p99 under 10 milliseconds and remains write-available during partitions using quorum parameters (N=3, R=2, W=2) and vector-clock conflict resolution, suitable for shopping carts and session stores where temporary divergence is acceptable
• Spanner provides external consistency with TrueTime at the cost of 5 to 10 milliseconds intra-region and 50 to 100+ milliseconds cross-region commit latency, necessary for financial transactions and inventory management requiring strict ordering
• Scaling patterns diverge: Dynamo uses consistent hashing for automatic rebalancing with minimal coordination overhead, while Spanner uses range sharding with Paxos coordination, enabling SQL joins but requiring more complex split and merge operations
• Meta's TAO demonstrates a third option: a cache tier serving billions of QPS at sub-millisecond latencies absorbs 99%+ of reads, while MySQL handles writes with asynchronous replication introducing 100+ milliseconds of lag, optimized for social-graph traversals
• Selection criteria are quantitative: a p99 latency budget under 20 milliseconds with 500K+ QPS and session-level consistency favors Dynamo; multi-record transactions across regions with strong consistency despite 100-millisecond latency favor Spanner
📌 Examples
Amazon Prime Video watch history uses DynamoDB (a Dynamo variant): it accepts eventual consistency, where a video marked watched on TV might not show as watched on mobile for 1 to 2 seconds, in exchange for sub-10-millisecond writes and availability during AWS Availability Zone failures
Google Cloud billing uses Spanner: it requires external consistency so that invoice totals across multiple services match real-time transaction order, accepting 50 to 100 milliseconds of write latency for cross-region billing events to guarantee financial correctness and auditability