Service Level Objectives (SLOs) and Scale Planning for Database Selection
Concrete Service Level Objectives (SLOs) transform abstract requirements into measurable constraints that filter database candidates. Define percentile latencies with actual numbers: p50 read under 10 milliseconds, p95 read under 20 milliseconds, p99 read under 50 milliseconds, p99 write under 100 milliseconds. Specify throughput as peak queries per second (QPS) or transactions per second (TPS): 200K read QPS, 50K write QPS. Set concurrent connection limits: 10K active connections per node. Include availability targets like 99.9% uptime (43 minutes downtime per month) or 99.99% (4 minutes per month).
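To make these targets concrete, the sketch below captures them as a small, machine-checkable spec. This is an illustrative Python dataclass, not part of any particular tool; the class and field names, and the 30-day-month assumption, are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class LatencySLO:
    # Percentile latency budgets in milliseconds (values from the text)
    p50_read_ms: float = 10.0
    p95_read_ms: float = 20.0
    p99_read_ms: float = 50.0
    p99_write_ms: float = 100.0

@dataclass(frozen=True)
class ServiceSLO:
    latency: LatencySLO = field(default_factory=LatencySLO)
    peak_read_qps: int = 200_000
    peak_write_qps: int = 50_000
    max_connections_per_node: int = 10_000
    availability: float = 0.999  # 99.9% -> roughly 43 minutes of downtime per month

    def monthly_downtime_budget_minutes(self, minutes_per_month: float = 43_200) -> float:
        """Downtime allowed by the availability target, assuming a 30-day month."""
        return (1.0 - self.availability) * minutes_per_month

slo = ServiceSLO()
print(f"{slo.monthly_downtime_budget_minutes():.0f} minutes/month")  # ~43
```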
Recovery objectives define failure tolerance. Recovery Point Objective (RPO) sets acceptable data loss: an RPO of 1 minute means losing at most 60 seconds of writes during failover. Recovery Time Objective (RTO) caps downtime: an RTO of 15 minutes means restoring service within a quarter hour. These numbers directly constrain replication strategy. An RPO under 1 second requires synchronous replication to at least one replica, adding 5 to 20 milliseconds to write latency. An RPO of 5 minutes allows asynchronous replication with sub-millisecond write acknowledgment but risks losing recent writes on primary failure.
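The RPO-to-replication mapping can be written down explicitly. The function below is a hedged sketch using the thresholds from this paragraph; the name choose_replication and the returned fields are hypothetical, not any vendor's API.

```python
def choose_replication(rpo_seconds: float) -> dict:
    """Map an RPO target to a replication mode and its trade-offs.

    The 1-second threshold and latency figures mirror the text; real systems
    tune these per workload, so treat this as illustrative only.
    """
    if rpo_seconds < 1.0:
        return {
            "mode": "synchronous",                # commit waits for >= 1 replica ack
            "added_write_latency_ms": (5, 20),
            "failover_data_loss": "none beyond in-flight transactions",
        }
    return {
        "mode": "asynchronous",                   # commit acknowledged locally
        "added_write_latency_ms": (0, 1),         # sub-millisecond acknowledgment
        "failover_data_loss": f"up to {rpo_seconds:.0f} seconds of recent writes",
    }

print(choose_replication(0.5)["mode"])   # synchronous
print(choose_replication(300)["mode"])   # asynchronous
```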
Scale planning requires modeling growth trajectories and geographic distribution. If you project 10 terabytes today growing to 100 terabytes in 18 months, plan for 10x headroom. If 60% of traffic originates from US East and 30% from Europe, a single-region database violates latency SLOs for European users, who experience 80 to 150 millisecond cross-Atlantic round trips. Multi-region deployment becomes mandatory. Cost modeling ties this together: estimate total cost of ownership (TCO) including compute at $0.10 per vCPU-hour, storage at $0.10 per gigabyte per month, provisioned Input/Output Operations Per Second (IOPS) at $0.05 per IOPS per month, and cross-region egress at $0.02 per gigabyte.
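Those unit prices support a quick back-of-the-envelope TCO. The sketch below uses the prices quoted above; the resource quantities in the example call are hypothetical, and real pricing varies by provider.

```python
def monthly_tco_usd(vcpus: int, storage_gb: float, provisioned_iops: int,
                    cross_region_egress_gb: float) -> float:
    """Rough monthly TCO using the unit prices from the text.

    $0.10 per vCPU-hour, $0.10 per GB-month of storage, $0.05 per IOPS-month,
    $0.02 per GB of cross-region egress. A 730-hour month is assumed.
    """
    compute = vcpus * 0.10 * 730
    storage = storage_gb * 0.10
    iops = provisioned_iops * 0.05
    egress = cross_region_egress_gb * 0.02
    return compute + storage + iops + egress

# Hypothetical cluster: 64 vCPUs, 10 TB stored, 60K provisioned IOPS,
# 10 TB replicated cross-region per day (~300 TB/month of egress).
print(f"${monthly_tco_usd(64, 10_000, 60_000, 300_000):,.0f} per month")  # ~$14,672
```

In this configuration the egress line alone ($6,000) exceeds compute ($4,672), which is the "hidden cost" pattern called out in the takeaways below.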
Netflix targets p99 latencies in the low tens of milliseconds within a region for user activity queries at peak loads exceeding 1 million QPS globally. Their multi-region, active-active Cassandra deployment trades strong consistency for availability, accepting cross-region replication lag ranging from 100 milliseconds to several seconds. This aligns with their SLO: temporary staleness in viewing history is acceptable, but local writes must be fast and always available, even during region failures.
💡 Key Takeaways
•Percentile latencies drive architecture: p99 under 50 milliseconds eliminates databases with high tail latencies from garbage collection pauses or compaction stalls; p95 under 20 milliseconds requires SSD-backed storage and access paths that avoid disk seeks
•RPO and RTO constrain replication: RPO under 1 second forces synchronous replication, adding 5 to 20 milliseconds of write latency; RPO of 5 minutes allows async replication with sub-millisecond writes but risks data loss during failover
•Geographic distribution impacts consistency: 60% US traffic and 30% Europe traffic with p95 latency under 30 milliseconds requires multi-region deployment; a single region adds 80 to 150 milliseconds of cross-Atlantic round-trip time (RTT)
•Throughput headroom prevents saturation: operate at 50 to 60% of capacity during normal load, leaving buffer for traffic spikes and node failures; a 200K QPS target requires provisioning for roughly 350K QPS peak with N+2 node redundancy (see the sketch after this list)
•TCO modeling includes hidden costs: cross-region egress at $0.02 per gigabyte can exceed compute costs for read-heavy workloads replicating 10 terabytes daily; provisioned IOPS at $0.05 per IOPS per month adds $3,000 per month for a 60K IOPS database
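As referenced in the throughput bullet above, here is a small provisioning sketch. The 60% utilization target and the 25K QPS per-node capacity are assumptions chosen to line up with the numbers in the text.

```python
import math

def nodes_to_provision(target_qps: int, per_node_qps: int,
                       utilization_target: float = 0.6, spare_nodes: int = 2) -> int:
    """Size a cluster so normal load sits near the utilization target, plus N+2.

    per_node_qps and the 0.6 utilization target are illustrative assumptions.
    """
    required_capacity = target_qps / utilization_target   # 200K / 0.6 ~= 333K QPS
    return math.ceil(required_capacity / per_node_qps) + spare_nodes

# 200K QPS at 60% utilization needs ~333K QPS of capacity; at 25K QPS per node
# that is 14 serving nodes, plus 2 spares to survive node failures.
print(nodes_to_provision(200_000, 25_000))  # 16
```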
📌 Examples
E-commerce inventory system SLOs: p99 write under 50 milliseconds for stock updates, 50K write QPS peak during flash sales, strong consistency to prevent overselling, RPO of 0 seconds (synchronous replication), RTO under 1 minute (hot standby); results in choosing regional PostgreSQL with a synchronous replica over eventually consistent Cassandra
Social feed service SLOs: p95 read under 20 milliseconds for timeline fetches, 500K read QPS globally, eventual consistency acceptable (users tolerate 1 to 2 seconds of staleness), RPO of 10 seconds, RTO of 5 minutes; results in choosing multi-region DynamoDB with global tables over Spanner to minimize read latency
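The two example SLO sets above can be expressed as hard constraints and used to filter candidates. The candidate profiles and the shortlist helper below are hypothetical illustrations under assumed figures, not vendor benchmarks or a real library API.

```python
# Hypothetical candidate profiles; the figures are illustrative, not benchmarks.
CANDIDATES = {
    "regional PostgreSQL + synchronous replica": {
        "strong_consistency": True, "multi_region": False,
        "p99_write_ms": 20, "rpo_seconds": 0,
    },
    "multi-region Cassandra (eventual consistency)": {
        "strong_consistency": False, "multi_region": True,
        "p99_write_ms": 10, "rpo_seconds": 10,
    },
    "multi-region DynamoDB global tables": {
        "strong_consistency": False, "multi_region": True,
        "p99_write_ms": 15, "rpo_seconds": 10,
    },
}

def shortlist(needs_strong_consistency: bool, needs_multi_region: bool,
              max_p99_write_ms: int, max_rpo_seconds: int) -> list:
    """Keep only candidates that satisfy every hard SLO constraint."""
    return [
        name for name, c in CANDIDATES.items()
        if (not needs_strong_consistency or c["strong_consistency"])
        and (not needs_multi_region or c["multi_region"])
        and c["p99_write_ms"] <= max_p99_write_ms
        and c["rpo_seconds"] <= max_rpo_seconds
    ]

# E-commerce inventory: strong consistency, single region OK, RPO 0, p99 write < 50 ms
print(shortlist(True, False, 50, 0))
# Social feed: multi-region required, eventual consistency acceptable, RPO 10 s
print(shortlist(False, True, 50, 10))
```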