Workload Modeling and Empirical Database Benchmarking
Effective database selection requires empirical validation through benchmarking that reproduces production workload characteristics. Model request distributions using real access patterns: Zipfian distributions where the top 10% of keys account for 80% of traffic, object size distributions with p50 at 1 kilobyte and p95 at 100 kilobytes, and read-to-write ratios such as 70% reads to 30% writes during normal operation shifting to 50/50 during peak events. Synthetic uniform-random workloads miss critical behaviors such as hot-key contention and cache effectiveness.
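As an illustration, here is a minimal Python sketch of such a workload generator. The keyspace size, Zipf exponent, and log-normal size parameters are assumptions for the example (the exponent of ~0.9 is calibrated so the top 10% of keys carry roughly 80% of requests, and the size distribution is tuned to the p50 ≈ 1 KB / p95 ≈ 100 KB shape above); substitute distributions measured from your own production traffic.

```python
"""Sketch of a workload generator mimicking production access patterns.

Assumptions (not from the text above): a 10,000-key keyspace, a Zipf
exponent of ~0.9, and log-normal object sizes tuned to p50 ~1 KB / p95 ~100 KB.
"""
import numpy as np

rng = np.random.default_rng(42)

NUM_KEYS = 10_000
ZIPF_EXPONENT = 0.9   # calibrate until the top 10% of keys carry ~80% of requests
READ_RATIO = 0.70     # 70/30 read/write mix; drop to 0.50 for peak-event runs

# Bounded Zipfian over the keyspace: weight of the key at rank r is 1 / r^s.
ranks = np.arange(1, NUM_KEYS + 1)
weights = 1.0 / ranks**ZIPF_EXPONENT
key_probs = weights / weights.sum()

# Log-normal object sizes: median = exp(mu), p95 = exp(mu + 1.645 * sigma).
# mu = ln(1024) and sigma = ln(100) / 1.645 give p50 ~1 KB and p95 ~100 KB.
MU, SIGMA = np.log(1024), np.log(100) / 1.645


def next_request():
    """Draw one synthetic request: (operation, key, payload size in bytes)."""
    key = int(rng.choice(NUM_KEYS, p=key_probs))
    op = "read" if rng.random() < READ_RATIO else "write"
    size = int(rng.lognormal(MU, SIGMA))
    return op, key, size


if __name__ == "__main__":
    sample = [next_request() for _ in range(100_000)]
    reads = sum(1 for op, _, _ in sample if op == "read")
    sizes = np.array([s for _, _, s in sample])
    print(f"read fraction: {reads / len(sample):.2f}")
    print(f"size p50: {np.percentile(sizes, 50):.0f} B, p95: {np.percentile(sizes, 95):.0f} B")
```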
Measure percentile latencies under sustained load at your target throughput. Run benchmarks at 100K queries per second if that is your production target, not at 10K QPS with extrapolation. Capture p50, p95, p99, and p99.9 latencies separately: a system with a p50 of 5 milliseconds and a p99 of 200 milliseconds delivers a very different user experience than one with a p50 of 8 milliseconds and a p99 of 15 milliseconds. Monitor secondary metrics: CPU utilization, disk input/output operations per second (IOPS), network throughput, memory pressure, garbage collection pauses, compaction backlog depth, and replication lag. A database saturating CPU at 90% under target load will fail during traffic spikes.
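A minimal sketch of percentile reporting under paced load follows. Here `run_query` is a hypothetical stand-in for your database client call, and the target rate and duration are illustrative defaults rather than figures from this section; a real harness would use many workers or an async client to sustain production-level rates.

```python
"""Minimal sketch of percentile reporting for a sustained-load benchmark run."""
import time
import numpy as np


def run_query() -> None:
    """Placeholder for one database operation (replace with a real client call)."""
    time.sleep(0.0002)  # simulate ~0.2 ms of server-side work


def benchmark(target_qps: int = 1_000, duration_s: int = 10) -> None:
    interval = 1.0 / target_qps
    latencies_ms = []
    next_fire = time.perf_counter()
    deadline = next_fire + duration_s

    while time.perf_counter() < deadline:
        intended = next_fire
        now = time.perf_counter()
        if now < intended:
            time.sleep(intended - now)
        run_query()
        done = time.perf_counter()
        # Measure from the *intended* send time so backpressure in this
        # single-threaded loop still shows up as latency (coordinated-omission
        # correction). Use multiple workers or an async client to actually
        # hold the target rate against a real database.
        latencies_ms.append((done - intended) * 1000)
        next_fire += interval

    for p in (50, 95, 99, 99.9):
        print(f"p{p}: {np.percentile(latencies_ms, p):.1f} ms")
    print(f"achieved throughput: {len(latencies_ms) / duration_s:.0f} QPS")


if __name__ == "__main__":
    benchmark()
```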
Inject failures to validate operational characteristics. Terminate a primary node mid-transaction and measure failover time: does the system elect a new leader in under 30 seconds as claimed, or does it take 3 minutes, causing a prolonged outage? Induce network partitions between data centers and observe whether the system maintains availability (AP in CAP terms) or consistency (CP), and measure the actual recovery time and data loss (RPO). Throttle disk I/O to simulate cloud storage performance degradation and measure the impact on write latency and compaction behavior. Netflix's Chaos Monkey approach continuously injects failures in production; pre-production benchmarking should replicate this validation.
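One way to turn the failover question into a measurement is sketched below. `kill_primary` and `try_write` are hypothetical hooks you would wire to your own orchestration (stopping a container or VM) and to your database client; the timeout and probe interval are assumptions.

```python
"""Sketch of a failover-time probe: kill the primary, then poll writes until
they succeed again, and report how long recovery actually took."""
import time


def kill_primary() -> None:
    """Hypothetical hook: stop the current primary node (container, VM, process)."""
    raise NotImplementedError


def try_write(key: str, value: str) -> bool:
    """Hypothetical hook: attempt one write via your client, return True on success."""
    raise NotImplementedError


def measure_failover(timeout_s: float = 300.0, probe_interval_s: float = 0.1) -> float:
    kill_primary()
    start = time.perf_counter()
    while time.perf_counter() - start < timeout_s:
        if try_write("failover-probe", str(time.time())):
            return time.perf_counter() - start  # seconds until writes recover
        time.sleep(probe_interval_s)
    raise TimeoutError("writes did not recover within the timeout")


# Example: elapsed = measure_failover(); compare against the claimed <30 s failover.
```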
Project growth trajectories to test scalability limits. If you have 1 terabyte of data today growing at 50% per quarter, benchmark with 5 to 10 terabytes to validate sharding and rebalancing behavior at future scale. Measure the cost of adding capacity: how long does adding a new shard take, what rebalancing traffic does it generate, and how much does query latency degrade during the migration? These operational costs compound over time. A database requiring 6 hours of read-only mode to add capacity is unsuitable for 24/7 services.
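The arithmetic behind the 5 to 10 terabyte figure is worth making explicit; the short sketch below simply compounds 50% quarterly growth from the 1 terabyte starting point.

```python
"""Back-of-the-envelope growth projection: 1 TB today, growing 50% per quarter."""
current_tb = 1.0
quarterly_growth = 0.50

for quarter in range(1, 9):
    current_tb *= 1 + quarterly_growth
    print(f"after quarter {quarter}: {current_tb:.1f} TB")

# 1 TB * 1.5^4 ≈ 5.1 TB after one year and 1.5^6 ≈ 11.4 TB after 18 months,
# so a 5-10 TB benchmark dataset covers roughly the next 12-18 months of growth.
```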
💡 Key Takeaways
•Reproduce production distributions: Zipfian access patterns where top 10% of keys see 80% of traffic, object size distributions with p50 at 1 kilobyte and p95 at 100 kilobytes, read to write ratios shifting from 70/30 to 50/50 during peak events
•Benchmark at target throughput: run at the 100K QPS production target rather than at 10K with extrapolation; measure p50/p95/p99/p99.9 separately, since a p50 of 5 milliseconds with a p99 of 200 milliseconds differs drastically from a p50 of 8 milliseconds with a p99 of 15 milliseconds
•Monitor resource saturation: track CPU utilization, disk IOPS, network throughput, garbage collection pauses, and compaction backlog; a database at 90% CPU under target load will fail during traffic spikes, so keep headroom by running at 50 to 60% utilization
•Failure injection validates operational claims: terminate the primary node and measure actual failover time (claimed 30 seconds vs observed 3 minutes), induce a cross-region network partition to test consistency versus availability behavior, throttle disk I/O to measure write latency degradation
•Scale testing prevents future bottlenecks: benchmark with 5 to 10 terabytes if current data is 1 terabyte growing 50% per quarter, measure shard rebalancing overhead and query latency impact during migration; a database requiring 6 hours of read-only mode to add capacity is unsuitable for 24/7 services
📌 Examples
E-commerce benchmark reproduced the Black Friday workload: Zipfian distribution with the top 100 products accounting for 60% of traffic; object sizes of p50 1.5 kilobytes (product JSON) and p95 80 kilobytes (with images); 200K QPS peak with 80% reads and 20% writes. DynamoDB achieved p99 read latency of 8 milliseconds and p99 write latency of 12 milliseconds at 85% provisioned capacity, while PostgreSQL saturated CPU at 120K QPS with p99 spiking to 300 milliseconds
Video streaming service tested region failure: a 3-region active-active Cassandra deployment; simulated loss of the entire US East region; measured cross-region failover in 45 seconds with an RPO of 2 seconds (240 lost writes); p99 latency increased from 15 milliseconds to 35 milliseconds during recovery due to increased cross-region traffic; validated against the SLO of a 1-minute RTO and a 5-second RPO