Capacity Sizing and Latency Budgeting Across System Tiers

Translating per-second load estimates into actual instance counts and latency allocations requires understanding both vertical capacity (what one instance can handle) and horizontal scaling math. For web and application tiers, a typical server with 32 GB RAM might comfortably handle 5,000 to 10,000 lightweight requests per second at p95 latency below 50ms while maintaining 60 to 70 percent CPU utilization. If your peak load calculation shows 200,000 requests per second, you would need 20 to 40 instances just to meet load, but production systems add a safety factor of 1.5 to 2.0x for headroom and N+1 fault tolerance. That means deploying 30 to 80 instances so that traffic from a failed zone can be absorbed without breaching SLOs.

Latency budgeting starts with defining your SLO target, such as a p95 of 200ms end to end, then allocating time to each hop in the critical path. A reasonable breakdown might be: CDN or edge layer 20 to 40ms, load balancer 5 to 10ms, application server 50 to 80ms, cache lookup 5 to 10ms, database query 15 to 40ms, and network hops across tiers totaling 20 to 40ms. These allocations sum to 115ms at best and 220ms at worst, so size each hop toward the low end of its range and reserve 10 to 20 percent of the overall budget as slack for variability and unexpected delays. If any component consistently exceeds its budget during load testing, you must either optimize that component or redistribute budget from faster layers.

Little's Law provides a critical sanity check: the number of in-flight requests equals arrival rate times service time (L = λ × W). If your application tier receives 10,000 requests per second and p95 service time is 200ms (0.2 seconds), the tier carries up to 2,000 concurrent in-flight requests. Your thread pools, connection pools, and memory allocations must accommodate this concurrency without head-of-line blocking or resource exhaustion. Undersizing connection pools causes requests to queue and latency to spike; oversizing leads to context-switching overhead and increased memory pressure. Real systems like Facebook historically sized app servers with up to 256 GB RAM and multi-TB storage to handle these concurrent workloads plus caching layers.
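A minimal sketch of this arithmetic in Python, using the illustrative numbers above; the function names and the per-hop budget values are assumptions for demonstration, not measured figures:

```python
import math

def instances_needed(peak_rps: float, per_instance_rps: float,
                     safety_factor: float) -> int:
    """ceil(peak load / per-instance capacity), then apply the headroom multiplier."""
    return math.ceil(math.ceil(peak_rps / per_instance_rps) * safety_factor)

# Capacity sizing for the 200,000 RPS example above.
low = instances_needed(200_000, 10_000, 1.5)   # optimistic capacity, 1.5x margin -> 30
high = instances_needed(200_000, 5_000, 2.0)   # conservative capacity, 2.0x margin -> 80
print(low, high)

# Latency budget: per-hop worst-case allocations against a 200 ms p95 SLO.
budget_ms = {"cdn": 40, "lb": 10, "app": 80, "cache": 10, "db": 40, "network": 40}
worst_case = sum(budget_ms.values())
print(worst_case)  # 220 ms -> over the SLO if every hop hits its ceiling at once

# Little's Law sanity check: L = lambda * W.
arrival_rps, service_time_s = 10_000, 0.200
in_flight = arrival_rps * service_time_s
print(in_flight)  # 2000.0 concurrent requests the tier must hold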
💡 Key Takeaways
Web tier instance capacity: typical server handles 5,000 to 10,000 lightweight requests per second at p95 under 50ms with 60 to 70 percent CPU, requiring 20 to 40 instances for 200,000 RPS peak before safety factors
Safety margins require 1.5 to 2.0x multiplier for headroom and N+1 fault tolerance, so 200,000 RPS load becomes 30 to 80 deployed instances to survive zone failures without SLO violations
Latency budget allocation for p95 200ms SLO: CDN 20 to 40ms, load balancer 5 to 10ms, app server 50 to 80ms, cache 5 to 10ms, database 15 to 40ms, network 20 to 40ms, reserving 10 to 20 percent slack
Little's Law validation: 10,000 requests per second at 200ms service time creates 2,000 concurrent in-flight requests, requiring thread pools and connection pools sized to avoid head-of-line blocking
Storage growth calculation: writes per day times object size times replication factor times encoding overhead, adding 20 to 50 percent for metadata, compaction, and secondary indexes (see the sketch after this list)
Multi-region writes add at least one WAN round-trip time (50 to 100ms) to synchronous replication, forcing latency budget redistribution or acceptance of RPO greater than zero with async replication
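The storage-growth takeaway lends itself to the same treatment. A minimal sketch, assuming a hypothetical workload of 100 million 1 KB writes per day with 3x replication and a 30 percent metadata allowance:

```python
def daily_storage_bytes(writes_per_day: float, object_bytes: float,
                        replication: float = 3.0, encoding_overhead: float = 1.0,
                        metadata_overhead: float = 1.3) -> float:
    """Writes x object size x replication x encoding overhead, plus 20-50%
    for metadata, compaction, and secondary indexes (30% chosen here)."""
    return (writes_per_day * object_bytes * replication
            * encoding_overhead * metadata_overhead)

# Hypothetical workload: 100M writes/day of 1 KB objects.
per_day = daily_storage_bytes(100e6, 1024)
print(f"{per_day / 1e12:.2f} TB/day, {per_day * 365 / 1e12:.0f} TB/year")
```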
📌 Examples
Database tier sizing: Application generates 50,000 writes per second peak with 1 KB average payload. Database must sustain 50 MB per second ingest before replication. With 3x replication and 30 percent metadata overhead, actual disk write rate is 195 MB per second. If each database shard handles 20,000 writes per second, you need at least 3 shards plus 1 for N+1 redundancy, totaling 4 shards.
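A quick sanity check of this example's arithmetic, as a sketch; all figures are taken from the example itself (1 KB treated as 1,000 bytes so the 50 MB per second ingest matches):

```python
import math

writes_per_sec = 50_000
payload_bytes = 1_000          # 1 KB average payload
replication = 3
metadata_overhead = 1.3        # +30% metadata

ingest_mb_s = writes_per_sec * payload_bytes / 1e6
disk_mb_s = ingest_mb_s * replication * metadata_overhead
shards = math.ceil(writes_per_sec / 20_000) + 1   # +1 for N+1 redundancy

print(ingest_mb_s, disk_mb_s, shards)  # 50.0 MB/s ingest, 195.0 MB/s disk, 4 shards
```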
Cache memory allocation: Application has 5 TB dataset with 25 percent hot working set. Cache requires 1.25 TB per replica for hot data. With 3x replication and 20 percent overhead for data structures, total cache cluster memory is approximately 4.5 TB. If each cache node has 64 GB RAM, deploy at least 72 nodes.
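The same check for the cache example; note the binary conversion (1 TB = 1,024 GB) is what reproduces the 72-node figure:

```python
import math

dataset_tb = 5.0
hot_fraction = 0.25
replication = 3
structure_overhead = 1.2       # +20% for data structures
node_ram_gb = 64

hot_tb = dataset_tb * hot_fraction                      # 1.25 TB per replica
cluster_tb = hot_tb * replication * structure_overhead  # 4.5 TB total
nodes = math.ceil(cluster_tb * 1024 / node_ram_gb)      # 4608 GB / 64 GB per node
print(cluster_tb, nodes)  # 4.5 TB, 72 nodes
```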
Latency budget breakdown for social feed: User request has a 200ms p95 budget. Edge CDN for static assets takes 30ms, load balancer 8ms, app server work including fan-out to 5 cache nodes takes 60ms (the lookups run in parallel at roughly 12ms each, so the cache portion is bounded by the slowest lookup, with the remainder being application processing), database fallback for cache misses 35ms, and network overhead 25ms. Total 158ms, leaving 42ms slack for retries and jitter.
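A sketch of the budget arithmetic, assuming serial hops add while a parallel fan-out contributes only the maximum of its branches:

```python
hops_ms = {"cdn": 30, "lb": 8, "app": 60, "db_fallback": 35, "network": 25}
slo_ms = 200

# Parallel fan-out: 5 cache lookups at ~12 ms each contribute max(...), not sum(...).
cache_lookups_ms = [12, 12, 12, 12, 12]
parallel_cache_ms = max(cache_lookups_ms)  # 12 ms, already inside the 60 ms app step

total = sum(hops_ms.values())
print(total, slo_ms - total)  # 158 ms used, 42 ms slack for retries and jitter
```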