
Scaling Decision Framework: When and How to Scale

The First Question: Should You Scale at All? Before adding infrastructure, ask whether optimization solves the problem cheaper. A single PostgreSQL instance handles 10,000 to 50,000 queries per second with proper indexing. Adding an index takes minutes and costs nothing. Adding read replicas takes days and costs hundreds per month. Many teams scale prematurely when a missing index or N+1 query is the real bottleneck.
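A quick way to test whether optimization is the cheaper fix is to ask the database how it plans to run your hottest query before provisioning anything. Below is a minimal sketch, assuming a PostgreSQL instance reachable via psycopg2 and a hypothetical orders table queried by an unindexed customer_id column; the connection string, table, and index names are illustrative only.

```python
# Sketch: check for a missing index before reaching for read replicas.
# Assumes PostgreSQL and psycopg2; "orders" and "customer_id" are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # hypothetical DSN
conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run inside a transaction

cur = conn.cursor()

# Ask the planner how it executes the hot query.
cur.execute("EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = %s", (42,))
plan = "\n".join(row[0] for row in cur.fetchall())
print(plan)

# A "Seq Scan on orders" line means every row is read for each lookup.
# The minutes-long fix, versus days of replica setup:
if "Seq Scan on orders" in plan:
    cur.execute(
        "CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id)"
    )
```

If the plan switches from a sequential scan to an index scan after this, the bottleneck was the query, not the hardware.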
Vertical First: Under 10K RPS, a single node handles it. Upgrade CPU and RAM for simpler ops and lower cost.
vs.
Horizontal When: Above 50K RPS, or when you have high-availability, multi-region latency, or fault-tolerance requirements.
Metrics That Trigger Scaling: Scale when you see these sustained for 15+ minutes during peak hours: CPU utilization above 70% means requests are waiting for compute. Memory above 80% risks swap thrashing and garbage collection storms. Request queue depth growing means throughput capacity is exceeded. p99 latency exceeding 5x your p50 indicates resource contention. Error rate above 0.1% from timeouts or connection failures signals overload.

The Cost Reality: Horizontal scaling has hidden costs beyond server bills. Ten small servers need configuration management, deployment pipelines hitting 10 targets, monitoring dashboards tracking 10 instances, and on-call runbooks for distributed failure modes. A single large server costs more per unit of compute but less in operational overhead. For a startup with 2 engineers, operational simplicity often beats theoretical scalability. The break-even point varies, but typically: below 5,000 requests per second, vertical scaling wins on total cost. Between 5,000 and 50,000 requests per second, either works depending on availability requirements. Above 50,000 requests per second, horizontal scaling becomes necessary regardless of preference.
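The trigger thresholds above are easy to encode as an explicit check against whatever metrics pipeline you already run. Here is a minimal sketch, assuming metric samples are collected elsewhere and passed in as plain numbers; the names and thresholds simply mirror this section and are not tied to any particular monitoring product.

```python
from dataclasses import dataclass

@dataclass
class MetricWindow:
    """Peak-hour metrics, each sustained over the same 15-minute window."""
    cpu_util: float            # fraction of CPU in use, 0.0-1.0
    mem_util: float            # fraction of memory in use, 0.0-1.0
    queue_depth_growing: bool  # request queue depth trending upward
    p50_ms: float              # median latency
    p99_ms: float              # tail latency
    error_rate: float          # fraction of requests failing, 0.0-1.0

def scaling_triggers(m: MetricWindow) -> list[str]:
    """Return the reasons (if any) that this window justifies scaling."""
    reasons = []
    if m.cpu_util > 0.70:
        reasons.append("CPU above 70%: requests are waiting for compute")
    if m.mem_util > 0.80:
        reasons.append("memory above 80%: risk of swap thrashing / GC storms")
    if m.queue_depth_growing:
        reasons.append("queue depth growing: throughput capacity exceeded")
    if m.p99_ms > 5 * m.p50_ms:
        reasons.append("p99 exceeds 5x p50: resource contention")
    if m.error_rate > 0.001:
        reasons.append("error rate above 0.1%: timeouts / connection failures")
    return reasons

# Example: high CPU and a fat latency tail, but healthy memory and error rate.
window = MetricWindow(cpu_util=0.82, mem_util=0.61, queue_depth_growing=False,
                      p50_ms=40, p99_ms=260, error_rate=0.0004)
for reason in scaling_triggers(window):
    print(reason)
```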
✓ Decision Checklist: Scale horizontally when you need: (1) 99.99%+ availability with automated failover, (2) geographic distribution for latency, (3) sustained throughput beyond single node limits, or (4) workload isolation between tenants. Otherwise, scale vertically.
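The checklist itself can be written down as a function so the reasoning is explicit in a design review. This is a sketch of the decision rule exactly as stated above; the 50K RPS ceiling is the rough figure from this section, not a universal constant.

```python
def choose_scaling(rps: int,
                   needs_four_nines: bool,
                   needs_geo_distribution: bool,
                   needs_tenant_isolation: bool) -> str:
    """Encode the decision checklist: go horizontal only when one of the
    listed requirements holds or throughput exceeds a single node."""
    beyond_single_node = rps > 50_000  # rough single-node ceiling from this section
    if (needs_four_nines or needs_geo_distribution
            or needs_tenant_isolation or beyond_single_node):
        return "horizontal"
    return "vertical"

# A 3K RPS service with no HA or geo requirements: optimize and scale up.
print(choose_scaling(3_000, False, False, False))   # -> vertical
# A 20K RPS multi-region product with a 99.99% SLO: scale out.
print(choose_scaling(20_000, True, True, False))    # -> horizontal
```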
💡 Key Takeaways
Before scaling, optimize first: adding a database index takes minutes and costs nothing, while adding replicas takes days and costs hundreds per month
Scale triggers: CPU above 70%, memory above 80%, growing queue depth, p99 exceeding 5x p50, or error rates above 0.1% sustained for 15+ minutes at peak
Below 5K requests per second vertical usually wins on total cost; 5K to 50K either works; above 50K horizontal becomes necessary regardless of preference
Horizontal scaling hidden costs include configuration management, deployment complexity, monitoring overhead, and distributed system debugging skills
Choose horizontal when you need 99.99%+ availability, geographic distribution, sustained throughput beyond single node limits, or workload isolation between tenants
📌 Examples
1. A startup scaled from 1 to 4 PostgreSQL read replicas to handle load, then discovered a missing index was causing full table scans; after adding the index, they scaled back to 1 replica and saved $3,000 per month
2. Stripe processes millions of transactions on a relatively small number of powerful database servers, relying on vertical scaling plus careful query optimization before resorting to horizontal sharding
3. Discord started with a monolithic Python application on large servers before gradually decomposing into microservices as they grew past 10 million concurrent users and needed independent scaling