What is Scalability? Core Definition and Scaling Dimensions
Scalability is the property of a system to increase capacity (throughput, data size, concurrency) while keeping user-visible performance within a defined budget, for example keeping 95th percentile (p95) latency under 200 milliseconds for an interactive API. This is not just about handling more traffic, but doing so without degrading the experience for existing users.
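To make the p95 budget concrete, here is a minimal sketch of checking a latency SLO over a window of request latencies, using nearest-rank percentile selection. The sample values and the 200 ms budget are illustrative, not measurements from a real system:

```python
import math

def p95(latencies_ms):
    """Return the 95th-percentile latency using nearest-rank selection."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered)) - 1  # nearest-rank index
    return ordered[rank]

# Hypothetical window of 20 request latencies in milliseconds.
window = [120, 95, 180, 210, 130, 140, 160, 110, 150, 125,
          135, 145, 155, 165, 100, 115, 175, 190, 105, 220]

print(p95(window))          # 210 -- the 95th-percentile latency
print(p95(window) <= 200)   # False -- this window misses the 200 ms budget
```

In production you would typically read such percentiles from a metrics system rather than compute them by hand, but the definition is the same: 95% of requests finish at or below this value.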
Two orthogonal scaling motions exist: vertical scaling (scaling up to bigger machines) and horizontal scaling (scaling out to more machines). Vertical scaling is conceptually simple because you keep a single-node architecture, but you hit hard limits: the largest cloud instances offer around 400 CPU cores and several terabytes of RAM, and a single node creates a single point of failure. Horizontal scaling can grow to thousands of nodes and improves availability through redundancy, but requires distribution mechanisms like load balancing, partitioning, and replication.
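As a sketch of one such distribution mechanism, hash-based partitioning assigns each key to one of N nodes by hashing the key. The key format and node count below are illustrative:

```python
import hashlib

def partition_for(key: str, num_nodes: int) -> int:
    """Map a key to one of num_nodes partitions via a stable hash."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

# Illustrative: route user records across 4 nodes.
for user in ["user:1001", "user:1002", "user:1003"]:
    print(user, "-> node", partition_for(user, 4))
```

One design note: naive modulo hashing like this remaps most keys whenever `num_nodes` changes, which is why real systems typically use consistent hashing or fixed virtual partitions to keep data movement small during resizing.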
The choice between these approaches depends on your growth trajectory and availability requirements. If you are processing 500 requests per second (RPS) with headroom to grow to 5,000 RPS on a single large machine, vertical scaling defers complexity. Once you need to handle sustained tens of thousands of RPS, absorb multi-region failures, or store petabytes, horizontal scaling becomes necessary despite its operational cost.
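The capacity-planning arithmetic behind this decision can be sketched as follows. The numbers (target RPS, per-node capacity, and a 70% utilization cap to leave headroom for spikes and node loss) are assumptions for illustration:

```python
import math

def nodes_needed(target_rps: float, per_node_rps: float,
                 utilization_cap: float = 0.7) -> int:
    """Estimate node count, running each node below its maximum capacity."""
    return math.ceil(target_rps / (per_node_rps * utilization_cap))

# Illustrative: 30,000 RPS sustained, 5,000 RPS per node, 70% utilization cap.
print(nodes_needed(30_000, 5_000))  # 9 nodes
```

If the answer is 1 and stays that way for the foreseeable future, vertical scaling is likely the simpler choice; once the count is meaningfully greater than 1, the distribution mechanisms above become unavoidable.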
💡 Key Takeaways
•Scalability means increasing capacity while maintaining performance SLOs (Service Level Objectives), such as keeping p95 latency under 200 ms as load grows from 1,000 to 100,000 RPS
•Vertical scaling is limited by hardware ceilings (largest instances around 400 cores, several TB RAM) and creates single points of failure with no redundancy
•Horizontal scaling can grow to thousands of nodes and improves availability, but requires distribution primitives like load balancers, partitioning schemes, and replication protocols
•The practical threshold for moving to horizontal scaling is often when sustained traffic exceeds what a single large instance can handle, typically around 10,000 to 50,000 RPS depending on workload
📌 Examples
Netflix scales horizontally with thousands of stateless microservices behind load balancers, handling millions of concurrent streams globally with regional failover
Reddit initially scaled vertically with PostgreSQL before sharding and introducing caching layers when single database performance became a bottleneck at high traffic volumes