The Three Planes of Horizontal Scaling

Horizontal scaling decomposes a system along three orthogonal planes, each addressing different bottlenecks. The compute plane focuses on stateless application servers behind load balancers. Session state is externalized to caches, databases, or JSON Web Tokens (JWTs) so that any server can handle any request, enabling elastic scaling and zero downtime deployments. This is the easiest plane to scale because you simply add more identical nodes.

The data plane splits into two patterns: replication for read scaling and availability, and partitioning (sharding) for write scaling. Read replicas multiply your read capacity linearly; if one primary handles 5,000 read queries per second (QPS), three replicas can serve 15,000 QPS for read heavy workloads. Sharding divides data by a partition key (like user_id) so writes spread across multiple primaries; with 10 shards, you can handle 10x the write throughput. The challenge is choosing partition keys that distribute load evenly and keep related data colocated to avoid expensive cross shard queries.

The delivery plane moves bytes closer to users via Content Delivery Networks (CDNs) and edge caches, compresses responses, and offloads expensive tasks to asynchronous pipelines. Netflix serves over 90% of video bytes from Internet Service Provider (ISP) embedded edge appliances, each delivering tens to hundreds of gigabits per second. This reduces origin load by orders of magnitude and keeps startup latency under a couple of seconds globally. For dynamic content, the delivery plane includes precomputation (materialized views, feed generation) and background processing queues that decouple heavy fanout from the critical user path.

💡 Key Takeaways

•Compute plane scaling is the simplest: stateless services behind load balancers with externalized session state allow you to add capacity linearly by launching more identical nodes

•Data plane replication multiplies read capacity (3 replicas = 3x read QPS) but writes remain bottlenecked on the primary; sharding spreads writes across multiple primaries but complicates cross shard queries and requires careful partition key selection

•Delivery plane techniques like CDNs can achieve over 90% hit ratios for static assets, reducing origin requests per second (RPS) by 10x or more and absorbing traffic spikes with edge capacity measured in terabits per second

•The data plane forces a fundamental trade off: replication helps reads and availability but adds eventual consistency lag (often milliseconds to seconds under load), while sharding helps writes but creates data locality and cross shard join challenges

📌 Examples

Meta uses massive distributed cache tiers (hundreds of terabytes per region) in front of the social graph database, serving millions of QPS with sub millisecond latency and reducing database read load by over an order of magnitude

Twitter uses hybrid fanout: push tweets to online follower timelines asynchronously via queues (fanout on write) for active users, and pull on demand (fanout on read) for inactive accounts to balance write amplification and read latency

Reddit relies on CDN edge caching for images and static assets with over 90% hit ratios, keeping origin RPS manageable during front page traffic spikes and reducing egress bandwidth costs

← Back to Scalability Fundamentals Overview