
Horizontal Scaling: Model Replication and Load Balancing

Replicating Models for Throughput

A single GPU running a model has a throughput ceiling. Once batching is optimized, the only way to handle more requests is to add more GPUs. Horizontal scaling deploys multiple model replicas behind a load balancer: ten replicas each handling 100 requests per second serve 1,000 requests per second in total. Each replica is independent, so no state is shared and no synchronization is needed during inference.

The challenge is replica management. Loading a large model takes 30-60 seconds, so a crashed replica cannot be replaced instantly. This cold-start latency means you need spare capacity already running to absorb failures and traffic spikes. Most production systems run at 60-70% of peak capacity, keeping 30-40% headroom for resilience.
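To make the headroom math concrete, here is a minimal capacity-planning sketch. The function name and the 65% target utilization are illustrative placeholders, not part of any particular serving framework:

```python
import math

def replicas_needed(peak_rps: float, per_replica_rps: float,
                    target_utilization: float = 0.65) -> int:
    """Replicas to provision so steady-state load sits at the target
    utilization, leaving headroom for failures and traffic spikes."""
    return math.ceil(peak_rps / (per_replica_rps * target_utilization))

# Example: 1000 RPS peak at 100 RPS per replica with ~65% target utilization
# -> 16 replicas instead of the bare-minimum 10, i.e. ~35% headroom.
print(replicas_needed(1000, 100, 0.65))  # 16
```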

Load Balancing Strategies

Simple round-robin load balancing ignores request complexity. A long document-summarization request can take 10x longer than a short classification, so if one replica receives several long requests, work queues behind it while other replicas sit idle. Weighted load balancing accounts for request characteristics: estimate processing time from input length and route to the least-loaded replica.
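A minimal sketch of weighted routing, assuming a crude linear cost model in which processing time scales with input length. The per-token constant, the Replica bookkeeping, and the names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    outstanding_work: float = 0.0  # estimated seconds of work already queued

def estimate_cost(input_tokens: int, per_token_seconds: float = 0.002) -> float:
    # Crude linear cost model: longer inputs take proportionally longer.
    return input_tokens * per_token_seconds

def route(replicas: list[Replica], input_tokens: int) -> Replica:
    cost = estimate_cost(input_tokens)
    target = min(replicas, key=lambda r: r.outstanding_work)  # least-loaded replica
    target.outstanding_work += cost  # decremented again when the request completes
    return target

# A 4000-token summarization goes to whichever replica has the least estimated
# work queued, rather than simply the next replica in round-robin order.
pool = [Replica("replica-a"), Replica("replica-b"), Replica("replica-c")]
print(route(pool, 4000).name)
```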

Queue depth is a useful metric for load balancing. Route requests to replicas with the shortest queues. This naturally balances load even when request complexity varies. Some systems expose queue depth as a health signal - a replica with growing queue depth receives fewer new requests.
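A shortest-queue router might look like the following sketch; the health threshold of 50 queued requests is an illustrative number, not a recommendation:

```python
def pick_replica(queue_depths: dict[str, int], max_healthy_depth: int = 50) -> str:
    """Route to the replica with the shortest queue; replicas whose queue depth
    exceeds the health threshold are skipped unless every replica is over it."""
    healthy = {r: d for r, d in queue_depths.items() if d <= max_healthy_depth}
    candidates = healthy or queue_depths  # degrade gracefully if all are backed up
    return min(candidates, key=candidates.get)

print(pick_replica({"replica-a": 12, "replica-b": 3, "replica-c": 77}))  # replica-b
```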

💡 Key Insight: ML inference has high variance in processing time based on input length and complexity. Load balancing must account for this, unlike web services where request times are more uniform.

Autoscaling Considerations

Autoscaling ML services is harder than web services because scale-up is slow (30-60 second model load) and GPUs are expensive. Scale based on queue depth and latency percentiles, not just CPU utilization. Set conservative scale-down policies to avoid thrashing - removing a replica and immediately needing it back is wasteful.
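One way to express such a policy is a small decision function like the sketch below. The queue-depth and latency thresholds and the 10-minute scale-down cooldown are placeholder values you would tune per workload:

```python
def desired_replicas(
    current: int,
    avg_queue_depth: float,
    p99_latency_ms: float,
    seconds_since_last_scale_down: float,
    scale_down_cooldown_s: float = 600.0,  # conservative: wait 10 min before shrinking
) -> int:
    """Pick a replica count from queue depth and tail latency rather than CPU.

    Scaling up too early only costs money; scaling down too eagerly forces a
    30-60 second cold start as soon as traffic returns.
    """
    if avg_queue_depth > 10 or p99_latency_ms > 2000:
        return current + 1  # back-pressure or slow tails: add a replica
    if avg_queue_depth < 2 and p99_latency_ms < 500:
        if seconds_since_last_scale_down > scale_down_cooldown_s:
            return max(1, current - 1)  # calm and past cooldown: remove one
    return current

# Deep queues trigger a scale-up regardless of how recently we scaled down.
print(desired_replicas(current=10, avg_queue_depth=25, p99_latency_ms=1800,
                       seconds_since_last_scale_down=60))  # 11
```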

💡 Key Takeaways
- Horizontal scaling deploys independent model replicas behind a load balancer: 10 replicas at 100 RPS each = 1000 RPS total.
- Model loading takes 30-60 seconds, so production systems run at 60-70% of peak capacity with 30-40% headroom for failures.
- Round-robin ignores request complexity; use queue-depth or weighted balancing based on input length.
- Autoscale on queue depth and latency percentiles, not CPU, and use conservative scale-down policies to avoid thrashing.
📌 Interview Tips
1. Explain why ML horizontal scaling differs from web scaling: 30-60 second cold starts require running spare capacity.
2. Describe queue-depth load balancing: route to the replica with the shortest queue, which naturally handles variable request complexity.
3. For autoscaling, recommend latency percentiles (P99) over CPU metrics; GPU utilization is complex to measure.