Horizontal Scaling: Model Replication and Load Balancing
Replicating Models for Throughput
A single GPU running a model has a throughput ceiling. Once batching is optimized, the only way to handle more requests is more GPUs. Horizontal scaling deploys multiple model replicas behind a load balancer: ten replicas each handling 100 requests per second serve 1,000 requests per second in total. Each replica is independent, so there is no state sharing and no synchronization needed during inference.
The challenge is replica management. Loading a large model takes 30-60 seconds, so a crashed replica cannot be replaced instantly. That cold-start latency means you need spare capacity already running to absorb failures and traffic spikes. Most production systems run at 60-70% of peak capacity, keeping 30-40% headroom for resilience.
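As a rough illustration of that headroom math, the sketch below sizes a fleet from an assumed peak request rate and per-replica throughput. The 65% target utilization and the example numbers are assumptions, not a prescription.

```python
import math

def replicas_needed(peak_rps: float, per_replica_rps: float,
                    target_utilization: float = 0.65) -> int:
    """Size the fleet so that at peak traffic each replica runs at
    roughly 65% of its ceiling, leaving headroom for crashed replicas,
    cold starts, and traffic spikes."""
    return math.ceil(peak_rps / (per_replica_rps * target_utilization))

# Assumed numbers: 1,000 RPS peak, 100 RPS per replica.
# 10 replicas would run at 100% utilization; 16 keeps ~35-40% headroom.
print(replicas_needed(1000, 100))  # -> 16
```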
Load Balancing Strategies
Simple round-robin load balancing ignores request complexity. A long document-summarization request can take 10x longer than a short classification request. If one replica receives several long requests, work queues up there while other replicas sit idle. Weighted load balancing considers request characteristics: estimate processing time from input length and route each request to the least loaded replica.
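A minimal sketch of that idea, assuming a linear cost model where processing time scales with input length. The replica names, the 5 ms-per-token estimate, and the bookkeeping are illustrative, not any particular load balancer's API.

```python
class EstimatedWorkBalancer:
    """Route each request to the replica with the least estimated
    outstanding work, using input length as a rough cost proxy."""

    def __init__(self, replicas):
        # Estimated seconds of work currently assigned to each replica.
        self.load = {r: 0.0 for r in replicas}

    def estimate_cost(self, input_tokens: int) -> float:
        return input_tokens * 0.005  # assumed ~5 ms per input token

    def route(self, input_tokens: int) -> str:
        replica = min(self.load, key=self.load.get)
        self.load[replica] += self.estimate_cost(input_tokens)
        return replica

    def complete(self, replica: str, input_tokens: int) -> None:
        # Release the request's estimated cost once it finishes.
        self.load[replica] = max(0.0, self.load[replica] - self.estimate_cost(input_tokens))

balancer = EstimatedWorkBalancer(["replica-0", "replica-1", "replica-2"])
balancer.route(4000)  # long summarization lands on the least loaded replica
balancer.route(50)    # short classification goes to a different, idle replica
```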
Queue depth is a useful metric for load balancing. Route requests to replicas with the shortest queues. This naturally balances load even when request complexity varies. Some systems expose queue depth as a health signal - a replica with growing queue depth receives fewer new requests.
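One way to express shortest-queue routing with that health signal, assuming each replica reports its own queue depth. The threshold and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class ReplicaStats:
    name: str
    queue_depth: int   # requests currently waiting, as reported by the replica
    healthy: bool = True

def pick_replica(replicas, max_queue: int = 32):
    """Join-shortest-queue: pick the replica with the fewest queued
    requests, skipping any whose queue has grown past the threshold."""
    candidates = [r for r in replicas if r.healthy and r.queue_depth < max_queue]
    if not candidates:
        raise RuntimeError("no replica available; shed load or scale up")
    return min(candidates, key=lambda r: r.queue_depth)

replicas = [ReplicaStats("replica-0", 5), ReplicaStats("replica-1", 2),
            ReplicaStats("replica-2", 40)]   # replica-2 is over the threshold
print(pick_replica(replicas).name)           # -> replica-1
```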
Autoscaling Considerations
Autoscaling ML services is harder than autoscaling web services because scale-up is slow (a 30-60 second model load) and GPUs are expensive. Scale on queue depth and latency percentiles, not just CPU utilization, and set conservative scale-down policies to avoid thrashing: removing a replica only to need it back immediately is wasteful.
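A sketch of such a policy, driven by queue depth and p95 latency with a scale-down cooldown. The thresholds, step sizes, and cooldown period are assumptions chosen to show the shape of the logic, not tuned values.

```python
import time

class Autoscaler:
    """Scale up eagerly (cold starts take 30-60 s), scale down cautiously."""

    def __init__(self, scale_down_cooldown_s: float = 600.0):
        self.cooldown = scale_down_cooldown_s
        self.last_scale_down = 0.0

    def decide(self, avg_queue_depth: float, p95_latency_ms: float,
               replicas: int) -> int:
        # Scale up immediately when queues or tail latency grow.
        if avg_queue_depth > 8 or p95_latency_ms > 2000:
            return replicas + max(1, replicas // 4)
        # Scale down only when clearly idle and outside the cooldown window.
        idle = avg_queue_depth < 1 and p95_latency_ms < 500
        if idle and replicas > 1 and time.time() - self.last_scale_down > self.cooldown:
            self.last_scale_down = time.time()
            return replicas - 1
        return replicas

scaler = Autoscaler()
scaler.decide(avg_queue_depth=12, p95_latency_ms=1800, replicas=8)   # -> 10
scaler.decide(avg_queue_depth=0.2, p95_latency_ms=300, replicas=8)   # -> 7
```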