Autoscaling Architecture: Matching Capacity to Demand
Autoscaling: Automatically adjusting compute capacity based on demand. Scale up when load increases (more requests, larger queue), scale down when load decreases. The goal is right-sizing: enough capacity to meet SLAs without paying for idle resources.
Scaling Metrics
Reactive metrics: CPU utilization, memory usage, request latency. When utilization exceeds a threshold (e.g., 70%), add capacity. Simple but lagging: by the time utilization is high, users may already be experiencing degradation. Predictive metrics: Queue depth, request-rate trend, time-of-day patterns. Scale before load arrives based on leading indicators. More complex, but better for user experience. For ML inference, request queue depth is often the best signal: if requests are waiting, you need more capacity regardless of current utilization.
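A minimal sketch of queue-depth-driven scaling, using the proportional formula popularized by the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current * observed / target)). The function name and the target of 10 queued requests per replica are illustrative assumptions, not values from the text:

```python
import math

def desired_replicas(current_replicas: int, queue_depth: int,
                     target_queue_per_replica: int = 10) -> int:
    """Proportional scaling on queue depth (HPA-style):
    desired = ceil(current * observed_per_replica / target_per_replica).
    Note that for queue depth this simplifies to ceil(queue / target)."""
    if current_replicas == 0:
        # Bootstrapping from zero: size directly off the queue.
        return math.ceil(queue_depth / target_queue_per_replica)
    observed_per_replica = queue_depth / current_replicas
    return math.ceil(current_replicas * observed_per_replica
                     / target_queue_per_replica)
```

With 3 replicas and 45 queued requests against a target of 10 per replica, this asks for 5 replicas; an empty queue asks for 0, which the capacity bounds discussed below would clamp back up to the minimum.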
Scale-Up vs Scale-Down Asymmetry
Scaling up should be fast and aggressive, because users are waiting. Scaling down should be slow and conservative, because premature scale-down causes thrashing (scale down, load increases, scale up, repeat). A common pattern: scale up when the metric exceeds its threshold for 1 minute; scale down only when it stays below the threshold for 10 minutes. The asymmetry reflects that over-provisioning costs money, while under-provisioning costs users. For ML serving, add a cooldown period after model loading so that slow warmup does not itself trigger thrashing.
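The asymmetric hold durations above can be sketched as a small state machine. The class name, the 0.7/0.4 thresholds, and the 60 s/600 s hold times are illustrative assumptions matching the "1 minute up, 10 minutes down" pattern in the text:

```python
class AsymmetricScaler:
    """Emit a scaling decision only after the metric has held past its
    threshold for an asymmetric duration: fast up, slow down."""

    def __init__(self, up_threshold: float = 0.7, down_threshold: float = 0.4,
                 up_hold_s: float = 60.0, down_hold_s: float = 600.0):
        self.up_threshold = up_threshold
        self.down_threshold = down_threshold
        self.up_hold_s = up_hold_s
        self.down_hold_s = down_hold_s
        self._above_since = None  # timestamp when metric first exceeded up_threshold
        self._below_since = None  # timestamp when metric first dropped below down_threshold

    def decide(self, metric: float, now: float) -> str:
        if metric > self.up_threshold:
            self._below_since = None
            if self._above_since is None:
                self._above_since = now
            if now - self._above_since >= self.up_hold_s:
                self._above_since = None  # reset after firing
                return "scale_up"
        elif metric < self.down_threshold:
            self._above_since = None
            if self._below_since is None:
                self._below_since = now
            if now - self._below_since >= self.down_hold_s:
                self._below_since = None
                return "scale_down"
        else:
            # Metric in the dead band between thresholds: reset both timers.
            self._above_since = None
            self._below_since = None
        return "hold"
```

A utilization spike must persist a full minute before triggering scale-up, while a lull must persist ten minutes before triggering scale-down; a warmup cooldown could be layered on by suppressing `decide` calls for some period after a replica loads its model.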
Capacity Limits
Autoscaling needs bounds. Minimum capacity ensures baseline availability (at least 2 replicas for redundancy). Maximum capacity prevents runaway costs from traffic spikes or bugs. Without maximums, a sudden traffic surge or infinite loop could spin up hundreds of expensive GPU instances before anyone notices. Set maximums based on budget and realistic traffic projections, with alerts when approaching limits.
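These bounds amount to a clamp on whatever replica count the scaling metric asks for, plus an alert signal when the fleet nears its ceiling. The function name, the 20-replica maximum, and the 80% alert fraction are illustrative assumptions:

```python
def clamp_replicas(desired: int, min_replicas: int = 2,
                   max_replicas: int = 20,
                   alert_fraction: float = 0.8) -> tuple[int, bool]:
    """Clamp a desired replica count to [min_replicas, max_replicas].
    Returns (bounded_count, near_limit) where near_limit flags that the
    fleet is within alert_fraction of the maximum and should page someone."""
    bounded = max(min_replicas, min(desired, max_replicas))
    near_limit = bounded >= alert_fraction * max_replicas
    return bounded, near_limit
```

A runaway request for 500 GPU replicas is capped at 20 (and flagged), while a quiet period still keeps the 2-replica redundancy floor.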
Cost Efficiency: Well-tuned autoscaling can reduce infrastructure costs 40-60% compared to static provisioning for peak load, while maintaining the same SLA.
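A back-of-envelope calculation shows where savings of this order come from. The hourly demand profile below is entirely made up (a typical diurnal shape), chosen only to illustrate the comparison between paying for peak capacity around the clock and paying for actual demand with a redundancy floor:

```python
# Hypothetical hourly replica demand over one day (illustrative data).
hourly_demand = [3, 3, 2, 2, 2, 3, 5, 8, 12, 14, 15, 15,
                 14, 13, 14, 15, 14, 12, 10, 8, 6, 5, 4, 3]

min_replicas = 2  # redundancy floor, as in the capacity-limits section

# Static provisioning: pay for peak capacity all 24 hours.
static_replica_hours = max(hourly_demand) * len(hourly_demand)

# Idealized autoscaling: pay for demand, never dropping below the floor.
auto_replica_hours = sum(max(d, min_replicas) for d in hourly_demand)

savings = 1 - auto_replica_hours / static_replica_hours
print(f"static: {static_replica_hours} replica-hours, "
      f"autoscaled: {auto_replica_hours}, savings: {savings:.0%}")
```

For this profile the idealized savings land at roughly 44%, inside the 40-60% range; real savings depend on how peaky the traffic is and on the scale-down conservatism described above.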