
Autoscaling Architecture: Matching Capacity to Demand

Autoscaling: Automatically adjusting compute capacity based on demand. Scale up when load increases (more requests, larger queue), scale down when load decreases. The goal is right-sizing: enough capacity to meet SLAs without paying for idle resources.

Scaling Metrics

Reactive metrics: CPU utilization, memory usage, request latency. When utilization exceeds a threshold (e.g., 70%), add capacity. Simple but lagging: by the time utilization is high, users may already be experiencing degradation.

Predictive metrics: Queue depth, request-rate trend, time-of-day patterns. Scale before load arrives based on leading indicators. More complex, but better user experience.

For ML inference, request queue depth is often the best signal: if requests are waiting, you need more capacity regardless of current utilization.
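A minimal sketch of sizing from queue depth, in the spirit of a target-tracking autoscaler. The function name and the `target_per_replica` parameter (how many queued requests one replica should absorb) are illustrative assumptions, not a specific framework's API:

```python
import math

def desired_replicas(queue_depth: int, target_per_replica: int = 10) -> int:
    """Size capacity from the queue: if requests are waiting, add replicas
    regardless of current CPU utilization (queue depth is a leading signal)."""
    needed = math.ceil(queue_depth / target_per_replica)
    # Always keep at least one replica serving.
    return max(needed, 1)
```

With 45 requests queued and a target of 10 per replica, this asks for 5 replicas even if current CPU utilization still looks healthy.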

Scale-Up vs Scale-Down Asymmetry

Scaling up should be fast and aggressive, because users are waiting. Scaling down should be slow and conservative, because premature scale-down causes thrashing (scale down, load increases, scale up, repeat). A common pattern: scale up when the metric exceeds its threshold for 1 minute; scale down only when it stays below the threshold for 10 minutes. The asymmetry reflects that over-provisioning costs money, while under-provisioning costs users. For ML serving, add cooldown periods after model loading so that slow warmup does not trigger thrashing.
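The asymmetric windows above can be sketched as a small state machine. The class name, thresholds, and window lengths are hypothetical defaults chosen to match the 1-minute / 10-minute pattern in the text:

```python
class AsymmetricScaler:
    """Scale up after the metric exceeds its threshold for a short window;
    scale down only after it stays below a lower threshold for a long window."""

    def __init__(self, up_threshold=0.7, down_threshold=0.5,
                 up_window=60, down_window=600):
        self.up_threshold = up_threshold      # e.g., 70% utilization
        self.down_threshold = down_threshold  # hysteresis band prevents flapping
        self.up_window = up_window            # sustain 1 min before scaling up
        self.down_window = down_window        # sustain 10 min before scaling down
        self._above_since = None
        self._below_since = None

    def decide(self, utilization: float, now: float) -> str:
        if utilization > self.up_threshold:
            self._below_since = None
            if self._above_since is None:
                self._above_since = now
            if now - self._above_since >= self.up_window:
                self._above_since = None
                return "scale_up"
        elif utilization < self.down_threshold:
            self._above_since = None
            if self._below_since is None:
                self._below_since = now
            if now - self._below_since >= self.down_window:
                self._below_since = None
                return "scale_down"
        else:
            # Inside the hysteresis band: reset both timers, hold steady.
            self._above_since = None
            self._below_since = None
        return "hold"
```

The gap between `up_threshold` and `down_threshold` is a hysteresis band: utilization hovering around a single threshold would otherwise flip the decision on every sample.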

Capacity Limits

Autoscaling needs bounds. Minimum capacity ensures baseline availability (at least 2 replicas for redundancy). Maximum capacity prevents runaway costs from traffic spikes or bugs. Without maximums, a sudden traffic surge or infinite loop could spin up hundreds of expensive GPU instances before anyone notices. Set maximums based on budget and realistic traffic projections, with alerts when approaching limits.
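A sketch of enforcing those bounds, with an alert as the cap approaches. The function name, the default limits, and the 80% alert fraction are illustrative assumptions:

```python
def clamp_replicas(desired: int, min_replicas: int = 2,
                   max_replicas: int = 20) -> int:
    """Bound the autoscaler's output: a floor for redundancy,
    a ceiling for cost safety."""
    bounded = max(min_replicas, min(desired, max_replicas))
    if bounded >= 0.8 * max_replicas:
        # In production this would page or emit a metric, not print.
        print(f"warning: {bounded}/{max_replicas} replicas, nearing the cap")
    return bounded
```

A runaway request (say, a retry loop asking for 500 GPU replicas) gets clamped to 20 and raises an alert instead of silently spending the budget.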

Cost Efficiency: Well-tuned autoscaling can reduce infrastructure costs by 40-60% compared to statically provisioning for peak load, while maintaining the same SLA.
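A back-of-the-envelope illustration of where a number in that range can come from. The hourly demand profile and the per-replica-hour price are entirely hypothetical; the point is only that static provisioning pays for the peak around the clock, while autoscaling tracks the curve:

```python
# Hypothetical replicas needed per hour over a diurnal traffic cycle.
hourly_demand = [3, 2, 2, 2, 2, 3, 5, 8, 12, 14, 15, 16,
                 16, 15, 14, 13, 12, 10, 8, 6, 5, 4, 3, 3]

price_per_replica_hour = 1.0  # hypothetical instance price, $/replica-hour

# Static provisioning: pay for peak capacity all 24 hours.
static_cost = max(hourly_demand) * 24 * price_per_replica_hour

# Autoscaling: pay for what each hour needs, with a floor of 2 for redundancy.
auto_cost = sum(max(d, 2) for d in hourly_demand) * price_per_replica_hour

savings = 1 - auto_cost / static_cost  # lands near 50% for this profile
```

Flatter traffic narrows the gap; spikier traffic widens it, which is why the savings are a range rather than a constant.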

💡 Key Takeaways
- Queue depth is often the best ML inference scaling signal
- Scale up fast (1 minute), scale down slow (10 minutes) to prevent thrashing
- Set maximum limits to prevent runaway costs from traffic spikes or bugs
📌 Interview Tips
1. Scale up at 70% utilization; scale down after 10 minutes below the threshold
2. Autoscaling reduces costs 40-60% vs static provisioning for peak load