ML Infrastructure & MLOps · Cost Optimization (Spot Instances, Autoscaling)

Autoscaling Architecture: Matching Capacity to Demand

Autoscaling dynamically adjusts computing resources to match actual workload demand, eliminating the cost of idle capacity while maintaining performance guarantees. Production ML systems typically implement two layers: horizontal autoscaling adds or removes service replicas or worker containers, while cluster autoscaling provisions or terminates the underlying machines those containers run on.

The key is choosing the right signals. CPU utilization is convenient but often misleading: a recommendation service might be bottlenecked on memory or disk I/O (IOPS), showing 40 percent CPU while p95 latency climbs to 500 milliseconds. Better signals include request rate, concurrent requests per replica, queue depth for batch workers, and actual latency percentiles. For example, scale out when p95 latency exceeds 100 milliseconds for 2 consecutive minutes, or when queue depth exceeds 1,000 pending tasks.

Production implementations separate scale out and scale in thresholds to prevent flapping. Scale out aggressively when latency crosses 100 milliseconds; scale in conservatively, only after p95 stays below 50 milliseconds for 10 minutes. Add stabilization windows so rapid oscillations do not trigger constant churn. Airbnb's pricing prediction service scales from 30 replicas at night to 200 replicas during booking peaks, automatically adjusting every few minutes based on request rate and maintaining sub 50 millisecond p95 latency throughout.

Combining autoscaling with Spot creates powerful cost efficiency. A baseline of 30 percent on demand capacity handles typical load, and Spot instances provide the burstable 70 percent during peaks. When load drops, the autoscaler terminates Spot nodes first, preserving the stable on demand baseline. This pattern lets services pay only for the capacity they actually need, hour by hour.
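As a minimal sketch of this decision logic, the hypothetical `desired_replicas` function below combines p95 latency and queue depth with separate scale out and scale in thresholds plus a stabilization window. The threshold values mirror the numbers above; the function name, step sizes, and replica bounds are illustrative assumptions, not any specific autoscaler's API.

```python
import math
from dataclasses import dataclass

@dataclass
class ScalingPolicy:
    # Threshold values mirror the example numbers in the text above.
    scale_out_p95_ms: float = 100.0   # scale out aggressively above this latency
    scale_in_p95_ms: float = 50.0     # scale in only when comfortably below this
    scale_in_stable_s: float = 600.0  # p95 must stay low for 10 minutes first
    max_queue_depth: int = 1_000      # pending tasks that trigger a scale out
    min_replicas: int = 30
    max_replicas: int = 200

def desired_replicas(policy: ScalingPolicy,
                     current_replicas: int,
                     p95_latency_ms: float,
                     queue_depth: int,
                     seconds_below_scale_in: float) -> int:
    """Return the replica count for one autoscaler evaluation tick."""
    # Scale out on either signal: a latency breach or a deep queue.
    if p95_latency_ms > policy.scale_out_p95_ms or queue_depth > policy.max_queue_depth:
        target = math.ceil(current_replicas * 1.5)    # aggressive step up
    # Scale in only after the stabilization window has fully elapsed.
    elif (p95_latency_ms < policy.scale_in_p95_ms
          and seconds_below_scale_in >= policy.scale_in_stable_s):
        target = math.floor(current_replicas * 0.9)   # conservative step down
    else:
        target = current_replicas                     # hold, which prevents flapping
    return max(policy.min_replicas, min(policy.max_replicas, target))
```

The asymmetric step sizes (1.5x out, 0.9x in) are one way to encode "scale out aggressively, scale in conservatively"; a real controller would also smooth the input metrics and respect cluster level capacity limits.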
💡 Key Takeaways
Two layer scaling: horizontal for replicas in tens of seconds, cluster for nodes in a few minutes, coordinated to prevent oscillation
Use workload specific signals like p95 latency, request rate, and queue depth instead of CPU, which often misses true bottlenecks like memory or IOPS
Separate scale out and scale in thresholds with stabilization windows to prevent flapping, for example scale out at 100 milliseconds latency but scale in only after 10 minutes below 50 milliseconds
Baseline plus burst pattern: 30 percent on demand for stable load, 70 percent Spot for peaks, terminating Spot first when scaling in (see the sketch after this list)
Airbnb pricing service scales 30 to 200 replicas automatically based on booking patterns, maintaining sub 50 millisecond p95 latency
Multi metric autoscaling prevents overprovisioning by considering concurrency, throughput, and latency together instead of a single dimension
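The baseline plus burst takeaway can be made concrete with a small capacity planner. This is a sketch assuming a fixed 30 replica on demand floor, as in the online inference example below; `plan_capacity` and `CapacityPlan` are hypothetical names, not part of any cloud provider's API.

```python
from dataclasses import dataclass

@dataclass
class CapacityPlan:
    on_demand: int   # stable capacity, never reclaimed by the provider
    spot: int        # cheap burst capacity, first to go when scaling in

def plan_capacity(desired_replicas: int, on_demand_baseline: int = 30) -> CapacityPlan:
    """Split a desired replica count into an on demand floor plus Spot burst.

    Because Spot only ever covers the capacity above the baseline, scaling in
    shrinks the Spot pool first and never touches the on demand floor.
    """
    on_demand = min(desired_replicas, on_demand_baseline)
    spot = max(0, desired_replicas - on_demand_baseline)
    return CapacityPlan(on_demand=on_demand, spot=spot)

# plan_capacity(200) -> CapacityPlan(on_demand=30, spot=170)  (booking peak)
# plan_capacity(30)  -> CapacityPlan(on_demand=30, spot=0)    (overnight baseline)
```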
📌 Examples
Recommendation service: Scale out when concurrent requests per replica exceed 50 and p95 latency crosses 100ms, scale in when both drop below 20 requests and 50ms for 10 minutes
Batch feature computation: Scale workers based on queue depth, targeting 1,000 pending tasks per worker, with a 5 minute stabilization window to avoid churn during bursty job submissions (see the sketch after these examples)
Online inference: Maintain 30 replicas on demand (handles 50K requests/second baseline), scale to 100 replicas with Spot during daily peak (150K requests/second), saving 60% on peak capacity
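For the batch feature computation example, the queue depth target translates into a simple proportional controller. The sketch below assumes a hypothetical `QueueAutoscaler` class polled by a worker manager; the 1,000 tasks per worker target and 5 minute stabilization window come from the example above, while the worker bounds are illustrative.

```python
import math
import time

class QueueAutoscaler:
    """Scale batch workers in proportion to queue depth, holding changes
    during a stabilization window so bursty job submissions don't cause churn."""

    def __init__(self, target_tasks_per_worker: int = 1_000,
                 stabilization_s: float = 300.0,
                 min_workers: int = 1, max_workers: int = 500):
        self.target_tasks_per_worker = target_tasks_per_worker
        self.stabilization_s = stabilization_s
        self.min_workers = min_workers
        self.max_workers = max_workers
        self._last_change = float("-inf")

    def desired_workers(self, queue_depth: int, current_workers: int,
                        now: float | None = None) -> int:
        now = time.monotonic() if now is None else now
        # Proportional target: roughly 1,000 pending tasks per worker.
        target = math.ceil(queue_depth / self.target_tasks_per_worker)
        target = max(self.min_workers, min(self.max_workers, target))
        # Only act once the stabilization window since the last change has passed.
        if target != current_workers and now - self._last_change >= self.stabilization_s:
            self._last_change = now
            return target
        return current_workers
```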