ML Infrastructure & MLOps · Cost Optimization (Spot Instances, Autoscaling) · Hard · ⏱️ ~3 min

Production Pattern: On-Demand Baseline Plus Spot Burst Capacity

Running 100 percent Spot is risky for latency-sensitive services. A better production pattern uses a stable on-demand baseline for typical load, with Spot instances providing burst capacity during peaks. This balances cost savings against Service Level Objective (SLO) protection.

Size the on-demand baseline to handle normal traffic with margin for small spikes. For a service averaging 50,000 requests per second with a 100,000 requests per second peak, provision 30 on-demand replicas to serve 60,000 requests per second comfortably, about 20 percent above average. When load climbs toward peak, autoscaling adds Spot replicas to handle the remaining 40,000 requests per second. During the daily trough at 30,000 requests per second, only the on-demand baseline runs. The result is that roughly 70 percent of capacity hours come from Spot, which costs about 80 percent less, delivering roughly a 56 percent total cost reduction (0.7 × 0.8), while the on-demand baseline ensures the service never drops below minimum capacity.

Implementation details matter. Use capacity rebalancing to launch replacement Spot instances early when pools show elevated interruption risk, so new capacity arrives before old instances terminate and total capacity stays stable. Apply connection draining with a 90-second timeout so in-flight requests finish gracefully before instances terminate. Pre-warm caches and connections on new Spot replicas before adding them to the load balancer pool to avoid cold-start latency spikes.

Meta and Netflix use variants of this pattern extensively, with on-demand or reserved capacity for control planes and SLO-critical paths, and large Spot fleets for batch processing, encoding, and bursty serving traffic.
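To make the sizing arithmetic concrete, here is a minimal Python sketch of the capacity and cost math described above. The traffic figures, headroom, and Spot discount come from the worked example; the per-replica throughput and the helper names are illustrative assumptions.

```python
# Capacity-planning sketch for the on-demand baseline + Spot burst pattern.
# Numbers mirror the worked example above; function names are illustrative.
import math

AVG_RPS = 50_000            # average traffic
PEAK_RPS = 100_000          # evening peak
REPLICA_RPS = 2_000         # load one replica serves comfortably (assumed)
BASELINE_HEADROOM = 1.2     # baseline sized ~20% above average
SPOT_DISCOUNT = 0.80        # Spot priced ~80% below on-demand
SPOT_CAPACITY_SHARE = 0.70  # fraction of capacity-hours served from Spot


def baseline_replicas(avg_rps: int) -> int:
    """On-demand replicas sized ~20% above average load."""
    return math.ceil(avg_rps * BASELINE_HEADROOM / REPLICA_RPS)


def blended_savings(spot_share: float, spot_discount: float) -> float:
    """Cost reduction versus running 100% of capacity on-demand."""
    return spot_share * spot_discount


if __name__ == "__main__":
    base = baseline_replicas(AVG_RPS)                              # -> 30
    savings = blended_savings(SPOT_CAPACITY_SHARE, SPOT_DISCOUNT)  # -> 0.56
    print(f"on-demand baseline: {base} replicas "
          f"({base * REPLICA_RPS:,} rps of {PEAK_RPS:,} rps peak)")
    print(f"estimated blended cost reduction: {savings:.0%}")
```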
💡 Key Takeaways
Size the on-demand baseline at about 20 percent above average load to handle normal traffic plus small spikes, ensuring minimum capacity even if Spot is unavailable
The burst layer scales from zero to multiple times the baseline during peaks, with about 70 percent of total capacity hours coming from cheaper Spot instances
For a service with 50,000 requests per second average and 100,000 requests per second peak, 30 on-demand replicas plus 70 Spot replicas at peak deliver roughly a 56 percent cost reduction
Capacity rebalancing launches replacement Spot instances early, before termination, maintaining total capacity and preventing SLO violations during interruptions (see the configuration sketch after this list)
Pre-warm caches and connections on new Spot replicas before routing traffic to avoid cold-start p95 latency spikes of 500 milliseconds or more
Meta and Netflix use on-demand or reserved capacity for control planes and SLO-critical components and Spot for batch and bursty workloads, isolating interruption risk
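The baseline-plus-burst split and capacity rebalancing can be expressed directly in autoscaling configuration. Below is a sketch using AWS EC2 Auto Scaling via boto3, one common way to implement the pattern; the group name, launch template ID, subnets, and instance types are placeholders, and equivalent settings exist in other clouds and in Kubernetes-based autoscalers.

```python
# Sketch: on-demand baseline + Spot burst with capacity rebalancing, as an
# AWS EC2 Auto Scaling group. Resource identifiers are placeholders.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="recsys-serving",        # placeholder name
    MinSize=30,                                   # never below the baseline
    MaxSize=100,                                  # headroom for peak traffic
    DesiredCapacity=30,
    VPCZoneIdentifier="subnet-aaa,subnet-bbb",    # placeholder subnets
    # Launch replacement Spot capacity early when interruption risk rises,
    # so new instances are in service before old ones terminate.
    CapacityRebalance=True,
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateId": "lt-0123456789abcdef0",  # placeholder
                "Version": "$Latest",
            },
            # Diversify instance types so Spot pools are deeper.
            "Overrides": [
                {"InstanceType": "c6i.2xlarge"},
                {"InstanceType": "c6a.2xlarge"},
                {"InstanceType": "c5.2xlarge"},
            ],
        },
        "InstancesDistribution": {
            # The first 30 instances are always on-demand (the SLO baseline);
            # everything scaled out beyond that comes from Spot.
            "OnDemandBaseCapacity": 30,
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    },
)
```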
📌 Examples
Recommendation service: 30 on-demand replicas handle the 60K req/sec baseline, autoscale to 100 replicas with Spot during the evening peak at 100K req/sec, and scale down to 30 overnight at 30K req/sec
Feature serving: On-demand baseline sized for p95 latency of 50 ms at average load, a Spot burst layer adds capacity when the request rate climbs, and connection draining ensures graceful termination within 90 seconds (see the lifecycle sketch after these examples)
Search ranking: Reserved instances for index serving (latency critical), Spot for index building and model training (batch, interruptible), separating concerns to protect user-facing SLOs
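To show how the pre-warming and connection-draining steps fit together on a single replica, here is a minimal asyncio sketch of the lifecycle: warm up before joining the load balancer pool, serve traffic, and on SIGTERM stop accepting work and wait up to 90 seconds for in-flight requests. The request handling and load balancer registration are simulated stand-ins, not a specific framework's API.

```python
# Sketch: Spot replica lifecycle — pre-warm before taking traffic, then drain
# in-flight requests within a 90-second budget when termination begins.
import asyncio
import signal

DRAIN_TIMEOUT_S = 90   # matches the connection-draining window above
inflight = set()       # tasks for requests currently being served


async def prewarm() -> None:
    """Stand-in for warming local caches and downstream connection pools."""
    await asyncio.sleep(0.1)


async def handle_request(i: int) -> None:
    """Stand-in for serving one request."""
    await asyncio.sleep(0.05)


async def drain(stop: asyncio.Event, drained: asyncio.Event) -> None:
    """Stop taking traffic, then wait (up to the budget) for in-flight work."""
    print("SIGTERM: deregistering from load balancer, draining connections")
    stop.set()  # the real LB stops sending new connections at this point
    pending = [t for t in inflight if not t.done()]
    if pending:
        await asyncio.wait(pending, timeout=DRAIN_TIMEOUT_S)
    drained.set()


async def main() -> None:
    stop, drained = asyncio.Event(), asyncio.Event()

    # 1. Pre-warm before the replica is added to the load balancer pool.
    await prewarm()
    print("pre-warm done, registering with load balancer")

    # 2. Begin draining when the instance receives a termination signal.
    loop = asyncio.get_running_loop()
    loop.add_signal_handler(
        signal.SIGTERM, lambda: asyncio.ensure_future(drain(stop, drained))
    )

    # 3. Serve until asked to stop (simulated request stream), tracking
    #    every in-flight task so draining can wait for it.
    i = 0
    while not stop.is_set():
        task = asyncio.ensure_future(handle_request(i))
        inflight.add(task)
        task.add_done_callback(inflight.discard)
        i += 1
        await asyncio.sleep(0.01)

    await drained.wait()
    print("drain complete, safe to terminate")


if __name__ == "__main__":
    asyncio.run(main())
```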