What Are Spot Instances and Why Use Them for ML Workloads?
Spot Instances are spare cloud computing capacity sold at steep discounts, typically 70 to 90 percent cheaper than on-demand pricing. The tradeoff is that cloud providers can reclaim this capacity on short notice, usually around 2 minutes, when they need it back for on-demand customers. This makes Spot ideal for workloads that can tolerate interruptions.
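Each provider surfaces the reclamation warning differently; on AWS, for example, a pending Spot interruption appears at a well-known instance metadata path. Below is a minimal Python sketch of a watcher that polls that path and hands off to a checkpoint routine; the `save_checkpoint` call is a placeholder, and a production version would also handle IMDSv2 session tokens.

```python
import time
import requests

# AWS publishes a pending Spot interruption at this instance-metadata path.
# The endpoint returns 404 until an interruption is actually scheduled.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def wait_for_interruption(poll_seconds: int = 5) -> dict:
    """Block until the provider schedules a reclaim, then return its details."""
    while True:
        try:
            resp = requests.get(SPOT_ACTION_URL, timeout=2)
            if resp.status_code == 200:
                return resp.json()  # e.g. {"action": "terminate", "time": "..."}
        except requests.RequestException:
            pass  # transient metadata-service hiccup; keep polling
        time.sleep(poll_seconds)

if __name__ == "__main__":
    notice = wait_for_interruption()
    print(f"Interruption scheduled: {notice}")
    # save_checkpoint()  # placeholder: flush model/optimizer state before shutdown
```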
ML training and data processing are perfect candidates. A training job running on 100 GPU instances at $3 per hour on-demand costs $300 per hour. With Spot pricing at $0.90 per hour, the same job costs $90 per hour, saving $210 every hour, or about $5,000 over a 24-hour training run. The key is designing jobs to checkpoint progress every few minutes, so when an interruption happens you resume from the last checkpoint rather than starting over.
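As an illustration of that checkpoint-and-resume pattern, here is a hedged PyTorch-style sketch; the model, optimizer, data loader, checkpoint path, and interval are all stand-ins rather than anything prescribed by a particular framework.

```python
import os
import torch

CKPT_PATH = "/mnt/shared/ckpt.pt"  # durable storage that outlives any single instance
CKPT_EVERY_STEPS = 500             # roughly "every few minutes" for this workload

def train(model, optimizer, data_loader, total_steps):
    # Resume from the last checkpoint if a previous instance was interrupted.
    start_step = 0
    if os.path.exists(CKPT_PATH):
        state = torch.load(CKPT_PATH)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start_step = state["step"] + 1

    data_iter = iter(data_loader)  # sketch: a real job would also checkpoint the data position
    for step in range(start_step, total_steps):
        try:
            batch, target = next(data_iter)
        except StopIteration:      # restart the loader when an epoch ends
            data_iter = iter(data_loader)
            batch, target = next(data_iter)

        loss = torch.nn.functional.cross_entropy(model(batch), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Periodic checkpoint: an interruption now loses at most CKPT_EVERY_STEPS of work.
        if step % CKPT_EVERY_STEPS == 0:
            torch.save(
                {"model": model.state_dict(),
                 "optimizer": optimizer.state_dict(),
                 "step": step},
                CKPT_PATH,
            )
```

Writing checkpoints to shared or object storage rather than local disk is what lets a replacement instance pick up where the reclaimed one left off.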
Production systems combine Spot with smart allocation strategies. A price-and-capacity-optimized strategy balances cost and availability by diversifying across many instance types and availability zones. This approach costs only about 1 percent more than the absolute lowest-price pools, but reduces interruption rates from around 20 percent down to about 3 percent. The math is compelling: if your training job checkpoints every 5 minutes, an interruption costs at most 5 minutes of recomputation, and with only about 3 percent of instances interrupted the expected loss works out to mere seconds of work per instance, negligible compared to the 80 percent cost savings.
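On AWS, that diversification is expressed directly in the fleet request. The boto3 sketch below asks EC2 Fleet for 100 Spot instances using the price-capacity-optimized allocation strategy; the launch template ID, instance types, and subnet IDs are placeholders for whatever actually exists in your account.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Request 100 Spot instances, letting EC2 choose among several instance types
# and subnets (availability zones) using the price-capacity-optimized strategy.
response = ec2.create_fleet(
    Type="maintain",  # keep replacing capacity that gets reclaimed
    TargetCapacitySpecification={
        "TotalTargetCapacity": 100,
        "DefaultTargetCapacityType": "spot",
    },
    SpotOptions={
        "AllocationStrategy": "price-capacity-optimized",
    },
    LaunchTemplateConfigs=[
        {
            "LaunchTemplateSpecification": {
                "LaunchTemplateId": "lt-0123456789abcdef0",  # placeholder
                "Version": "$Latest",
            },
            # Diversify across GPU instance types and subnets/AZs so one capacity
            # pool drying up doesn't interrupt the whole job.
            "Overrides": [
                {"InstanceType": "g5.2xlarge", "SubnetId": "subnet-aaa111"},
                {"InstanceType": "g5.2xlarge", "SubnetId": "subnet-bbb222"},
                {"InstanceType": "g4dn.2xlarge", "SubnetId": "subnet-aaa111"},
                {"InstanceType": "g4dn.2xlarge", "SubnetId": "subnet-bbb222"},
            ],
        }
    ],
)
print(response["FleetId"])
```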
💡 Key Takeaways
•Spot Instances cost 70 to 90 percent less than on-demand, with $3 per hour GPU instances dropping to $0.90 per hour, saving thousands of dollars on multi-day training runs
•Cloud providers can interrupt Spot capacity with about 2 minutes of notice when they need to reclaim it for on-demand customers
•Price-and-capacity-optimized allocation reduces interruption rates from around 20 percent to about 3 percent by diversifying across instance types and zones, at only about 1 percent higher cost
•Workloads must checkpoint progress every few minutes so interruptions waste only seconds of work instead of hours
•Best for stateless batch jobs like ML training, feature computation, and data processing where individual tasks can be rescheduled
•Not suitable for latency critical online serving or stateful databases that cannot tolerate sudden capacity loss
📌 Examples
Netflix uses Spot Instances for video encoding workloads, saving millions annually on batch processing that can tolerate interruptions
Uber runs large data processing pipelines on Spot, with orchestrators automatically rescheduling tasks when nodes are reclaimed
ML training job: 100 GPU instances for 24 hours. On-demand cost: $7,200. Spot cost: $2,160. Savings: $5,040 per training run
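The arithmetic in that last example, reproduced as a quick sanity check (rates taken from the example above):

```python
instances, hours = 100, 24
on_demand_rate = 3.00  # $ per instance-hour
spot_rate = 0.90       # $ per instance-hour, roughly a 70 percent discount

on_demand_cost = instances * hours * on_demand_rate  # $7,200
spot_cost = instances * hours * spot_rate            # $2,160
print(f"Savings per run: ${on_demand_cost - spot_cost:,.0f}")  # Savings per run: $5,040
```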