What Are Spot Instances and Why Use Them for ML Workloads?
Spot Instances are spare cloud computing capacity sold at steep discounts, typically 70 to 90 percent cheaper than on-demand pricing. The tradeoff is that cloud providers can reclaim this capacity on short notice, usually around 2 minutes, when they need it back for on-demand customers. This makes Spot ideal for workloads that can tolerate interruptions.
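Each provider surfaces the reclamation warning differently; on AWS, for example, a pending Spot interruption appears at a well-known instance metadata path. Below is a minimal Python sketch of a watcher that polls that path and hands off to a checkpoint routine; the `save_checkpoint` call is a placeholder, and a production version would also handle IMDSv2 session tokens.

```python
import time
import requests

# AWS publishes a pending Spot interruption at this instance-metadata path.
# The endpoint returns 404 until an interruption is actually scheduled.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def wait_for_interruption(poll_seconds: int = 5) -> dict:
    """Block until the provider schedules a reclaim, then return its details."""
    while True:
        try:
            resp = requests.get(SPOT_ACTION_URL, timeout=2)
            if resp.status_code == 200:
                return resp.json()  # e.g. {"action": "terminate", "time": "..."}
        except requests.RequestException:
            pass  # transient metadata-service hiccup; keep polling
        time.sleep(poll_seconds)

if __name__ == "__main__":
    notice = wait_for_interruption()
    print(f"Interruption scheduled: {notice}")
    # save_checkpoint()  # placeholder: flush model/optimizer state before shutdown
```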
ML training and data processing are perfect candidates. A training job running on 100 GPU instances at $3 per hour on-demand costs $300 per hour. With Spot pricing at $0.90 per hour, the same job costs $90 per hour, saving $210 every hour, or about $5,000 over a 24-hour training run. The key is designing jobs to checkpoint progress every few minutes, so when an interruption happens you resume from the last checkpoint rather than starting over.
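As an illustration of that checkpoint-and-resume pattern, here is a hedged PyTorch-style sketch; the model, optimizer, data loader, checkpoint path, and interval are all stand-ins rather than anything prescribed by a particular framework.

```python
import os
import torch

CKPT_PATH = "/mnt/shared/ckpt.pt"  # durable storage that outlives any single instance
CKPT_EVERY_STEPS = 500             # roughly "every few minutes" for this workload

def train(model, optimizer, data_loader, total_steps):
    # Resume from the last checkpoint if a previous instance was interrupted.
    start_step = 0
    if os.path.exists(CKPT_PATH):
        state = torch.load(CKPT_PATH)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start_step = state["step"] + 1

    data_iter = iter(data_loader)  # sketch: a real job would also checkpoint the data position
    for step in range(start_step, total_steps):
        try:
            batch, target = next(data_iter)
        except StopIteration:      # restart the loader when an epoch ends
            data_iter = iter(data_loader)
            batch, target = next(data_iter)

        loss = torch.nn.functional.cross_entropy(model(batch), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Periodic checkpoint: an interruption now loses at most CKPT_EVERY_STEPS of work.
        if step % CKPT_EVERY_STEPS == 0:
            torch.save(
                {"model": model.state_dict(),
                 "optimizer": optimizer.state_dict(),
                 "step": step},
                CKPT_PATH,
            )
```

Writing checkpoints to shared or object storage rather than local disk is what lets a replacement instance pick up where the reclaimed one left off.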
Production systems combine Spot with smart allocation strategies. A price-and-capacity-optimized strategy balances cost and availability by diversifying across many instance types and availability zones. This approach costs only about 1 percent more than the absolute lowest-price pools, but reduces interruption rates from around 20 percent down to about 3 percent. The math is compelling: if your training job checkpoints every 5 minutes, an interruption costs at most 5 minutes of recomputation, and with only about 3 percent of instances interrupted the expected loss works out to mere seconds of work per instance, negligible compared to the 80 percent cost savings.
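On AWS, that diversification is expressed directly in the fleet request. The boto3 sketch below asks EC2 Fleet for 100 Spot instances using the price-capacity-optimized allocation strategy; the launch template ID, instance types, and subnet IDs are placeholders for whatever actually exists in your account.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Request 100 Spot instances, letting EC2 choose among several instance types
# and subnets (availability zones) using the price-capacity-optimized strategy.
response = ec2.create_fleet(
    Type="maintain",  # keep replacing capacity that gets reclaimed
    TargetCapacitySpecification={
        "TotalTargetCapacity": 100,
        "DefaultTargetCapacityType": "spot",
    },
    SpotOptions={
        "AllocationStrategy": "price-capacity-optimized",
    },
    LaunchTemplateConfigs=[
        {
            "LaunchTemplateSpecification": {
                "LaunchTemplateId": "lt-0123456789abcdef0",  # placeholder
                "Version": "$Latest",
            },
            # Diversify across GPU instance types and subnets/AZs so one capacity
            # pool drying up doesn't interrupt the whole job.
            "Overrides": [
                {"InstanceType": "g5.2xlarge", "SubnetId": "subnet-aaa111"},
                {"InstanceType": "g5.2xlarge", "SubnetId": "subnet-bbb222"},
                {"InstanceType": "g4dn.2xlarge", "SubnetId": "subnet-aaa111"},
                {"InstanceType": "g4dn.2xlarge", "SubnetId": "subnet-bbb222"},
            ],
        }
    ],
)
print(response["FleetId"])
```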
💡 Key Takeaways
•Spot Instances cost 70 to 90 percent less than on-demand, with $3 per hour GPU instances dropping to $0.90 per hour, saving thousands of dollars on multi-day training runs
•Cloud providers can interrupt Spot capacity with about 2 minutes of notice when they need to reclaim it for on-demand customers
•Price-and-capacity-optimized allocation reduces interruption rates from around 20 percent to about 3 percent by diversifying across instance types and zones, at only about 1 percent higher cost
•Workloads must checkpoint progress every few minutes so interruptions waste only seconds of work instead of hours
•Best for stateless batch jobs like ML training, feature computation, and data processing where individual tasks can be rescheduled
•Not suitable for latency critical online serving or stateful databases that cannot tolerate sudden capacity loss
📌 Examples
Netflix uses Spot Instances for video encoding workloads, saving millions annually on batch processing that can tolerate interruptions
Uber runs large data processing pipelines on Spot, with orchestrators automatically rescheduling tasks when nodes are reclaimed
ML training job: 100 GPU instances for 24 hours. On-demand cost: $7,200. Spot cost: $2,160. Savings: $5,040 per training run
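The arithmetic in that last example, reproduced as a quick sanity check (rates taken from the example above):

```python
instances, hours = 100, 24
on_demand_rate = 3.00  # $ per instance-hour
spot_rate = 0.90       # $ per instance-hour, roughly a 70 percent discount

on_demand_cost = instances * hours * on_demand_rate  # $7,200
spot_cost = instances * hours * spot_rate            # $2,160
print(f"Savings per run: ${on_demand_cost - spot_cost:,.0f}")  # Savings per run: $5,040
```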