ML Infrastructure & MLOps • Cost Optimization (Spot Instances, Autoscaling)Hard⏱️ ~2 min
Failure Modes: Capacity Crunches, Interruption Storms, and Cost Spikes
Even well-designed Spot systems hit failure modes in production. Regional capacity shortages can cause cascading fallback, where workloads automatically shift to on-demand capacity and daily costs spike from $10,000 to $80,000 within hours. Interruption storms happen when a homogeneous fleet loses most of its capacity simultaneously. Autoscaler feedback loops create oscillations: nodes arrive too slowly, triggering more scale-out requests that overshoot and then collapse.
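To make the scale of that spike concrete, here is a back-of-envelope sketch of the fallback arithmetic. The fleet size and per-node prices are illustrative assumptions, not quoted rates, chosen so an all-Spot fleet costs roughly $10,000 per day and the same fleet on on-demand costs roughly $80,000.

```python
# Back-of-envelope sketch of a cascading-fallback cost spike.
# Fleet size and hourly prices are illustrative assumptions.
FLEET_SIZE = 833                     # nodes kept busy around the clock
SPOT_PRICE_PER_NODE_HR = 0.50        # assumed average Spot price (USD)
ON_DEMAND_PRICE_PER_NODE_HR = 4.00   # assumed on-demand price, ~8x Spot (USD)

spot_daily = FLEET_SIZE * SPOT_PRICE_PER_NODE_HR * 24
on_demand_daily = FLEET_SIZE * ON_DEMAND_PRICE_PER_NODE_HR * 24

print(f"All-Spot daily cost:      ${spot_daily:,.0f}")       # ~$10,000
print(f"All-on-demand daily cost: ${on_demand_daily:,.0f}")  # ~$80,000
```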
Capacity crunch mitigation requires workload prioritization and budget guardrails. Classify workloads into critical, normal, and best-effort tiers. When provisioning times exceed a threshold such as 5 minutes, pause best-effort jobs instead of falling back to expensive on-demand capacity. Set budget alerts at 85 percent of the monthly target and hard stops at 100 percent for non-critical workloads to contain cost explosions. During a 2023 Amazon Web Services (AWS) capacity event in us-east-1, organizations without these guardrails saw compute bills triple as thousands of Spot requests fell back to on-demand simultaneously.
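A minimal sketch of such a guardrail, assuming hypothetical tier names, thresholds, and spend figures; a real implementation would read provisioning wait times from the scheduler and spend from the billing API.

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    CRITICAL = 1
    NORMAL = 2
    BEST_EFFORT = 3

@dataclass
class GuardrailConfig:
    provisioning_timeout_s: int = 300   # pause best-effort work after 5 minutes of waiting
    alert_fraction: float = 0.85        # alert at 85% of the monthly budget
    hard_stop_fraction: float = 1.00    # hard stop at 100% for non-critical workloads

def decide(tier: Tier, provisioning_wait_s: float,
           month_spend: float, month_budget: float,
           cfg: GuardrailConfig | None = None) -> str:
    """Return the action for a pending workload under capacity and budget pressure."""
    cfg = cfg or GuardrailConfig()
    spend_fraction = month_spend / month_budget

    # Hard stop: non-critical workloads halt once the monthly budget is exhausted.
    if spend_fraction >= cfg.hard_stop_fraction and tier is not Tier.CRITICAL:
        return "pause: monthly budget exhausted"

    # Capacity crunch: Spot capacity is arriving too slowly.
    if provisioning_wait_s > cfg.provisioning_timeout_s:
        if tier is Tier.BEST_EFFORT:
            return "pause: wait for Spot capacity"
        if tier is Tier.CRITICAL:
            return "fallback: on-demand"
        return "queue: retry Spot in another pool"

    if spend_fraction >= cfg.alert_fraction:
        return "proceed: emit budget alert"
    return "proceed"

# Example: a best-effort job waiting 7 minutes during a crunch gets paused,
# not silently promoted to on-demand.
print(decide(Tier.BEST_EFFORT, 420, month_spend=70_000, month_budget=100_000))
```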
Interruption storms require the aggressive diversification covered earlier, but also blast-radius limits. Cap any single Spot pool at 20 to 30 percent of total fleet capacity so that one pool event cannot take down the majority of your system. For autoscaler stability, add stabilization windows of 5 to 10 minutes for scale-in decisions, limit the maximum scale step to 50 percent of current capacity per cycle, and use provisioning timeouts so failed node launches do not block the queue indefinitely. Monitor interruption rates by pool and allocation strategy, and if a pool consistently hits 15 to 20 percent interruption rates, remove it from rotation even if its price is attractive.
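A minimal sketch of both guardrails, assuming hypothetical per-pool metrics and the thresholds above: flag pools that breach the blast-radius cap or sustain high interruption rates, and dampen autoscaler steps with a stabilization window and a per-cycle step cap.

```python
from dataclasses import dataclass

@dataclass
class PoolStats:
    name: str
    capacity_share: float      # fraction of total fleet capacity in this pool
    interruption_rate: float   # fraction of instances interrupted over the window

MAX_POOL_SHARE = 0.30          # blast-radius cap: no pool above 30% of the fleet
MAX_INTERRUPTION_RATE = 0.15   # remove pools sustaining >15% interruptions

def review_pools(pools: list[PoolStats]) -> dict[str, str]:
    """Flag pools that violate blast-radius or interruption-rate limits."""
    actions = {}
    for p in pools:
        if p.interruption_rate > MAX_INTERRUPTION_RATE:
            actions[p.name] = "remove from rotation"
        elif p.capacity_share > MAX_POOL_SHARE:
            actions[p.name] = "rebalance: shift capacity to other pools"
        else:
            actions[p.name] = "ok"
    return actions

def dampen_scaling(current_nodes: int, desired_nodes: int,
                   seconds_since_last_change: float,
                   stabilization_window_s: float = 600,
                   max_step_fraction: float = 0.5) -> int:
    """Cap each scaling step at a fraction of current capacity and hold
    scale-in decisions until the stabilization window has passed."""
    if desired_nodes < current_nodes and seconds_since_last_change < stabilization_window_s:
        return current_nodes  # too soon to scale in again
    max_step = max(1, int(current_nodes * max_step_fraction))
    if desired_nodes > current_nodes:
        return min(desired_nodes, current_nodes + max_step)
    return max(desired_nodes, current_nodes - max_step)
```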
💡 Key Takeaways
• Capacity crunches cause cascading fallback to on-demand, spiking daily costs from $10,000 to $80,000 in hours if workloads lack prioritization and budget guardrails
• Classify workloads as critical, normal, or best-effort, pausing lower-priority jobs during capacity shortages instead of paying for expensive on-demand capacity
• Set budget alerts at 85 percent of the monthly target and hard stops at 100 percent for non-critical workloads to contain cost explosions during regional capacity events
• Interruption storms hit homogeneous fleets when a single pool event terminates 80 percent of capacity; mitigate by capping each pool at 20 to 30 percent of the total fleet
• Autoscaler feedback loops cause oscillations when nodes provision slowly; add stabilization windows of 5 to 10 minutes and limit each scale step to 50 percent per cycle
• Monitor per-pool interruption rates and remove pools exceeding 15 to 20 percent sustained interruption even if the price is attractive, as instability costs more than the savings
📌 Examples
2023 AWS us-east-1 capacity crunch: Organizations without budget guardrails saw compute bills triple as thousands of Spot requests fell back to on-demand simultaneously
ML training fleet: A single c5.4xlarge pool represented 80% of capacity; one interruption event terminated 800 of 1,000 nodes, halting all training for 45 minutes until diversification was deployed
Data pipeline: Autoscaler oscillated between 100 and 500 nodes every 10 minutes because node startup took 8 minutes while scale decisions happened every 2 minutes; stabilized with a 10-minute scale-in window