
Failure Modes: Capacity Crunches, Interruption Storms, and Cost Spikes

Aggressive cost optimization can backfire through capacity crunches (no instances available), interruption storms (mass simultaneous terminations), and cost spikes (fallback to expensive on-demand instances). Each failure mode requires its own detection and mitigation strategy.

Capacity Crunches

Spot instances are not always available. During high-demand periods, your requests for spot capacity may go unfulfilled for hours or days. Symptoms: autoscaler requests instances, cloud provider returns "insufficient capacity," workloads queue indefinitely. This particularly affects GPU instances during ML hype cycles. Mitigation: maintain on-demand fallback, diversify across regions (not just zones), implement capacity reservation for critical workloads, and monitor spot availability trends to anticipate crunches.
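The on-demand fallback and regional diversification can be sketched as a simple acquisition loop. This is a minimal illustration, assuming a hypothetical `SPOT_AVAILABLE` lookup standing in for the cloud provider's capacity API (in practice this would be a spot fleet request or a capacity-availability call); the region names and function are illustrative, not a real SDK.

```python
from typing import Optional

# Hypothetical availability table standing in for the provider's capacity API.
# During a capacity crunch, entire regions can return "insufficient capacity".
SPOT_AVAILABLE = {"us-east-1": False, "us-west-2": True, "eu-west-1": True}


def acquire_instance(regions: list[str], allow_on_demand: bool = True) -> Optional[str]:
    """Try spot capacity in each region in priority order; fall back to on-demand."""
    for region in regions:
        if SPOT_AVAILABLE.get(region, False):
            return f"spot:{region}"
    if allow_on_demand:
        # On-demand fallback keeps critical workloads running during a crunch,
        # at higher cost (see cost spikes below).
        return f"on-demand:{regions[0]}"
    return None  # no capacity and no fallback: queue the workload and retry
```

Diversifying across regions (not just zones) widens the loop's search space, and `allow_on_demand=False` models non-critical workloads that should queue rather than pay on-demand prices.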

Interruption Storms

Correlated interruptions can terminate large portions of your fleet simultaneously. A single provider capacity event might reclaim hundreds of instances across your workloads. Symptoms: sudden drop in available capacity, multiple jobs failing simultaneously, cascading failures as remaining instances become overloaded. Mitigation: diversification (discussed earlier), circuit breakers that halt new work during storms, graceful degradation that prioritizes critical workloads, and post-storm recovery automation that relaunches capacity systematically.

Cost Spikes

Aggressive autoscaling combined with spot unavailability can cause cost spikes. Scenario: traffic increases, spot capacity is unavailable, the autoscaler launches expensive on-demand instances, and the bill triples. Another scenario: a misconfigured autoscaler launches too many instances, or fails to scale down after load decreases. Mitigation: hard caps on maximum instances, budget alerts with automatic shutdown, regular cost monitoring dashboards, and post-incident reviews for unexpected cost increases. A 10-minute misconfiguration can cost thousands of dollars on GPU instances.
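The two guardrails named above, a hard instance cap and a budget kill switch, can be combined into a single clamp applied to every scaling decision. A minimal sketch, assuming spend and budget figures come from an external cost-tracking source:

```python
def clamp_scale_request(desired: int, max_instances: int,
                        spend_today: float, daily_budget: float) -> int:
    """Clamp an autoscaler's desired instance count with a hard cap
    and a budget kill switch."""
    if spend_today >= daily_budget:
        # Budget exhausted: scale to zero and page the on-call engineer
        # rather than silently burning money on GPU instances.
        return 0
    return min(desired, max_instances)
```

Placing the clamp between the scaling policy and the provider API means even a wildly misconfigured policy (e.g. a desired count of 500) can never exceed the cap, and a runaway bill is bounded by one budget period.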

Monitoring Checklist: Track spot fulfillment rate, interruption frequency, cost per day/week with alerts on anomalies, and queue depth during capacity events.
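Two of the checklist metrics can be computed directly from launch records: the spot fulfillment rate, and an alert when its recent average dips below a threshold. The 0.8 threshold is an illustrative assumption; tune it to your workload's tolerance for queuing.

```python
def spot_fulfillment_rate(requested: int, fulfilled: int) -> float:
    """Fraction of spot requests the provider actually fulfilled."""
    if requested == 0:
        return 1.0  # no requests means nothing went unfulfilled
    return fulfilled / requested


def should_alert(recent_rates: list[float], threshold: float = 0.8) -> bool:
    """Alert when the average recent fulfillment rate drops below threshold,
    an early signal of an approaching capacity crunch."""
    if not recent_rates:
        return False
    return sum(recent_rates) / len(recent_rates) < threshold
```

A falling fulfillment-rate trend is often visible hours before requests start failing outright, which is what makes it useful for anticipating crunches rather than just reacting to them.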

💡 Key Takeaways
- Capacity crunches: spot unavailable for hours during high-demand periods
- Interruption storms: correlated terminations affect large fleet portions simultaneously
- Cost spikes: 10-minute misconfiguration can cost thousands on GPU instances
📌 Interview Tips
1. Circuit breakers halt new work during interruption storms
2. Budget alerts with automatic shutdown prevent runaway costs