
Spot Fleet Diversification: Reducing Correlated Interruptions

When all your Spot capacity runs on a single instance type in one availability zone, a capacity reclaim event can wipe out 80 percent of your fleet in minutes. This correlated failure is the biggest operational risk with Spot Instances. The solution is aggressive diversification across many instance types and availability zones: production systems manage 20 to 30 different instance types simultaneously. Instead of requesting only c5.4xlarge instances, you create a fleet that includes c5.2xlarge, c5.4xlarge, c5.9xlarge, m5.4xlarge, m5a.4xlarge, r5.4xlarge, and similar types across three availability zones, with each pool representing at most 20 to 30 percent of total capacity. When one pool faces high reclaim rates, the other pools continue running. ThousandEyes reported managing two dozen instance types across Spot and on-demand node groups to maximize availability during capacity crunches.

The tradeoff is operational complexity. Different instance types have different CPU, memory, and network characteristics, so your workload orchestrator must handle heterogeneous nodes and bin-pack containers efficiently across varying node sizes. Performance becomes less predictable when the same job might run on 4 cores with 16 gigabytes (GB) of memory or on 8 cores with 32 GB. Debugging gets harder when you cannot reproduce issues on the exact instance type. However, the stability gain is substantial: moving from a single pool to 10 diversified pools typically cuts interruption-related downtime from several hours per week to minutes.
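A minimal sketch of such a diversified request using the EC2 create_fleet API via boto3 follows. The launch template name, subnet IDs, region, and target capacity are placeholder assumptions; the six instance types mirror the list above, and crossing them with three subnets yields 18 capacity pools.

```python
"""Sketch of a diversified Spot fleet request. The launch template and
subnet IDs are hypothetical; adjust to your own environment."""
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Instance types drawn from several families so no single pool dominates.
INSTANCE_TYPES = [
    "c5.2xlarge", "c5.4xlarge", "c5.9xlarge",
    "m5.4xlarge", "m5a.4xlarge", "r5.4xlarge",
]
# Hypothetical subnets, one per availability zone (us-east-1a/b/c).
SUBNETS = ["subnet-aaa111", "subnet-bbb222", "subnet-ccc333"]

# One override per (instance type, subnet) pair: 6 types x 3 AZs = 18 pools.
overrides = [
    {"InstanceType": itype, "SubnetId": subnet}
    for itype in INSTANCE_TYPES
    for subnet in SUBNETS
]

response = ec2.create_fleet(
    Type="maintain",  # replace capacity automatically as instances are reclaimed
    SpotOptions={
        # Weighs both price and interruption likelihood when picking pools.
        "AllocationStrategy": "price-capacity-optimized",
    },
    LaunchTemplateConfigs=[
        {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "ml-training",  # hypothetical template
                "Version": "$Latest",
            },
            "Overrides": overrides,
        }
    ],
    TargetCapacitySpecification={
        "TotalTargetCapacity": 100,  # illustrative fleet size
        "DefaultTargetCapacityType": "spot",
    },
)
print(response["FleetId"])
```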
💡 Key Takeaways
Homogeneous fleets create correlated risk where a single capacity event can terminate 80 percent of instances, causing hours of disruption
Production systems diversify across 20 to 30 instance types and 3 availability zones, capping each pool at 20 to 30 percent of total capacity (a monitoring sketch for this cap follows this list)
Diversification reduces interruption rates from around 20 percent to around 3 percent, with only about 1 percent higher unit cost versus lowest price pools
The tradeoff is increased bin-packing complexity, performance variability across heterogeneous hardware, and harder debugging when issues are instance-type specific
ThousandEyes manages two dozen instance types simultaneously to maintain availability during regional Spot capacity shortages
Prefer the price-capacity-optimized or capacity-optimized-prioritized allocation strategies, which balance cost and interruption risk automatically
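To keep the 20 to 30 percent cap honest in practice, a simple concentration check can group running instances by (instance type, availability zone) pool and flag any pool above a threshold. This is an illustrative sketch, not a prescribed method: it assumes instances carry a hypothetical `fleet` tag, and the 30 percent cap is taken from the guidance above.

```python
"""Sketch of a pool-concentration check; the tag key and cap are illustrative."""
from collections import Counter

import boto3

MAX_POOL_SHARE = 0.30  # no (instance type, AZ) pool should exceed 30% of the fleet

ec2 = boto3.client("ec2", region_name="us-east-1")


def pool_shares(fleet_tag_value: str) -> dict:
    """Return each (instance_type, availability_zone) pool's share of running instances."""
    counts = Counter()
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[
            {"Name": "tag:fleet", "Values": [fleet_tag_value]},  # hypothetical tag
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                pool = (instance["InstanceType"], instance["Placement"]["AvailabilityZone"])
                counts[pool] += 1
    total = sum(counts.values()) or 1
    return {pool: n / total for pool, n in counts.items()}


for pool, share in pool_shares("ml-training").items():
    if share > MAX_POOL_SHARE:
        print(f"WARNING: pool {pool} holds {share:.0%} of capacity")
```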
📌 Examples
ML training fleet: Mix c5.4xlarge, c5.9xlarge, m5.4xlarge, m5a.4xlarge across us-east-1a, us-east-1b, and us-east-1c, with each pool at 10 to 15 percent of a 1,000-node fleet
Data processing pipeline: Run Spark executors on 15 different instance families, with orchestrator automatically placing tasks based on available capacity, reducing pipeline failures from 8 per week to less than 1
Feature computation: Workers accept any instance with at least 16 GB of memory and 4 cores, allowing the scheduler to fill capacity from 25 different Spot pools and maintain 97 percent uptime (a selection sketch follows this list)
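For the last example, the candidate pool list can be derived from minimum resource requirements rather than hand-picked. The sketch below assumes the same 16 GB / 4 vCPU thresholds, restricts itself to current-generation x86_64 types, and uses the standard describe_instance_types API to build a list that can feed the fleet overrides shown earlier.

```python
"""Sketch: build a candidate Spot pool list from minimum CPU/memory requirements.
Thresholds and the current-generation/x86_64 restriction are illustrative."""
import boto3

MIN_VCPUS = 4
MIN_MEMORY_MIB = 16 * 1024  # 16 GB

ec2 = boto3.client("ec2", region_name="us-east-1")


def eligible_instance_types() -> list:
    """Return current-generation x86_64 instance types meeting the minimums."""
    candidates = []
    paginator = ec2.get_paginator("describe_instance_types")
    pages = paginator.paginate(
        Filters=[
            {"Name": "current-generation", "Values": ["true"]},
            {"Name": "processor-info.supported-architecture", "Values": ["x86_64"]},
        ]
    )
    for page in pages:
        for itype in page["InstanceTypes"]:
            vcpus = itype["VCpuInfo"]["DefaultVCpus"]
            memory_mib = itype["MemoryInfo"]["SizeInMiB"]
            if vcpus >= MIN_VCPUS and memory_mib >= MIN_MEMORY_MIB:
                candidates.append(itype["InstanceType"])
    return sorted(candidates)


# Feed a subset of these into the fleet overrides sketched above.
print(eligible_instance_types()[:25])
```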