
Spot Fleet Diversification: Reducing Correlated Interruptions

When all your Spot capacity runs on a single instance type in one availability zone, a capacity reclaim event can wipe out 80 percent of your fleet in minutes. This correlated failure is the biggest operational risk with Spot Instances. The solution is aggressive diversification across many instance types and availability zones: production systems manage 20 to 30 different instance types simultaneously. Instead of requesting only c5.4xlarge instances, you create a fleet that includes c5.2xlarge, c5.4xlarge, c5.9xlarge, m5.4xlarge, m5a.4xlarge, r5.4xlarge, and similar types across three availability zones, with each pool representing at most 20 to 30 percent of total capacity. When one pool faces high reclaim rates, the other pools continue running. ThousandEyes reported managing two dozen instance types across Spot and on-demand node groups to maximize availability during capacity crunches.

The tradeoff is operational complexity. Different instance types have different CPU, memory, and network characteristics, so your workload orchestrator must handle heterogeneous nodes and bin-pack containers efficiently across varying node sizes. Performance becomes less predictable when the same job might run on 4 cores with 16 gigabytes (GB) of memory or on 8 cores with 32 GB. Debugging gets harder when you cannot reproduce issues on the exact instance type. However, the stability gain is substantial: moving from a single pool to 10 diversified pools typically cuts interruption-related downtime from several hours per week to minutes.
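A minimal sketch of such a diversified request using the EC2 create_fleet API via boto3 follows. The launch template name, subnet IDs, region, and target capacity are placeholder assumptions; the six instance types mirror the list above, and crossing them with three subnets yields 18 capacity pools.

```python
"""Sketch of a diversified Spot fleet request. The launch template and
subnet IDs are hypothetical; adjust to your own environment."""
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Instance types drawn from several families so no single pool dominates.
INSTANCE_TYPES = [
    "c5.2xlarge", "c5.4xlarge", "c5.9xlarge",
    "m5.4xlarge", "m5a.4xlarge", "r5.4xlarge",
]
# Hypothetical subnets, one per availability zone (us-east-1a/b/c).
SUBNETS = ["subnet-aaa111", "subnet-bbb222", "subnet-ccc333"]

# One override per (instance type, subnet) pair: 6 types x 3 AZs = 18 pools.
overrides = [
    {"InstanceType": itype, "SubnetId": subnet}
    for itype in INSTANCE_TYPES
    for subnet in SUBNETS
]

response = ec2.create_fleet(
    Type="maintain",  # replace capacity automatically as instances are reclaimed
    SpotOptions={
        # Weighs both price and interruption likelihood when picking pools.
        "AllocationStrategy": "price-capacity-optimized",
    },
    LaunchTemplateConfigs=[
        {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "ml-training",  # hypothetical template
                "Version": "$Latest",
            },
            "Overrides": overrides,
        }
    ],
    TargetCapacitySpecification={
        "TotalTargetCapacity": 100,  # illustrative fleet size
        "DefaultTargetCapacityType": "spot",
    },
)
print(response["FleetId"])
```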
💡 Key Takeaways
Homogeneous fleets create correlated risk where a single capacity event can terminate 80 percent of instances, causing hours of disruption
Production systems diversify across 20 to 30 instance types and 3 availability zones, capping each pool at 20 to 30 percent of total capacity (a monitoring sketch for this cap follows this list)
Diversification reduces interruption rates from around 20 percent to around 3 percent, with only about 1 percent higher unit cost versus lowest price pools
The tradeoff is increased bin-packing complexity, performance variability across heterogeneous hardware, and harder debugging when issues are instance-type specific
ThousandEyes manages two dozen instance types simultaneously to maintain availability during regional Spot capacity shortages
Prefer the price-capacity-optimized or capacity-optimized-prioritized allocation strategies, which balance cost and interruption risk automatically
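To keep the 20 to 30 percent cap honest in practice, a simple concentration check can group running instances by (instance type, availability zone) pool and flag any pool above a threshold. This is an illustrative sketch, not a prescribed method: it assumes instances carry a hypothetical `fleet` tag, and the 30 percent cap is taken from the guidance above.

```python
"""Sketch of a pool-concentration check; the tag key and cap are illustrative."""
from collections import Counter

import boto3

MAX_POOL_SHARE = 0.30  # no (instance type, AZ) pool should exceed 30% of the fleet

ec2 = boto3.client("ec2", region_name="us-east-1")


def pool_shares(fleet_tag_value: str) -> dict:
    """Return each (instance_type, availability_zone) pool's share of running instances."""
    counts = Counter()
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[
            {"Name": "tag:fleet", "Values": [fleet_tag_value]},  # hypothetical tag
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                pool = (instance["InstanceType"], instance["Placement"]["AvailabilityZone"])
                counts[pool] += 1
    total = sum(counts.values()) or 1
    return {pool: n / total for pool, n in counts.items()}


for pool, share in pool_shares("ml-training").items():
    if share > MAX_POOL_SHARE:
        print(f"WARNING: pool {pool} holds {share:.0%} of capacity")
```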
📌 Examples
ML training fleet: Mix c5.4xlarge, c5.9xlarge, m5.4xlarge, m5a.4xlarge across us-east-1a, us-east-1b, and us-east-1c, with each pool at 10 to 15 percent of a 1,000-node fleet
Data processing pipeline: Run Spark executors on 15 different instance families, with orchestrator automatically placing tasks based on available capacity, reducing pipeline failures from 8 per week to less than 1
Feature computation: Workers accept any instance with at least 16 GB of memory and 4 cores, allowing the scheduler to fill capacity from 25 different Spot pools and maintain 97 percent uptime (a selection sketch follows this list)
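For the last example, the candidate pool list can be derived from minimum resource requirements rather than hand-picked. The sketch below assumes the same 16 GB / 4 vCPU thresholds, restricts itself to current-generation x86_64 types, and uses the standard describe_instance_types API to build a list that can feed the fleet overrides shown earlier.

```python
"""Sketch: build a candidate Spot pool list from minimum CPU/memory requirements.
Thresholds and the current-generation/x86_64 restriction are illustrative."""
import boto3

MIN_VCPUS = 4
MIN_MEMORY_MIB = 16 * 1024  # 16 GB

ec2 = boto3.client("ec2", region_name="us-east-1")


def eligible_instance_types() -> list:
    """Return current-generation x86_64 instance types meeting the minimums."""
    candidates = []
    paginator = ec2.get_paginator("describe_instance_types")
    pages = paginator.paginate(
        Filters=[
            {"Name": "current-generation", "Values": ["true"]},
            {"Name": "processor-info.supported-architecture", "Values": ["x86_64"]},
        ]
    )
    for page in pages:
        for itype in page["InstanceTypes"]:
            vcpus = itype["VCpuInfo"]["DefaultVCpus"]
            memory_mib = itype["MemoryInfo"]["SizeInMiB"]
            if vcpus >= MIN_VCPUS and memory_mib >= MIN_MEMORY_MIB:
                candidates.append(itype["InstanceType"])
    return sorted(candidates)


# Feed a subset of these into the fleet overrides sketched above.
print(eligible_instance_types()[:25])
```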