
Cold Start Storms and Model Thrashing: Detection and Mitigation

Cold start storms occur when bursty traffic simultaneously hits many cold models in an on-demand multi-model system, triggering a cascade of load/evict cycles that saturates input/output (I/O), spikes CPU for deserialization, and inflates p95/p99 latency by seconds. The symptom is a bimodal latency distribution that splits into two clusters: hot-model requests remain fast (for example, a p50 of 50ms), while cold requests spike to a multi-second p95 (2 to 10 seconds) during the storm. Model thrashing is the steady-state version: the same models are continuously evicted and reloaded because cache capacity is too small relative to the working set.

The root cause is a cache-capacity-versus-working-set mismatch combined with traffic patterns. If you have 500 models but cache capacity for only 50, and traffic is evenly distributed, every request has a 90% chance of hitting a cold model. Worse, if traffic arrives in bursts (for example, batch jobs hitting 100 different models simultaneously), the system tries to load many models in parallel, overwhelming object storage bandwidth (S3 throttles GET requests at roughly 5,500 per second per prefix) and local disk I/O. TorchServe or Triton logs show high concurrent load counts, artifact fetch timeouts, and Least Recently Used (LRU) eviction rates spiking to hundreds per minute.

Production mitigation starts with pre-warming and pinning. Pre-warm the top N models (by traffic volume) on startup or deploy, keep them resident, and mark them as non-evictable (see the cache sketch below). For example, if the top 50 models account for 80% of traffic, pinning them eliminates cold starts for most requests. For the remaining long tail, apply admission control that limits parallel loads to 1 to 2 per host: queue cold-load requests and serialize them, preventing I/O saturation at the cost of longer waits for tail models. SageMaker MME users commonly reserve 20 to 30% of RAM/VRAM as headroom for thrash absorption and configure CloudWatch alarms on eviction rate and cold-load latency.

Another technique is a size-aware cache eviction policy. Standard LRU evicts by recency alone, so a single large model (1 GB) can evict ten small models (100 MB each), worsening thrash. Size-aware LRU evicts based on cost per byte (access frequency divided by size), keeping frequently accessed small models resident longer.

Cohort sharding is the ultimate fix: partition the model fleet into size or traffic cohorts (hot tier, warm tier, cold tier) and deploy separate serving pools with tailored cache sizes and autoscaling policies. The hot tier pins critical models, the warm tier uses aggressive caching, and the cold tier accepts high latency for rarely used models.
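The pinning and size-aware eviction ideas can be combined in a small in-process cache. The sketch below is illustrative rather than TorchServe's or Triton's actual implementation: SizeAwareModelCache, load_fn, and prewarm are hypothetical names, and eviction uses the hits-per-byte heuristic described above.

```python
class SizeAwareModelCache:
    """In-process model cache with pinned entries and size-aware eviction.

    Eviction removes the unpinned model with the lowest hits/size score,
    so frequently used small models outlive rarely used large ones.
    """

    def __init__(self, capacity_bytes, load_fn):
        self.capacity_bytes = capacity_bytes
        self.load_fn = load_fn        # callable: model_id -> (model_object, size_bytes)
        self.entries = {}             # model_id -> {"model", "size", "hits"}
        self.pinned = set()
        self.used_bytes = 0

    def prewarm(self, model_ids):
        """Load the top-N models at startup and mark them non-evictable."""
        for model_id in model_ids:
            self.get(model_id)
            self.pinned.add(model_id)

    def get(self, model_id):
        entry = self.entries.get(model_id)
        if entry is not None:         # hot path: model already resident
            entry["hits"] += 1
            return entry["model"]
        model, size = self.load_fn(model_id)   # cold path: fetch + deserialize
        self._evict_until_fits(size)
        self.entries[model_id] = {"model": model, "size": size, "hits": 1}
        self.used_bytes += size
        return model

    def _evict_until_fits(self, incoming_size):
        while self.used_bytes + incoming_size > self.capacity_bytes:
            candidates = [(mid, e) for mid, e in self.entries.items()
                          if mid not in self.pinned]
            if not candidates:
                raise MemoryError("cache capacity exhausted by pinned models")
            # Evict the entry with the lowest access-frequency-per-byte score.
            victim_id, victim = min(candidates,
                                    key=lambda kv: kv[1]["hits"] / kv[1]["size"])
            self.used_bytes -= victim["size"]
            del self.entries[victim_id]
```

Typical wiring would be something like `cache = SizeAwareModelCache(50 * 2**30, load_fn=fetch_from_object_store)` followed by `cache.prewarm(top_50_model_ids)` at deploy time, where the capacity and top-N list are assumptions specific to your fleet.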
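Admission control can be layered on top: a per-host semaphore caps concurrent cold loads so a burst of cold requests queues instead of saturating object storage and disk. This is a minimal sketch assuming the cache above is shared across request threads; names and limits are illustrative.

```python
import threading

# Allow at most 2 concurrent cold loads on this host; further cold requests
# queue here instead of all hitting object storage and local disk at once.
COLD_LOAD_SLOTS = threading.Semaphore(2)

def get_with_admission_control(cache, model_id, timeout_s=30.0):
    """Serve from cache if resident; otherwise serialize the cold load."""
    entry = cache.entries.get(model_id)
    if entry is not None:
        entry["hits"] += 1
        return entry["model"]
    if not COLD_LOAD_SLOTS.acquire(timeout=timeout_s):
        # Shed load rather than pile more concurrent fetches onto storage.
        raise TimeoutError(f"cold-load queue timed out for model {model_id}")
    try:
        # NOTE: a production version would guard cache mutation with a lock
        # and deduplicate concurrent loads of the same model.
        return cache.get(model_id)
    finally:
        COLD_LOAD_SLOTS.release()
```

The trade-off is exactly the one described above: tail models wait longer (or time out) during a storm, but the host's I/O stays within budget for everything else.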
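On the monitoring side, SageMaker multi-model endpoints publish per-endpoint metrics such as ModelCacheHit in the AWS/SageMaker namespace, and a falling cache-hit fraction is a reasonable thrash proxy. The alarm below is a sketch: the endpoint name, SNS topic ARN, and thresholds are placeholders, and the metric names, dimensions, and units should be verified against the current SageMaker documentation for your account.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Page the on-call when fewer than 80% of invocations hit an already-loaded
# model for three consecutive 5-minute windows (a thrashing signal).
cloudwatch.put_metric_alarm(
    AlarmName="mme-model-cache-hit-low",
    Namespace="AWS/SageMaker",
    MetricName="ModelCacheHit",           # 1 if the invoked model was resident
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-mme-endpoint"},  # placeholder
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=0.8,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:serving-oncall"],  # placeholder
)
```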
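Cohort sharding mostly lives in routing configuration rather than on the serving host. A minimal routing sketch, assuming per-model traffic statistics (requests per day) are available; the tier endpoints and thresholds are placeholder values:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    endpoint: str        # serving pool that owns this cohort
    cache_policy: str

# Thresholds are requests/day per model; tune them from your own traffic data.
TIERS = [
    (10_000, Tier("hot",  "http://hot-pool.internal",  "pinned, pre-warmed")),
    (500,    Tier("warm", "http://warm-pool.internal", "size-aware LRU")),
    (0,      Tier("cold", "http://cold-pool.internal", "load on demand")),
]

def route(model_id: str, daily_requests: dict) -> Tier:
    """Pick the serving pool for a model based on its traffic cohort."""
    volume = daily_requests.get(model_id, 0)
    for threshold, tier in TIERS:
        if volume >= threshold:
            return tier
    return TIERS[-1][1]   # unreachable with a 0 threshold, kept as a guard
```

Because each pool scales and caches independently, the hot tier can stay small and fully pinned while the cold tier trades latency for cost, which is the economics behind the Stripe-style example below.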
💡 Key Takeaways
Cold start storms inflate p95/p99 by seconds (2 to 10s) when bursty traffic hits many cold models simultaneously, saturating IO and causing load/evict thrashing with eviction rates spiking to 150+ per minute
Root cause is cache capacity versus working set mismatch: 500 models with cache for 50 means 90% cold hit rate; even distribution makes every request likely cold
Pre-warming and pinning the top N models (for example, the top 50 accounting for 80% of traffic) eliminates most cold starts; mark them non-evictable to prevent thrashing
Admission control limits parallel loads to 1 to 2 per host, serializing cold loads to prevent IO saturation at the cost of longer tail latency for rare models
Size-aware LRU evicts by cost per byte (access frequency divided by size) instead of recency alone, preventing a single 1 GB model from evicting ten 100 MB models
📌 Examples
Amazon SageMaker MME customer serving 1200 product category models: pre-warmed top 100 models (covering 75% traffic) on deploy, reduced p99 from 12 seconds to 800ms, cut eviction rate from 200/min to 15/min
Netflix recommendation system during prime time surge: 300 user segment models requested simultaneously, triggered cold start storm with p95 spiking to 6 seconds; added admission control limiting 2 parallel loads per host, p95 dropped to 1.2 seconds
Stripe fraud detection with cohort sharding: 20 high volume merchant models in hot tier (pinned), 200 medium volume in warm tier (aggressive caching), 500 low volume in cold tier (accept 5s p99); reduced infrastructure cost 40% versus uniform deployment