
Cold Start Storms and Model Thrashing: Detection and Mitigation

What Cold Start Storms Look Like

Cold start storms occur when bursty traffic hits many cold models simultaneously in an on-demand serving system, triggering a cascade of load/evict cycles that saturates IO, spikes CPU during deserialization, and inflates p95/p99 latency by seconds. The symptom is a bimodal latency distribution: hot-model requests stay fast (p50 around 50ms), while cold requests spike to a multi-second p95 (2 to 10 seconds) during the storm. Model thrashing is the steady-state version of the same failure: the same models are continuously evicted and reloaded because cache capacity is too small relative to the working set.
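One way to surface this bimodality is to split request latencies by cache hit/miss and compare the two p95s. A minimal sketch, assuming a hypothetical request log of `(latency_ms, was_cache_hit)` tuples:

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of latencies (ms)."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(p / 100 * len(ordered)))
    return ordered[idx]

def detect_bimodal_latency(request_log, cold_ratio_threshold=10.0):
    """Flag a storm when cold-path p95 dwarfs hot-path p95.
    `request_log` is a list of (latency_ms, was_cache_hit) tuples
    -- a hypothetical schema, not a real serving-framework API."""
    hot = [ms for ms, hit in request_log if hit]
    cold = [ms for ms, hit in request_log if not hit]
    if not hot or not cold:
        return False
    return percentile(cold, 95) / percentile(hot, 95) >= cold_ratio_threshold

# Hot requests cluster near 50ms; cold requests near 4000ms.
log = [(50, True)] * 90 + [(4000, False)] * 10
print(detect_bimodal_latency(log))  # True: cold p95 is ~80x hot p95
```

Alerting on the cold/hot p95 ratio rather than a single global p95 keeps the signal visible even when hot traffic dominates the request mix.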

Root Cause

The root cause is a mismatch between cache capacity and working set, amplified by traffic patterns. If you have 500 models but cache capacity for only 50, and traffic is evenly distributed, every request has a 90% chance of hitting a cold model. Worse, if traffic arrives in bursts (for example, batch jobs hitting 100 different models simultaneously), the system tries to load many models in parallel, overwhelming object storage bandwidth (S3 throttles at roughly 5,500 GET requests per second per prefix) and local disk IO. Logs show high concurrent load counts, artifact fetch timeouts, and LRU eviction rates spiking to hundreds per minute.
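The 90% figure falls straight out of the capacity ratio. Under uniform traffic, the chance that a request misses the cache is simply the fraction of the fleet that does not fit:

```python
def cold_hit_probability(total_models, cache_slots):
    """Probability a uniformly random request misses the cache,
    assuming the cache holds an arbitrary subset of `cache_slots` models."""
    return 1 - cache_slots / total_models

print(cold_hit_probability(500, 50))  # 0.9
```

Real traffic is rarely uniform, so this is a worst-case bound; skewed (Zipfian) traffic with a well-chosen cache does much better, which is exactly what pre-warming exploits.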

Pre-warming and Pinning

Pre-warm the top N models (by traffic volume) on startup, keep them resident, and mark them non-evictable. If the top 50 models account for 80% of traffic, pinning them eliminates cold starts for most requests. For the remaining long tail, apply admission control that limits parallel loads to 1 to 2 per host: queue cold-load requests and serialize them, preventing IO saturation at the cost of longer waits for tail models. Reserve 20 to 30% of RAM/VRAM as headroom to absorb thrash.
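Pinning and admission control compose naturally in one cache wrapper: a pinned set that is loaded eagerly and never evicted, plus a semaphore that caps concurrent cold loads. A minimal sketch, assuming a hypothetical `fetch_artifact(model_id)` callable that pulls weights from object storage:

```python
import threading

class ModelCache:
    """Sketch of pinning plus admission control; `fetch_artifact` is a
    hypothetical loader injected by the caller, not a real library API."""

    def __init__(self, pinned_ids, fetch_artifact, max_parallel_loads=2):
        self._pinned = set(pinned_ids)
        self._resident = {}                       # model_id -> loaded model
        self._load_gate = threading.Semaphore(max_parallel_loads)
        self._fetch = fetch_artifact

    def prewarm(self):
        # Load pinned models eagerly at startup; they are never evicted.
        for model_id in self._pinned:
            self._resident[model_id] = self._fetch(model_id)

    def get(self, model_id):
        if model_id in self._resident:
            return self._resident[model_id]
        # Cold path: the semaphore serializes loads so a burst of cold
        # requests cannot saturate object-storage bandwidth or disk IO.
        with self._load_gate:
            if model_id not in self._resident:    # re-check after waiting
                self._resident[model_id] = self._fetch(model_id)
        return self._resident[model_id]

    def evictable(self, model_id):
        return model_id not in self._pinned
```

The double-check inside the semaphore prevents duplicate loads when several queued requests were waiting on the same cold model.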

Size-Aware Eviction

Standard LRU evicts by recency alone, so a single large model (1GB) can evict ten small models (100MB each), worsening thrash. Size-aware LRU evicts based on cost per byte (access frequency divided by size), keeping frequently accessed small models resident longer. Cohort sharding is the structural fix: partition the model fleet into size or traffic cohorts (hot tier, warm tier, cold tier) and deploy separate serving pools with tailored cache sizes and autoscaling policies.
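The cost-per-byte rule reduces to picking the resident model with the lowest value density. A minimal sketch of victim selection under that rule, with a hypothetical entry schema:

```python
def choose_victim(entries):
    """Pick the eviction victim under size-aware LRU: evict the entry with
    the lowest value density (access frequency / size in bytes), so one
    rarely used 1GB model goes before a busy 100MB model.
    `entries` maps model_id -> (access_count, size_bytes)."""
    return min(entries, key=lambda m: entries[m][0] / entries[m][1])

fleet = {
    "big-rare":   (2,   1_000_000_000),   # 1GB, 2 hits
    "small-busy": (500, 100_000_000),     # 100MB, 500 hits
}
print(choose_victim(fleet))  # big-rare
```

In practice the access count would be decayed over time so that formerly hot models do not stay protected forever; that detail is omitted here.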

💡 Key Takeaways
Cold start storms inflate p95/p99 by seconds (2 to 10s) when bursty traffic hits many cold models simultaneously, saturating IO and causing load/evict thrashing with eviction rates spiking to 150+ per minute
Root cause is cache capacity versus working set mismatch: 500 models with cache for 50 means 90% cold hit rate; even distribution makes every request likely cold
Pre-warming and pinning top N models (for example, top 50 accounting for 80% of traffic) eliminates most cold starts; mark them non-evictable to prevent thrashing
Admission control limits parallel loads to 1 to 2 per host, serializing cold loads to prevent IO saturation at the cost of longer tail latency for rare models
Size-aware LRU evicts by cost per byte (access frequency divided by size) instead of recency alone, preventing a single 1GB model from evicting ten 100MB models
📌 Interview Tips
1. Amazon SageMaker MME customer serving 1200 product category models: pre-warmed top 100 models (covering 75% of traffic) on deploy, reduced p99 from 12 seconds to 800ms, cut eviction rate from 200/min to 15/min
2. Netflix recommendation system during prime time surge: 300 user segment models requested simultaneously, triggered cold start storm with p95 spiking to 6 seconds; added admission control limiting 2 parallel loads per host, p95 dropped to 1.2 seconds
3. Stripe fraud detection with cohort sharding: 20 high volume merchant models in hot tier (pinned), 200 medium volume in warm tier (aggressive caching), 500 low volume in cold tier (accept 5s p99); reduced infrastructure cost 40% versus uniform deployment