Cold Start Storms and Model Thrashing: Detection and Mitigation
What Cold Start Storms Look Like
Cold start storms occur when bursty traffic simultaneously hits many cold models in an on-demand serving system, triggering a cascade of load/evict cycles that saturates IO, spikes CPU with deserialization work, and inflates p95/p99 latency by seconds. The symptom is a bimodal latency distribution: hot-model requests stay fast (p50 around 50 ms), while cold requests spike to a multi-second p95 (2 to 10 seconds) during the storm. Model thrashing is the steady-state version: the same models are continuously evicted and reloaded because cache capacity is too small relative to the working set.
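The bimodal signature above can be detected from a window of recent latencies: a still-fast p50 combined with a multi-second p95. A minimal sketch, with illustrative (untuned) thresholds:

```python
def detect_cold_start_storm(latencies_ms, p95_threshold_ms=2000.0, ratio_threshold=10.0):
    """Flag a bimodal latency distribution: fast p50 but multi-second p95.

    latencies_ms: recent request latencies (ms) for one serving pool.
    Thresholds are illustrative defaults, not tuned production values.
    """
    if len(latencies_ms) < 20:
        return False  # too few samples to judge
    xs = sorted(latencies_ms)
    p50 = xs[len(xs) // 2]
    p95 = xs[int(len(xs) * 0.95) - 1]
    # A storm shows up as a p95 seconds above a still-fast p50.
    return p95 >= p95_threshold_ms and p95 / max(p50, 1.0) >= ratio_threshold

# 90 hot requests at 50 ms plus 10 cold loads at 5 s trips the detector;
# a uniformly fast window does not.
print(detect_cold_start_storm([50.0] * 90 + [5000.0] * 10))  # -> True
print(detect_cold_start_storm([50.0] * 100))                 # -> False
```

In practice you would feed this a sliding window per pool and alert when it stays true for several consecutive windows, to avoid paging on a single slow load.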
Root Cause
The root cause is a mismatch between cache capacity and the working set, amplified by traffic patterns. If you have 500 models but cache capacity for only 50, and traffic is evenly distributed, every request has a 90% chance of hitting a cold model. Worse, if traffic arrives in bursts (a batch job hitting 100 different models simultaneously), the system tries to load many models in parallel, overwhelming object-storage bandwidth (S3 throttles GET requests at roughly 5,500 per second per prefix) and local disk IO. Logs show high concurrent-load counts, artifact fetch timeouts, and LRU eviction rates spiking to hundreds per minute.
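The 90% figure follows directly from the capacity gap: under uniform traffic with a full cache, a request is cold whenever its model is one of the uncached ones. A one-line check of the arithmetic:

```python
def cold_request_probability(num_models, cache_slots):
    """Under uniform traffic with a full cache, a request is cold whenever
    its model is one of the (num_models - cache_slots) uncached models."""
    return max(num_models - cache_slots, 0) / num_models

# 500 models, cache room for 50 -> 90% of requests hit a cold model.
print(cold_request_probability(500, 50))  # -> 0.9
```

Real traffic is rarely uniform, so this is a worst-case bound; a skewed (e.g. Zipfian) distribution is exactly what pinning the hot set exploits.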
Pre-warming and Pinning
Pre-warm the top N models (by traffic volume) at startup, keep them resident, and mark them non-evictable. If the top 50 models account for 80% of traffic, pinning them eliminates cold starts for most requests. For the remaining long tail, add admission control that limits parallel loads to 1 or 2 per host: queue cold-load requests and serialize them, preventing IO saturation at the cost of longer waits for tail models. Reserve 20 to 30% of RAM/VRAM as headroom to absorb thrash.
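Pinning plus load admission control can be sketched with a semaphore gating the cold path. This is a minimal illustration, not a production loader; `fetch_and_deserialize` is a hypothetical callback standing in for whatever your artifact loader provides:

```python
import threading

class ModelLoader:
    """Sketch: pin a hot set at startup, and cap concurrent cold loads.

    - pinned models are pre-warmed and reported as non-evictable
    - at most max_parallel_loads cold loads run at once per host; the rest
      queue on the semaphore instead of fanning out and saturating IO
    """

    def __init__(self, fetch_and_deserialize, pinned_ids, max_parallel_loads=2):
        self._fetch = fetch_and_deserialize   # hypothetical artifact loader
        self._pinned = set(pinned_ids)
        self._cache = {}
        self._lock = threading.Lock()
        self._load_slots = threading.Semaphore(max_parallel_loads)
        for model_id in self._pinned:         # pre-warm the hot set
            self._cache[model_id] = self._fetch(model_id)

    def get(self, model_id):
        with self._lock:
            if model_id in self._cache:       # hot path: no IO, no queueing
                return self._cache[model_id]
        # Cold path: serialize loads so a burst cannot become
        # dozens of parallel artifact fetches.
        with self._load_slots:
            with self._lock:                  # another thread may have loaded it
                if model_id in self._cache:
                    return self._cache[model_id]
            model = self._fetch(model_id)     # slow IO, held outside the lock
            with self._lock:
                self._cache[model_id] = model
            return model

    def evictable(self, model_id):
        return model_id not in self._pinned
```

The eviction policy itself is left out here; the point is that the semaphore bounds concurrent cold loads while pinned entries never become eviction candidates.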
Size-Aware Eviction
Standard LRU evicts by recency alone, so loading one large model (1 GB) can evict ten small models (100 MB each), worsening thrash. Size-aware LRU instead evicts by value per byte (access frequency divided by size), keeping frequently accessed small models resident longer. Cohort sharding is the structural fix: partition the model fleet into size or traffic cohorts (hot, warm, cold tiers) and deploy separate serving pools with tailored cache sizes and autoscaling policies.
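The frequency-per-byte rule can be sketched in a few lines. This toy cache (sizes in MB, names hypothetical) picks as victim the entry with the lowest access_count / size, so a rarely used 1 GB model goes before a busy 100 MB one:

```python
class SizeAwareCache:
    """Toy size-aware eviction: victim = entry with the lowest
    access_count / size_bytes (lowest value per byte)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = {}  # model_id -> (size_bytes, access_count)

    def access(self, model_id, size_bytes):
        if model_id in self.entries:
            size, count = self.entries[model_id]
            self.entries[model_id] = (size, count + 1)
            return
        # Evict lowest value-per-byte entries until the new model fits.
        while self.used + size_bytes > self.capacity and self.entries:
            victim = min(self.entries,
                         key=lambda m: self.entries[m][1] / self.entries[m][0])
            self.used -= self.entries[victim][0]
            del self.entries[victim]
        self.entries[model_id] = (size_bytes, 1)
        self.used += size_bytes

# Capacity 1200 MB: a 100 MB model hit 5 times outlives a 1000 MB
# model hit once when a new 200 MB model needs room.
cache = SizeAwareCache(1200)
for _ in range(5):
    cache.access("small", 100)
cache.access("big", 1000)
cache.access("new", 200)
print(sorted(cache.entries))  # -> ['new', 'small']  ("big" was evicted)
```

A real implementation would also age the counts (e.g. exponential decay) so that yesterday's hot model does not stay pinned by stale frequency.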