Cold Start Problem: Model Loading and Predictive Warming Strategies
Cold start latency is the dominant constraint for GPU autoscaling in real-time inference. The total delay from scaling decision to serving traffic includes node provisioning (60 to 120 seconds for cloud VMs), GPU driver initialization (20 to 40 seconds), container image pull for GPU-optimized images (30 to 90 seconds depending on registry proximity), and model weight loading from object storage (100 to 300 seconds for multi-gigabyte models). A large language model at 10GB can take 180+ seconds just to load weights into GPU memory, far exceeding any p99 latency SLO.
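To make the budget concrete, here is a minimal sketch that simply sums the component ranges above; the values mirror this section's figures and are illustrative, not measurements from any particular platform.

```python
# Minimal sketch: sum the latency component ranges above to get the cold-start
# budget for a fully cold replica. Numbers are illustrative, not measured.
COLD_START_COMPONENTS_S = {
    "node_provisioning": (60, 120),
    "gpu_driver_init": (20, 40),
    "image_pull": (30, 90),
    "model_weight_load": (100, 300),  # ~10GB of weights from object storage
}

def cold_start_budget(components):
    """Return (best_case_s, worst_case_s) for a fully cold replica."""
    best = sum(lo for lo, _ in components.values())
    worst = sum(hi for _, hi in components.values())
    return best, worst

print(cold_start_budget(COLD_START_COMPONENTS_S))  # (210, 550)
```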
Reactive autoscaling cannot hide this delay. If your autoscaler triggers on queue depth exceeding 50 requests and takes 240 seconds to bring a new replica online, every request during that window experiences degraded latency or timeouts. A sudden traffic spike from 100 to 1000 requests per second causes immediate SLO violations while the system scrambles to add capacity. Production systems address this through multiple strategies operating at different timescales.
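A rough back-of-the-envelope for the spike described above, using an assumed steady capacity of 100 requests per second, shows how much traffic is exposed during the ramp:

```python
# Back-of-the-envelope: traffic arriving above current capacity while a cold
# replica comes online. The steady-capacity figure is an assumption.
def degraded_requests(spike_rps, steady_capacity_rps, ramp_seconds):
    """Requests arriving beyond existing capacity during the cold-start ramp."""
    excess_rps = max(spike_rps - steady_capacity_rps, 0)
    return excess_rps * ramp_seconds

# Spike from 100 to 1000 RPS with a 240-second ramp:
print(degraded_requests(1000, 100, 240))  # 216000 requests queued, delayed, or dropped
```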
Predictive autoscaling uses traffic forecasts or reinforcement-learning-based controllers to anticipate load and pre-warm capacity minutes before spikes occur. Google and Meta use historical patterns (like daily/weekly seasonality) plus real-time signals to stage GPU nodes and load models proactively. The risk is over-provisioning if forecasts drift, requiring cost guardrails and maximum capacity caps. Warm pools maintain a small baseline of ready replicas even during idle periods, trading idle cost (perhaps one replica at $2/hour) for instant response to initial traffic. Scale-to-zero saves cost but accepts cold start penalties for the first requests.
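A minimal sketch of the idea, assuming a toy diurnal forecast, an assumed per-replica throughput, and illustrative warm-pool and cost-cap limits; none of this reflects a specific vendor's controller:

```python
# Forecast-driven pre-warmer sketch: look ahead by the known cold-start time,
# convert forecasted RPS to replicas, and clamp between a warm-pool floor and
# a cost cap. All constants and the forecast function are assumed values.
import math
import time

COLD_START_S = 240        # lead time to bring a replica online
RPS_PER_REPLICA = 100     # assumed per-replica throughput
WARM_POOL_MIN = 1         # baseline kept even when idle
MAX_REPLICAS = 20         # cost guardrail against forecast drift

def forecast_rps(t_unix: float) -> float:
    """Toy diurnal forecast: ~500 RPS baseline with a daily peak near 1000 RPS."""
    day_fraction = (t_unix % 86400) / 86400
    return 500 + 500 * max(0.0, math.sin(2 * math.pi * day_fraction))

def desired_replicas(now_unix: float) -> int:
    # Provision for the load expected when a replica started *now* would be ready.
    future_rps = forecast_rps(now_unix + COLD_START_S)
    needed = math.ceil(future_rps / RPS_PER_REPLICA)
    return max(WARM_POOL_MIN, min(needed, MAX_REPLICAS))

print(desired_replicas(time.time()))
```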
Model loading optimization matters significantly. Techniques include splitting weights into chunks for parallel download, using nearby object storage or content delivery networks to cut transfer time from roughly 200 seconds to 60 seconds, lazy-loading layers progressively as requests arrive, and caching popular models on persistent volumes attached to GPU nodes. Health check grace periods must account for load time: a 180-second grace period prevents Kubernetes from killing pods that are legitimately initializing. Termination grace periods (like 600 seconds) allow in-flight inference to drain before shutdown, preventing request failures during scale-down.
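A sketch of the parallel chunk-download technique, with a hypothetical URL, object size, and chunk size; production systems often layer this on a CDN endpoint or skip the download entirely when a persistent volume already holds the weights:

```python
# Chunked, parallel weight download sketch: split the object into byte ranges,
# fetch them concurrently, and assemble into a local cache path on the node.
import concurrent.futures
import urllib.request

CHUNK_BYTES = 256 * 1024 * 1024  # 256 MiB per range request (assumed)

def fetch_range(url, start, end):
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return start, resp.read()

def download_weights(url, total_bytes, dest_path, workers=8):
    ranges = [(s, min(s + CHUNK_BYTES, total_bytes) - 1)
              for s in range(0, total_bytes, CHUNK_BYTES)]
    with open(dest_path, "wb") as out, \
         concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch_range, url, s, e) for s, e in ranges]
        for fut in concurrent.futures.as_completed(futures):
            offset, data = fut.result()
            out.seek(offset)   # write each chunk at its original byte offset
            out.write(data)
```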
💡 Key Takeaways
• Total cold start latency typically reaches 240 to 280 seconds (worst case over 500s), combining node provisioning (60 to 120s), driver initialization (20 to 40s), image pull (30 to 90s), and model weight loading (100 to 300s for 10GB models)
• Reactive autoscaling triggers only after queue buildup, so every request during the roughly 240-second ramp risks violating the p99 SLO, making it unsuitable for latency-critical inference without warm pools
• Predictive autoscaling using historical patterns or reinforcement learning pre-warms capacity 5 to 10 minutes before forecasted spikes, hiding cold start entirely at the risk of over-provisioning if forecasts miss
• Warm pools maintain one or two baseline replicas during idle periods, trading $2 to $4 per hour of idle cost for instant response to initial traffic, with reactive scaling adding capacity for sustained load
• Model loading optimization cuts load time from roughly 200 seconds to 60 seconds through parallel chunk downloads, nearby object storage or a content delivery network (CDN), and persistent-volume caching of popular models on GPU nodes
📌 Examples
Production configuration: a health check grace period of 180 seconds accommodates model loading, and a termination grace of 600 seconds allows in-flight inference to complete before pod shutdown (see the config sketch after these examples)
Large language model serving uses persistent volumes to cache 10GB model weights on GPU nodes, reducing subsequent pod startup from 180 seconds to 45 seconds by skipping the object storage download
A predictive autoscaler at Meta uses reinforcement learning on traffic patterns to pre-warm GPU capacity 8 minutes before the daily peak, maintaining the p99 latency target of under 150ms during a traffic increase from 500 to 5000 requests per second
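For reference, the production-configuration example above maps roughly onto these Kubernetes settings, shown here as a Python dict that could be rendered into a manifest; the image name, port, and probe paths are hypothetical placeholders:

```python
# Sketch of the grace-period settings from the production-configuration example.
pod_spec_fragment = {
    "terminationGracePeriodSeconds": 600,  # let in-flight inference drain on scale-down
    "containers": [{
        "name": "inference-server",
        "image": "registry.example.com/llm-server:latest",  # placeholder image
        "startupProbe": {                   # tolerate ~180s of model loading
            "httpGet": {"path": "/healthz", "port": 8080},
            "periodSeconds": 10,
            "failureThreshold": 18,         # 18 x 10s = 180s before restart
        },
        "readinessProbe": {                 # gate traffic until weights are loaded
            "httpGet": {"path": "/ready", "port": 8080},
            "periodSeconds": 5,
        },
    }],
}
```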