On-Demand Loading vs Multi-Deployed: Latency and Cost Trade-offs
The choice between on-demand loading and multi-deployed patterns is fundamentally a trade-off between cost efficiency and latency predictability. On-demand loading fetches models lazily from object storage on first request and caches them in memory, evicting the Least Recently Used (LRU) entry when capacity is reached. Multi-deployed keeps all model versions permanently resident in memory with fixed traffic splits, so cold loads never occur.
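To make the mechanism concrete, here is a minimal sketch of an on-demand cache (the `load_fn` hook is a hypothetical stand-in for fetching and deserializing an artifact from object storage; production servers add locking, metrics, and async loading):

```python
from collections import OrderedDict

class OnDemandModelCache:
    """Minimal LRU sketch: models load lazily on first request and the
    least recently used entry is evicted once capacity is reached."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity      # max models resident at once
        self.load_fn = load_fn        # hypothetical: fetch + deserialize an artifact
        self._cache = OrderedDict()   # model_id -> model, ordered by recency

    def get(self, model_id):
        if model_id in self._cache:
            self._cache.move_to_end(model_id)   # hot path: mark most recently used
            return self._cache[model_id]
        model = self.load_fn(model_id)          # cold path: pull from object storage
        if len(self._cache) >= self.capacity:
            self._cache.popitem(last=False)     # evict the least recently used model
        self._cache[model_id] = model
        return model

# Usage: cache = OnDemandModelCache(capacity=20, load_fn=lambda mid: f"model:{mid}")
```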
On-demand loading excels for long-tail workloads where most models receive sparse traffic. Consider a fleet of 1,000 models where 900 receive under 0.1 QPS: dedicating one instance per model wastes idle capacity, whereas sharing a 20-node fleet with on-demand loading yields roughly 50x better utilization. The cost is bimodal latency. Hot models (already in cache) serve at normal speed, for example a p50 of 45ms for small CPU models or 15ms for small GPU convolutional neural networks. Cold loads add model fetch and deserialization time: 100 to 800ms for models under 100 megabytes on SSD cache, or 2 to 20 seconds for gigabyte-scale natural language processing models pulled from S3 over the network, depending on artifact size and decompression.
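A back-of-envelope way to see the bimodal effect, using the illustrative numbers above rather than measured values:

```python
def expected_latency_ms(hit_rate, hot_ms, cold_ms):
    """Mean latency for a bimodal hot/cold mix; the tail (p95/p99) is what
    actually absorbs the cold loads, so the mean understates the pain."""
    return hit_rate * hot_ms + (1.0 - hit_rate) * cold_ms

# 95% cache hits at 45ms, 5% cold loads at 800ms -> 82.75ms mean,
# even though p50 stays near 45ms.
print(expected_latency_ms(0.95, 45, 800))
```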
Multi-deployed patterns fit when p95 latency Service Level Objectives (SLOs) are strict and traffic is concentrated. Google Vertex AI customers running A/B tests with 95/5 canary splits keep both model versions hot, trading roughly 2x memory cost (both models resident) for a stable p95 with negligible routing overhead (under 1 millisecond). This works when model count is low (typically 2 to 5 versions per endpoint) and each model fits comfortably in device memory. For large language models requiring 6 to 8 gigabytes of VRAM per 7-billion-parameter model, multi-deployed on a single 40-gigabyte GPU is limited to 4 to 5 models before memory pressure causes out-of-memory errors or Key-Value (KV) cache thrashing.
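The routing and capacity math behind multi-deployed serving is simple enough to sketch; the split router below is illustrative, and the 2-gigabyte headroom reserved for activations and KV cache is an assumption, not a vendor figure:

```python
import random

def route(versions, weights):
    """Pick a resident version according to a fixed traffic split (e.g. 95/5).
    Both versions stay loaded, so routing is sub-millisecond and never cold."""
    return random.choices(versions, weights=weights, k=1)[0]

def max_resident_models(device_memory_gb, per_model_gb, headroom_gb=2.0):
    """Rough count of models that fit in device memory, reserving headroom
    for activations / KV cache (headroom value is an assumption)."""
    return int((device_memory_gb - headroom_gb) // per_model_gb)

print(route(["ranker-v1", "ranker-v2"], weights=[95, 5]))  # usually ranker-v1
print(max_resident_models(40, 8))                          # -> 4 resident 7B models
```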
The failure mode to watch: if on-demand loading faces bursty traffic hitting many cold models simultaneously, you get a cold-start storm in which load/evict cycles thrash, Input/Output (IO) saturates pulling artifacts, and p95 spikes by seconds. Mitigation requires pre-warming the top N models, pinning critical models in cache to prevent eviction, and admission control limiting parallel loads to 1 to 2 per host.
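A sketch of those mitigations, building on the hypothetical cache above: pinned models are exempt from eviction, and a semaphore caps concurrent cold loads per host so a burst cannot saturate IO.

```python
import threading

class LoadAdmissionController:
    """Sketch of cold-start-storm mitigations: pin critical models so the
    LRU never evicts them, and cap concurrent cold loads on this host."""

    def __init__(self, pinned_ids, max_parallel_loads=2):
        self.pinned = set(pinned_ids)                       # never evicted
        self._load_slots = threading.Semaphore(max_parallel_loads)

    def is_evictable(self, model_id):
        return model_id not in self.pinned

    def load(self, model_id, load_fn):
        # Block if this host already has max_parallel_loads cold loads in
        # flight, so bursty cold traffic cannot saturate IO all at once.
        with self._load_slots:
            return load_fn(model_id)

# Usage: ctrl = LoadAdmissionController(pinned_ids={"fraud-v3"}, max_parallel_loads=2)
```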
💡 Key Takeaways
•On-demand loading achieves 3 to 10x cost reduction for long-tail workloads (most models under 0.1 QPS) by sharing infrastructure, but cold loads add 100ms to 20 seconds of latency
•Multi-deployed keeps all models resident in memory for a stable p95, used for A/B testing where both model versions must have predictable latency, trading roughly 2x higher memory cost
•Cold start time correlates with artifact size: models under 100 megabytes load in 100 to 800ms from SSD, gigabyte-scale models take 2 to 20 seconds from S3
•For 7-billion-parameter large language models requiring 6 to 8 gigabytes of VRAM each, a 40-gigabyte GPU supports at most 4 to 5 resident models before memory pressure causes failures
•Cold-start storms occur when bursty traffic hits many cold models, causing load/evict thrashing and p95 spikes; mitigate by pinning the top N models and limiting parallel loads to 1 to 2 per host
📌 Examples
Amazon SageMaker MME customer hosting 800 fraud detection models on 15 instances with on-demand loading: p50 of 40ms hot, p95 of 1.2 seconds including occasional cold loads, versus $25K/month for the equivalent dedicated endpoints
Google Vertex AI A/B test with two ranking models (200MB each) kept resident on a 16GB GPU: both serve at p95 of 65ms, traffic split 95/5, zero cold starts over a 2-week experiment
Uber trip pricing serving 50 city-specific models on a shared fleet: top 10 cities pinned in cache (70% of traffic), remaining 40 cities served on demand with p99 of 2 seconds during cold loads