On-Demand Loading vs Multi-Deployed: Latency and Cost Trade-offs
The Fundamental Trade-off
The choice between on-demand loading and multi-deployed patterns is a trade-off between cost efficiency and latency predictability. On-demand loading fetches models lazily from object storage on first request and caches them, evicting least-recently-used (LRU) entries when capacity is reached. Multi-deployed keeps all model versions permanently resident in memory with fixed traffic splits, so cold loads never occur.
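The on-demand pattern described above can be sketched as a small LRU cache over a model loader. This is an illustrative sketch, not any particular serving framework's API; `OnDemandModelCache` and the `loader` callable are hypothetical names.

```python
from collections import OrderedDict

class OnDemandModelCache:
    """Lazily loads models on first request; evicts LRU entries at capacity (sketch)."""

    def __init__(self, capacity, loader):
        self.capacity = capacity      # max models resident in memory
        self.loader = loader          # callable: model_id -> deserialized model
        self._cache = OrderedDict()   # model_id -> model, maintained in LRU order

    def get(self, model_id):
        if model_id in self._cache:
            self._cache.move_to_end(model_id)  # hot path: mark recently used
            return self._cache[model_id]
        model = self.loader(model_id)          # cold path: fetch + deserialize
        self._cache[model_id] = model
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)    # evict least recently used
        return model
```

The hot path is a dictionary lookup; all of the latency cost discussed below is concentrated in the `loader` call on the cold path.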
When On-Demand Excels
On-demand loading excels for long-tail workloads where most models receive sparse traffic. Consider a fleet of 1,000 models where 900 receive under 0.1 QPS: dedicating one instance per model wastes idle capacity, while consolidating onto a shared 20-node fleet with on-demand loading yields 50x better utilization. The cost is bimodal latency. Hot models serve at normal speed (p50 of 45ms for small CPU models). Cold loads add model fetch and deserialization time: 100 to 800ms for models under 100MB on SSD cache, or 2 to 20 seconds for gigabyte-scale NLP models pulled from S3.
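The utilization arithmetic in the example above can be checked directly; the numbers come from the text, and the per-node figure is an illustrative derived quantity, not a benchmark.

```python
# Numbers from the long-tail example in the text.
total_models = 1000
long_tail_models = 900
long_tail_qps_per_model = 0.1   # upper bound per model
fleet_nodes = 20

# Consolidation: 1,000 dedicated instances vs a shared 20-node fleet.
consolidation = total_models / fleet_nodes
print(consolidation)            # 50.0 -> the "50x better utilization" figure

# Aggregate tail traffic spreads thinly across the shared fleet.
long_tail_qps = long_tail_models * long_tail_qps_per_model   # ~90 QPS total
per_node_qps = long_tail_qps / fleet_nodes                   # ~4.5 QPS per node
```

The derived ~4.5 QPS per node shows why the shared fleet copes easily in steady state; the risk lies in which 90 QPS arrive, since requests for cold models pay the load penalty.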
When Multi-Deployed Wins
Multi-deployed patterns win when p95 latency SLOs are strict and traffic is concentrated. Running A/B tests with a 95/5 canary split keeps both model versions hot, trading roughly 2x memory cost for stable p95 with negligible routing overhead (under 1ms). This works when the model count is low (typically 2 to 5 versions per endpoint) and each model fits comfortably in device memory. For LLMs requiring 6 to 8GB of VRAM per 7B-parameter model, multi-deployed on a single 40GB GPU tops out at 4 to 5 models.
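The fixed traffic split behind a 95/5 canary can be sketched as weighted random routing over resident model versions. This is a minimal sketch; the `route` function and version names are hypothetical, and real routers typically also support sticky sessions and header-based overrides.

```python
import random

def route(split, rng=random.random):
    """Pick a resident model version according to fixed traffic-split weights."""
    total = sum(weight for _, weight in split)
    r = rng() * total                  # uniform point on [0, total)
    for version, weight in split:
        r -= weight
        if r <= 0:
            return version
    return split[-1][0]                # guard against float rounding

# 95/5 canary split from the text: both versions stay loaded and hot.
split = [("model-v1", 95), ("model-v2", 5)]
```

Because both versions are always resident, routing is a constant-time weighted draw, which is where the sub-millisecond overhead figure comes from.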
The Cold Start Storm Failure
When on-demand loading faces bursty traffic that hits many cold models simultaneously, you get a cold start storm: load/evict cycles thrash, IO saturates pulling artifacts, and p95 spikes by seconds. Mitigation requires pre-warming the top N models, pinning critical models in cache so they cannot be evicted, and admission control that limits parallel loads to 1 to 2 per host.
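The three mitigations above can be sketched together: a pre-warm step, a pinned set excluded from eviction, and a semaphore that caps concurrent cold loads per host. This is an illustrative sketch with hypothetical names (`StormResistantCache`, `loader`), not a production implementation.

```python
import threading
from collections import OrderedDict

class StormResistantCache:
    """On-demand model cache with pinning and bounded parallel loads (sketch)."""

    def __init__(self, capacity, loader, pinned=(), max_parallel_loads=2):
        self.capacity = capacity
        self.loader = loader                    # callable: model_id -> model
        self.pinned = set(pinned)               # critical models, never evicted
        self._cache = OrderedDict()             # model_id -> model, LRU order
        self._lock = threading.Lock()
        # Admission control: at most 1-2 cold loads in flight per host.
        self._load_gate = threading.Semaphore(max_parallel_loads)

    def prewarm(self, model_ids):
        """Load the top-N models before traffic arrives."""
        for mid in model_ids:
            self.get(mid)

    def get(self, model_id):
        with self._lock:
            if model_id in self._cache:
                self._cache.move_to_end(model_id)   # hot path
                return self._cache[model_id]
        with self._load_gate:                       # cap concurrent cold loads
            model = self.loader(model_id)
        with self._lock:
            self._cache[model_id] = model
            while len(self._cache) > self.capacity:
                # Evict the least-recently-used *unpinned* model.
                victim = next((k for k in self._cache if k not in self.pinned), None)
                if victim is None:
                    break                           # everything pinned; tolerate overage
                del self._cache[victim]
            return model
```

The semaphore keeps a burst of cold requests from saturating IO, at the cost of queueing excess loads behind the gate rather than failing fast; a production version would likely add a timeout or load-shedding on that wait.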