On-Demand Loading vs Multi-Deployed: Latency and Cost Trade-offs
The choice between on-demand loading and multi-deployed patterns is fundamentally a trade-off between cost efficiency and latency predictability. On-demand loading fetches models lazily from object storage on first request and caches them in memory, evicting the Least Recently Used (LRU) entry when capacity is reached. Multi-deployed keeps all model versions permanently resident in memory with fixed traffic splits, so cold loads never occur.
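To make the mechanism concrete, here is a minimal sketch of an on-demand cache (the `load_fn` hook is a hypothetical stand-in for fetching and deserializing an artifact from object storage; production servers add locking, metrics, and async loading):

```python
from collections import OrderedDict

class OnDemandModelCache:
    """Minimal LRU sketch: models load lazily on first request and the
    least recently used entry is evicted once capacity is reached."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity      # max models resident at once
        self.load_fn = load_fn        # hypothetical: fetch + deserialize an artifact
        self._cache = OrderedDict()   # model_id -> model, ordered by recency

    def get(self, model_id):
        if model_id in self._cache:
            self._cache.move_to_end(model_id)   # hot path: mark most recently used
            return self._cache[model_id]
        model = self.load_fn(model_id)          # cold path: pull from object storage
        if len(self._cache) >= self.capacity:
            self._cache.popitem(last=False)     # evict the least recently used model
        self._cache[model_id] = model
        return model

# Usage: cache = OnDemandModelCache(capacity=20, load_fn=lambda mid: f"model:{mid}")
```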
On-demand loading excels for long-tail workloads where most models receive sparse traffic. Consider a fleet of 1,000 models where 900 receive under 0.1 QPS: dedicating one instance per model wastes idle capacity, whereas sharing a 20-node fleet with on-demand loading yields roughly 50x better utilization. The cost is bimodal latency. Hot models (already in cache) serve at normal speed, for example a p50 of 45ms for small CPU models or 15ms for small GPU convolutional neural networks. Cold loads add model fetch and deserialization time: 100 to 800ms for models under 100 megabytes on SSD cache, or 2 to 20 seconds for gigabyte-scale natural language processing models pulled from S3 over the network, depending on artifact size and decompression.
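A back-of-envelope way to see the bimodal effect, using the illustrative numbers above rather than measured values:

```python
def expected_latency_ms(hit_rate, hot_ms, cold_ms):
    """Mean latency for a bimodal hot/cold mix; the tail (p95/p99) is what
    actually absorbs the cold loads, so the mean understates the pain."""
    return hit_rate * hot_ms + (1.0 - hit_rate) * cold_ms

# 95% cache hits at 45ms, 5% cold loads at 800ms -> 82.75ms mean,
# even though p50 stays near 45ms.
print(expected_latency_ms(0.95, 45, 800))
```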
Multi-deployed patterns fit when p95 latency Service Level Objectives (SLOs) are strict and traffic is concentrated. Google Vertex AI customers running A/B tests with 95/5 canary splits keep both model versions hot, trading roughly 2x memory cost (both models resident) for a stable p95 with negligible routing overhead (under 1 millisecond). This works when model count is low (typically 2 to 5 versions per endpoint) and each model fits comfortably in device memory. For large language models requiring 6 to 8 gigabytes of VRAM per 7-billion-parameter model, multi-deployed on a single 40-gigabyte GPU is limited to 4 to 5 models before memory pressure causes out-of-memory errors or Key-Value (KV) cache thrashing.
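The routing and capacity math behind multi-deployed serving is simple enough to sketch; the split router below is illustrative, and the 2-gigabyte headroom reserved for activations and KV cache is an assumption, not a vendor figure:

```python
import random

def route(versions, weights):
    """Pick a resident version according to a fixed traffic split (e.g. 95/5).
    Both versions stay loaded, so routing is sub-millisecond and never cold."""
    return random.choices(versions, weights=weights, k=1)[0]

def max_resident_models(device_memory_gb, per_model_gb, headroom_gb=2.0):
    """Rough count of models that fit in device memory, reserving headroom
    for activations / KV cache (headroom value is an assumption)."""
    return int((device_memory_gb - headroom_gb) // per_model_gb)

print(route(["ranker-v1", "ranker-v2"], weights=[95, 5]))  # usually ranker-v1
print(max_resident_models(40, 8))                          # -> 4 resident 7B models
```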
The failure mode to watch: if on-demand loading faces bursty traffic hitting many cold models simultaneously, you get a cold-start storm in which load/evict cycles thrash, Input/Output (IO) saturates pulling artifacts, and p95 spikes by seconds. Mitigation requires pre-warming the top N models, pinning critical models in cache to prevent eviction, and admission control limiting parallel loads to 1 to 2 per host.
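A sketch of those mitigations, building on the hypothetical cache above: pinned models are exempt from eviction, and a semaphore caps concurrent cold loads per host so a burst cannot saturate IO.

```python
import threading

class LoadAdmissionController:
    """Sketch of cold-start-storm mitigations: pin critical models so the
    LRU never evicts them, and cap concurrent cold loads on this host."""

    def __init__(self, pinned_ids, max_parallel_loads=2):
        self.pinned = set(pinned_ids)                       # never evicted
        self._load_slots = threading.Semaphore(max_parallel_loads)

    def is_evictable(self, model_id):
        return model_id not in self.pinned

    def load(self, model_id, load_fn):
        # Block if this host already has max_parallel_loads cold loads in
        # flight, so bursty cold traffic cannot saturate IO all at once.
        with self._load_slots:
            return load_fn(model_id)

# Usage: ctrl = LoadAdmissionController(pinned_ids={"fraud-v3"}, max_parallel_loads=2)
```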
💡 Key Takeaways
•On-demand loading achieves 3 to 10x cost reduction for long-tail workloads (most models under 0.1 QPS) by sharing infrastructure, but cold loads add 100ms to 20 seconds of latency
•Multi-deployed keeps all models resident in memory for a stable p95, used for A/B testing where both model versions must have predictable latency, trading roughly 2x higher memory cost
•Cold start time correlates with artifact size: models under 100 megabytes load in 100 to 800ms from SSD, gigabyte-scale models take 2 to 20 seconds from S3
•For 7-billion-parameter large language models requiring 6 to 8 gigabytes of VRAM each, a 40-gigabyte GPU supports at most 4 to 5 resident models before memory pressure causes failures
•Cold-start storms occur when bursty traffic hits many cold models, causing load/evict thrashing and p95 spikes; mitigate by pinning the top N models and limiting parallel loads to 1 to 2 per host
📌 Examples
Amazon SageMaker MME customer hosting 800 fraud detection models on 15 instances with on-demand loading: p50 of 40ms hot, p95 of 1.2 seconds including occasional cold loads, versus $25K/month for the equivalent dedicated endpoints
Google Vertex AI A/B test with two ranking models (200MB each) kept resident on a 16GB GPU: both serve at p95 of 65ms, traffic split 95/5, zero cold starts over a 2-week experiment
Uber trip pricing serving 50 city-specific models on a shared fleet: top 10 cities pinned in cache (70% of traffic), remaining 40 cities served on demand with p99 of 2 seconds during cold loads