What is Multi-Model Serving?
The Request Flow
The system routes each request based on model identity (passed explicitly in request metadata or resolved implicitly via routing rules), loads the target model if it is not already in memory, and executes inference. This is fundamentally different from single-model endpoints, where one URL maps to one model. For example, SageMaker Multi-Model Endpoints customers commonly host 100 to 1,000+ models on just 10 to 20 instances instead of dedicating one instance per model, achieving a 3 to 10x cost reduction.
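The flow above can be sketched in a few lines. This is a minimal illustration, not any particular framework's API: the function and variable names (handle_request, load_model, loaded_models) are hypothetical, and the "model" metadata key stands in for whatever identity mechanism the system uses.

```python
# Minimal sketch of multi-model request dispatch (all names hypothetical).
# The model identity arrives in request metadata; the target model is
# loaded lazily on first use and reused on subsequent requests.

loaded_models = {}  # models already resident in memory

def load_model(model_id):
    # Placeholder for fetching artifacts from a model store such as S3.
    return f"model-object:{model_id}"

def handle_request(metadata, payload):
    model_id = metadata["model"]          # explicit model identity
    if model_id not in loaded_models:     # cold start: load on first request
        loaded_models[model_id] = load_model(model_id)
    model = loaded_models[model_id]
    return f"{model} -> inference({payload})"
```

The second request for the same model skips the load step entirely, which is what lets one endpoint serve many models without paying the load cost on every call.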
Three Core Patterns
On-demand multi-model serving uses lazy loading: models are fetched from object storage on first request and cached in memory with LRU eviction, maximizing hardware utilization for long-tail traffic. Multi-deployed endpoints keep multiple model versions loaded concurrently with fixed traffic splits for A/B testing, trading higher memory cost for stable latency. Gateway-level aggregation routes requests through a reverse proxy to per-model backend pools, maintaining isolation while offering centralized policy control.
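The LRU eviction at the heart of the on-demand pattern can be sketched with a bounded ordered map. This is an illustrative in-memory version only; the class name, capacity policy, and loader callback are assumptions, and a production cache would also account for model size rather than a simple entry count.

```python
from collections import OrderedDict

class ModelCache:
    """Bounded LRU cache for loaded models (hypothetical sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._models = OrderedDict()  # insertion order tracks recency

    def get(self, model_id, loader):
        if model_id in self._models:
            # Cache hit: mark as most recently used.
            self._models.move_to_end(model_id)
            return self._models[model_id]
        # Cache miss (cold start): fetch from the model store.
        model = loader(model_id)
        self._models[model_id] = model
        if len(self._models) > self.capacity:
            # Evict the least recently used model.
            self._models.popitem(last=False)
        return model
```

For example, with capacity 2, loading models a, b, then touching a again and loading c evicts b, since a was used more recently. This is the trade-off behind the pattern: rarely used models pay a cold-start penalty, but the hardware stays busy serving whatever the long tail currently needs.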
Key Architectural Components
A request router extracts model identity from each request. A model registry tracks metadata such as size and version. A model store (typically object storage such as S3) holds the artifacts. A cache layer (in memory or on GPU) holds loaded models. Per-model observability tracks metrics like p50/p95/p99 latency, cache hit rate, and cold-start frequency for each model independently.
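The per-model observability component can be sketched as a small metrics tracker keyed by model ID. The class and method names are hypothetical, and the percentile uses a simple nearest-rank calculation over recorded samples; a real system would typically use streaming histograms instead of storing every latency.

```python
from collections import defaultdict

class ModelMetrics:
    """Tracks latency, cache hits, and cold starts per model (illustrative)."""

    def __init__(self):
        self.latencies = defaultdict(list)  # model_id -> latency samples (ms)
        self.hits = defaultdict(int)        # model_id -> cache hit count
        self.misses = defaultdict(int)      # model_id -> cold start count

    def record(self, model_id, latency_ms, cache_hit):
        self.latencies[model_id].append(latency_ms)
        if cache_hit:
            self.hits[model_id] += 1
        else:
            self.misses[model_id] += 1

    def percentile(self, model_id, p):
        # Nearest-rank percentile over this model's recorded latencies.
        samples = sorted(self.latencies[model_id])
        idx = max(0, int(round(p / 100 * len(samples))) - 1)
        return samples[idx]

    def hit_rate(self, model_id):
        total = self.hits[model_id] + self.misses[model_id]
        return self.hits[model_id] / total if total else 0.0
```

Keeping the metrics keyed per model is what makes the pattern operable: a spike in one model's p99 or cold-start rate stays visible instead of being averaged away across hundreds of co-hosted models.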