Model Serving & Inference: Multi-model Serving (Easy · ~2 min)

What is Multi-Model Serving?

Definition
Multi-model serving puts multiple machine learning models behind a single logical endpoint, where each request carries a model identifier that tells the system which model to invoke. Instead of deploying one endpoint per model, you share infrastructure across tens to thousands of models.

The Request Flow

The system routes each request based on model identity (passed explicitly in metadata or implicitly via routing rules), loads the target model if needed, and executes inference. This is fundamentally different from single-model endpoints, where one URL maps to one model. For example, SageMaker Multi-Model Endpoints customers commonly host 100 to 1,000+ models on just 10 to 20 instances instead of dedicating one instance per model, achieving a 3 to 10x cost reduction.
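The flow above can be sketched in a few lines of Python. This is an illustrative toy, not any real serving framework's API: `handle_request`, `loaded_models`, and `fake_load_from_store` are hypothetical names, and the "model" is a stand-in object rather than a real artifact fetched from object storage.

```python
# Toy sketch of multi-model request routing with lazy loading.
# One endpoint serves many models; the request carries a model identifier.

loaded_models = {}  # model_id -> loaded model, populated on first use


def fake_load_from_store(model_id):
    # Stand-in for fetching and deserializing an artifact from object
    # storage (e.g. S3). Returns a dummy "model" with a predict function.
    return {"id": model_id, "predict": lambda x: f"{model_id}:{x}"}


def handle_request(model_id, payload):
    # Cold path: load the model on its first request, then cache it so
    # subsequent requests for the same model skip the fetch entirely.
    if model_id not in loaded_models:
        loaded_models[model_id] = fake_load_from_store(model_id)
    return loaded_models[model_id]["predict"](payload)
```

The key property: adding a new model requires no new endpoint, only a new artifact in the store and a new identifier in requests.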

Three Core Patterns

On-demand multi-model serving uses lazy loading: models are fetched from object storage on first request and cached in memory with LRU eviction, maximizing hardware utilization for long-tail traffic. Multi-deployed endpoints keep multiple model versions loaded concurrently with fixed traffic splits for A/B testing, trading higher memory cost for stable latency. Gateway-level aggregation routes through a reverse proxy to per-model backend pools, maintaining isolation while offering centralized policy control.
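The LRU eviction at the heart of the on-demand pattern can be sketched with an `OrderedDict`. This is a minimal illustration (the `ModelCache` class and its methods are made up for this example), ignoring real-world concerns like memory-based capacity limits and concurrent loads:

```python
from collections import OrderedDict


class ModelCache:
    """Keeps up to `capacity` models loaded; evicts least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._models = OrderedDict()  # insertion order == recency order

    def get_or_load(self, model_id, loader):
        if model_id in self._models:
            self._models.move_to_end(model_id)  # mark as most recently used
            return self._models[model_id]
        model = loader(model_id)  # cold start: fetch + deserialize
        if len(self._models) >= self.capacity:
            self._models.popitem(last=False)  # evict the LRU model
        self._models[model_id] = model
        return model
```

Capacity here is a model count for simplicity; production systems typically evict based on memory pressure, since model sizes vary widely.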

Key Architectural Components

A request router extracts model identity from the request. A model registry tracks metadata like size and version. A model store (typically object storage such as S3) holds the artifacts. A cache layer (in-memory or GPU) holds loaded models. Per-model observability tracks metrics like p50/p95/p99 latency, cache hit rate, and cold-start frequency for each model independently.
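Per-model observability just means keeping one metrics record per model identifier rather than one for the whole endpoint. A minimal sketch (the `ModelMetrics` class is hypothetical; real systems would use a metrics library with streaming percentile estimates rather than storing every sample):

```python
class ModelMetrics:
    """Tracks latency samples and cache hit rate for one model."""

    def __init__(self):
        self.latencies_ms = []
        self.hits = 0
        self.misses = 0

    def record(self, latency_ms, cache_hit):
        self.latencies_ms.append(latency_ms)
        if cache_hit:
            self.hits += 1
        else:
            self.misses += 1

    def percentile(self, p):
        # Naive exact percentile over all recorded samples.
        xs = sorted(self.latencies_ms)
        idx = min(int(len(xs) * p / 100), len(xs) - 1)
        return xs[idx]

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Keeping these per model is what lets you spot that one long-tail model has a 5% hit rate and 15-second cold starts while the endpoint-wide average looks healthy.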

💡 Key Takeaways
Single endpoint serves multiple models by routing requests based on model identifier in metadata or URL path
Amazon SageMaker MME users achieve 3 to 10x cost reduction by consolidating hundreds to thousands of models on tens of instances instead of one per model
On-demand loading maximizes utilization for long-tail models with sparse traffic (under 0.1 QPS) but adds cold-start latency of 100 ms to 20 seconds depending on model size
Multi-deployed pattern keeps all models hot in memory for stable p95 latency, used by Google Vertex AI for A/B testing with 95/5 traffic splits
Gateway aggregation provides strong isolation by routing to dedicated per-model backend pools while exposing a unified external API
📌 Interview Tips
1. Stripe fraud detection serving 200+ merchant-specific models behind one endpoint, each model under 1 QPS, on a shared fleet of 15 GPU instances with on-demand loading
2. Netflix recommendation system using multi-deployed endpoints to canary new ranking models with a 90/10 traffic split, both versions kept resident in memory
3. Meta TorchServe hosting 50 to 100 computer vision models per GPU instance with dynamic batching, routing by model name in request header