
Per-Model Observability: Metrics and Alerting Strategy

Why Per-Model Metrics Matter

In multi-model serving, aggregate metrics hide per-model problems: one hot or misbehaving model can skew fleet-wide averages while other models suffer silently. Effective observability requires per-model metric collection, tracking latency, throughput, cache hit rate, error rate, and resource consumption separately for each model identifier. This enables precise diagnosis (which model is slow?) and fair capacity allocation.
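A small sketch makes the point concrete. Assuming a hypothetical two-model fleet where a slow model receives only 2% of traffic, the fleet-wide p95 looks healthy while the per-model p95 exposes the problem:

```python
import random
from statistics import quantiles

random.seed(0)

# Hypothetical fleet: "ranker" is healthy and hot; "ocr" is slow,
# but only 2% of traffic -- so the aggregate p95 barely sees it.
samples = {
    "ranker": [random.gauss(40, 5) for _ in range(980)],    # ms
    "ocr":    [random.gauss(900, 100) for _ in range(20)],  # ms
}

def p95(latencies_ms):
    """95th percentile via interpolated centiles."""
    return quantiles(latencies_ms, n=100)[94]

fleet = [x for xs in samples.values() for x in xs]
print(f"fleet-wide p95: {p95(fleet):.0f} ms")   # looks healthy
for model, xs in samples.items():
    print(f"{model:>7} p95: {p95(xs):.0f} ms")  # 'ocr' is clearly degraded
```

The fleet-wide percentile stays near the fast model's latency because the degraded model contributes too few samples to move it; only the per-model split surfaces the failure.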

Critical Metrics per Model

Track p50/p95/p99 latency split by hot versus cold path, distinguishing normal inference time from cold-load overhead. A model showing a p50 of 40 ms but a p95 of 3 seconds is exhibiting bimodal latency: roughly 5% of its requests are taking the cold-load path. Cache hit rate (requests served from memory divided by total requests) quantifies cache effectiveness; under 80% suggests the model is thrashing in and out of the cache. Request rate (QPS) and error rate (failures divided by requests) identify traffic hotspots and failing models. Memory footprint and GPU utilization per model reveal resource hogs that may need dedicated capacity.
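The per-model metrics above can be sketched as a minimal in-memory rollup (class and method names here are illustrative; a production system would export these to a time-series backend rather than hold them in process):

```python
from collections import defaultdict
from statistics import quantiles

class ModelMetrics:
    """Per-model rollup: latency split hot/cold, cache hit rate, error rate."""
    def __init__(self):
        self.hot_ms, self.cold_ms = [], []
        self.requests = self.errors = self.cache_hits = 0

    def record(self, latency_ms, cache_hit, error=False):
        self.requests += 1
        self.errors += int(error)
        if cache_hit:
            self.cache_hits += 1
            self.hot_ms.append(latency_ms)   # served from memory
        else:
            self.cold_ms.append(latency_ms)  # includes model-load overhead

    def cache_hit_rate(self):
        return self.cache_hits / self.requests if self.requests else 0.0

    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0

    def p(self, pct, path="hot"):
        """Interpolated percentile for the hot or cold path."""
        xs = self.hot_ms if path == "hot" else self.cold_ms
        if len(xs) < 2:
            return xs[0] if xs else None
        return quantiles(xs, n=100)[pct - 1]

# One rollup per model identifier.
metrics = defaultdict(ModelMetrics)
metrics["bert-base"].record(42.0, cache_hit=True)
metrics["bert-base"].record(3100.0, cache_hit=False)  # cold-load path
```

Splitting hot and cold samples into separate series is the design choice that matters: mixing them produces the misleading bimodal distribution described above.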

Storage and Querying at Scale

Production systems use dimensioned metrics with the model ID as a tag, stored in time-series databases such as Prometheus or CloudWatch. For a fleet of 500 models, this generates 500 × N metric streams (where N is the number of metrics per model), requiring efficient storage and query patterns. Common practice is to pre-aggregate the top-K models (e.g., the top 50 by traffic) into high-resolution dashboards and use lower-resolution or sampled metrics for the long tail.
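A minimal sketch of the top-K selection, assuming a simple counter keyed by model ID (mirroring a Prometheus-style `model_id` label; the model names are made up):

```python
from collections import Counter

# Dimensioned request counts: one series per model_id tag value.
request_counts = Counter()

def top_k_models(counts, k=50):
    """Models kept at high resolution; the long tail is sampled
    or stored at lower resolution to control cardinality cost."""
    return [model for model, _ in counts.most_common(k)]

for model, qps in [("ranker-v3", 900), ("bert-base", 400), ("tiny-clf", 3)]:
    request_counts[model] += qps

print(top_k_models(request_counts, k=2))  # the high-resolution dashboard set
```

The same split-by-traffic idea extends to retention policy: full-resolution histories for the top K, downsampled rollups for everything else.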

Alerting Strategy

Alert on per-model SLO violations: if any single model exceeds its p95 latency target (e.g., 200 ms) for 5 minutes, trigger an alert. Monitor cold-start counts and eviction rates: a sustained cold-load rate above 10% suggests the cache is undersized. The key insight is that a multi-model system is not one black box but N independent services sharing infrastructure, and you must observe it as such to maintain per-model SLOs and debug hotspots or failures isolated to specific models.
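The alerting rules above can be sketched as a simple per-minute evaluation. The thresholds match the examples in the text; the function and alert names are illustrative, and a real deployment would express this as rules in the monitoring system rather than application code:

```python
from statistics import quantiles

P95_TARGET_MS = 200.0   # example per-model SLO
SUSTAIN_MINUTES = 5     # breach must persist this long
COLD_RATE_MAX = 0.10    # sustained cold-load rate above this => cache undersized

def p95(xs):
    return quantiles(xs, n=100)[94] if len(xs) >= 2 else xs[0]

def check_alerts(minute_buckets, cold_loads, total_requests):
    """minute_buckets: per-minute latency-sample lists for one model,
    oldest first. Returns the list of alerts that should fire."""
    alerts = []
    recent = minute_buckets[-SUSTAIN_MINUTES:]
    # p95 SLO: every one of the last 5 minutes must breach to fire.
    if len(recent) == SUSTAIN_MINUTES and all(p95(b) > P95_TARGET_MS for b in recent):
        alerts.append("p95-slo-violation")
    # Cold-load rate: suggests cache undersizing or a traffic surge.
    if total_requests and cold_loads / total_requests > COLD_RATE_MAX:
        alerts.append("cold-load-rate-high")
    return alerts
```

Requiring the breach to hold for the full window keeps a single cold load from paging anyone, while a sustained breach still fires within minutes.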

💡 Key Takeaways
Per-model metrics are essential because aggregate metrics hide individual model problems; one hot model can skew the fleet-wide p95 while other models suffer silently
Critical metrics per model: p50/p95/p99 split by hot versus cold path, cache hit rate (target over 80%), request rate (QPS), error rate, and memory footprint
Bimodal latency (for example, p50 of 40 ms but p95 of 3 seconds) indicates roughly 5% of requests hit the cold-load path; a cache hit rate under 80% suggests the model is thrashing and needs pinning or a larger cache
Use dimensioned metrics with a model ID tag in Prometheus or CloudWatch; pre-aggregate the top 50 models into high-resolution dashboards and sample the long tail to control storage cost
Alert on per-model SLO violations (for example, p95 over 200 ms for 5 minutes) and a cold-load rate over 10%, which signals cache undersizing or a traffic surge
📌 Interview Tips
1. Amazon SageMaker MME emits per-model CloudWatch metrics: ModelInvocationLatency, ModelLoadingTime, ModelCacheHit; a customer sets an alarm when any model's p95 exceeds 500ms for 10 minutes
2. Meta TorchServe deployment tracking 80 models: a Grafana dashboard shows per-model QPS and a p95 latency heatmap; alerts fire when any model's error rate exceeds 1% or its cache hit rate drops below 75%
3. Netflix recommendation system: per-model metrics include prediction_staleness (hours since last retrain), feature_extraction_ms, and inference_ms; alerts trigger if staleness exceeds 48 hours for core ranking models