Per-Model Observability: Metrics and Alerting Strategy
Why Per-Model Metrics Matter
In multi-model serving, aggregate metrics hide per-model problems: one hot or misbehaving model can skew fleet-wide averages while other models suffer silently. Effective observability requires per-model metric collection that tracks latency, throughput, cache hit rate, error rate, and resource consumption separately for each model identifier. This enables precise diagnosis (which model is slow?) and fair capacity allocation.
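A minimal sketch of what per-model collection looks like: metrics keyed by model identifier rather than aggregated over the fleet. The class and model names here are illustrative, not from any particular serving stack.

```python
from collections import defaultdict

class PerModelMetrics:
    """Tracks request count, errors, and latencies separately per model ID."""
    def __init__(self):
        self.requests = defaultdict(int)
        self.errors = defaultdict(int)
        self.latencies_ms = defaultdict(list)

    def record(self, model_id, latency_ms, ok=True):
        self.requests[model_id] += 1
        if not ok:
            self.errors[model_id] += 1
        self.latencies_ms[model_id].append(latency_ms)

    def error_rate(self, model_id):
        n = self.requests[model_id]
        return self.errors[model_id] / n if n else 0.0

m = PerModelMetrics()
m.record("model-a", 38.0)
m.record("model-a", 41.0, ok=False)
m.record("model-b", 900.0)  # a slow model that a fleet-wide average would hide
print(m.error_rate("model-a"))  # 0.5
```

With per-model keys, the 900 ms outlier on model-b is immediately attributable instead of being diluted into a fleet average.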
Critical Metrics per Model
Track p50/p95/p99 latency split by hot versus cold path, distinguishing normal inference time from cold-load overhead. A model showing a p50 of 40 ms but a p95 of 3 seconds suggests that roughly 5% of its requests are landing on the cold path. Cache hit rate (requests served from memory divided by total requests) quantifies cache effectiveness; a rate under 80% suggests the model is thrashing in and out of memory. Request rate (QPS) and error rate (failures divided by total requests) identify traffic hotspots and failing models. Memory footprint and GPU utilization per model reveal resource hogs that may need dedicated capacity.
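The hot/cold split and cache hit rate can be computed from a per-model request log. The numbers below are synthetic (94 cached 40 ms requests, 6 cold 3-second loads), chosen to mirror the hot-p50 / cold-tail pattern described above:

```python
import math

def percentile(vals, p):
    """Nearest-rank percentile: smallest value with >= p% of samples at or below it."""
    s = sorted(vals)
    k = math.ceil(p / 100 * len(s)) - 1
    return s[max(k, 0)]

# Simulated per-model request log: (latency_ms, served_from_cache)
requests = [(40, True)] * 94 + [(3000, False)] * 6

hot = [lat for lat, cached in requests if cached]
cold = [lat for lat, cached in requests if not cached]
cache_hit_rate = len(hot) / len(requests)

print(f"hot p50 = {percentile(hot, 50)} ms")                           # normal inference time
print(f"overall p95 = {percentile([l for l, _ in requests], 95)} ms")  # dominated by cold loads
print(f"cache hit rate = {cache_hit_rate:.0%}")
```

A 6% cold-path share is enough to drag the overall p95 to the 3-second cold-load time even though the hot-path p50 stays at 40 ms, which is exactly why the two paths should be tracked as separate series.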
Storage and Querying at Scale
Production systems use dimensioned metrics with the model ID as a tag, stored in time-series databases such as Prometheus or CloudWatch. For a fleet of 500 models, this generates 500 × N metric streams (where N is the number of metrics per model), requiring efficient storage and query patterns. Common practice is to pre-aggregate the top-K models by traffic (say, the top 50) into high-resolution dashboards and to use lower-resolution or sampled metrics for the long tail.
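The top-K split can be sketched as a simple partition by traffic. The skewed QPS distribution below is invented for illustration; in practice the ranking would come from the metrics backend itself (e.g. a top-k query over request-rate series):

```python
# Hypothetical fleet of 500 models with heavily skewed traffic (Zipf-like).
qps_by_model = {f"model-{i}": 1000 / (i + 1) for i in range(500)}

K = 50
top_k = sorted(qps_by_model, key=qps_by_model.get, reverse=True)[:K]
top_k_set = set(top_k)
long_tail = [m for m in qps_by_model if m not in top_k_set]

# top_k models get full-resolution metric streams and dashboards;
# long_tail models get lower-resolution or sampled series.
print(len(top_k), len(long_tail))  # 50 450
```

The design choice here is cost control: full-resolution series for every model in a 500-model fleet multiplies storage and query load, while the long tail rarely needs per-second granularity.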
Alerting Strategy
Alert on per-model SLO violations: if any single model exceeds its p95 latency target (say, 200 ms) for 5 minutes, trigger an alert. Monitor cold-start counts and eviction rates: a sustained cold-load rate above 10% suggests the cache is undersized. The key insight is that a multi-model system is not one black box but N independent services sharing infrastructure, and you must observe it as such to maintain per-model SLOs and to debug hotspots or failures isolated to specific models.
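The sustained-breach rule above (p95 over 200 ms for 5 minutes) can be sketched as a per-model streak counter, assuming one evaluation tick per minute. Class and parameter names are illustrative; a real deployment would express this as an alerting rule in the monitoring system rather than application code:

```python
class PerModelSLOAlerter:
    """Fires per model only after `breach_ticks` consecutive breaching
    evaluations (one tick per minute, so 5 ticks ~= 5 minutes)."""
    def __init__(self, p95_target_ms=200.0, breach_ticks=5):
        self.p95_target_ms = p95_target_ms
        self.breach_ticks = breach_ticks
        self.streaks = {}  # model_id -> consecutive breaching ticks

    def evaluate(self, model_id, observed_p95_ms):
        """Record one evaluation tick; return True when the alert should fire."""
        if observed_p95_ms > self.p95_target_ms:
            self.streaks[model_id] = self.streaks.get(model_id, 0) + 1
        else:
            self.streaks[model_id] = 0  # a single healthy tick resets the streak
        return self.streaks[model_id] >= self.breach_ticks

alerter = PerModelSLOAlerter()
fired = [alerter.evaluate("model-a", p95) for p95 in [250, 260, 255, 270, 300]]
print(fired)  # [False, False, False, False, True]
```

Requiring consecutive breaches suppresses one-off spikes (such as a single cold load) while still catching a model that is persistently out of SLO, and the per-model streak state keeps one noisy model from masking or triggering alerts for the rest of the fleet.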