
Continuous Evaluation and Safe Rollout for LLMs

Large Language Models (LLMs) are probabilistic and nondeterministic, which makes continuous evaluation and controlled rollouts essential; you cannot set and forget. Production teams maintain golden datasets segmented by scenario: short question-answer, multi-turn conversation, safety edge cases, and domain-specific tasks. These datasets run nightly in Continuous Integration (CI) pipelines and after every model, prompt-template, or provider change. Evaluation combines automated LLM-as-a-judge scoring with periodic human review of sampled outputs, calibrating judges per domain to reduce bias. Netflix and Meta use this pattern to catch regressions before they reach users.

Shadow deployments and canary rollouts provide the safety net. Shadow mode routes 1 to 10 percent of live traffic to the new model without serving results to users, logging outputs for offline comparison. This detects silent failures such as refusal-rate spikes, verbosity changes, or hallucination increases that only surface with real traffic diversity. Canary rollouts gradually shift 1 percent, 5 percent, then 10 percent of traffic to the new model, with automatic rollback triggered when p95 latency exceeds a threshold (for example, 2 seconds), refusal rate increases by more than 2 percentage points, groundedness score drops below 0.75, or cost per request rises more than 30 percent. For high-variance LLMs, longer shadow periods on representative traffic are necessary because rare edge cases and long-tail prompts only appear at scale.

Vendor model updates introduce unique risks. When a provider ships version X+1, teams observe semantic style shifts, changed refusal rates, tokenization differences causing context overflow, or cost changes from longer default outputs. If not caught, these changes break the downstream user experience or blow cost budgets. The mitigation is pinning model versions explicitly in production, recording decoding parameters (temperature, top-k, top-p) with each trace, and running A/B tests with matched prompts before promoting. Gate promotion on equal-or-better latency, cost, refusal rate, and groundedness at p95, not just average metrics.

Failure modes include observability blind spots and feedback-loop risks. Over-aggressive log sampling (for example, 1 percent sampling) removes exactly the tail events needed to debug rare hallucinations or prompt-injection attempts. High-cardinality labels explode metrics storage, forcing teams to drop dimensions and hide cohort-specific regressions (for example, non-English prompts degrading faster). Security blind spots emerge when input/output classification is missing, letting toxic or Personally Identifiable Information (PII) outputs slip through. Abuse spikes from Distributed Denial of Service (DDoS) attacks or scraping manifest as sudden Application Programming Interface (API) latency anomalies, requiring rate limiting and abuse classification integrated into the observability stack.
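A rollback gate like the one described above can be expressed as a simple comparison between baseline and canary metrics. The sketch below is a minimal illustration in Python: the threshold values mirror the examples in this section, while the `CanaryMetrics` structure and field names are assumptions for illustration, not any particular platform's API.

```python
from dataclasses import dataclass

# Thresholds mirror the rollback triggers described above; values are illustrative.
P95_LATENCY_LIMIT_S = 2.0
REFUSAL_RATE_DELTA_LIMIT = 0.02      # +2 percentage points
GROUNDEDNESS_FLOOR = 0.75
COST_DELTA_LIMIT = 0.30              # +30 percent per request

@dataclass
class CanaryMetrics:
    p95_latency_s: float
    refusal_rate: float
    groundedness: float
    cost_per_request: float

def should_rollback(baseline: CanaryMetrics, canary: CanaryMetrics) -> list[str]:
    """Return the list of violated gates; any violation triggers rollback."""
    violations = []
    if canary.p95_latency_s > P95_LATENCY_LIMIT_S:
        violations.append("p95 latency above 2s")
    if canary.refusal_rate - baseline.refusal_rate > REFUSAL_RATE_DELTA_LIMIT:
        violations.append("refusal rate up more than 2 points")
    if canary.groundedness < GROUNDEDNESS_FLOOR:
        violations.append("groundedness below 0.75")
    if canary.cost_per_request > baseline.cost_per_request * (1 + COST_DELTA_LIMIT):
        violations.append("cost per request up more than 30 percent")
    return violations

# Example: canary at 5 percent of traffic compared against the pinned baseline.
baseline = CanaryMetrics(1.4, 0.02, 0.82, 0.010)
canary = CanaryMetrics(1.7, 0.05, 0.80, 0.011)
violations = should_rollback(baseline, canary)
if violations:
    print("rollback:", violations)
```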
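Version pinning and parameter capture can be made concrete by checking the model string and decoding parameters into the production config and stamping them onto every trace. This is a hedged sketch under that assumption; the model identifier, field names, and the print-as-shipping stand-in are illustrative, not a specific vendor's SDK.

```python
import json
import time
import uuid

# Pinned generation config: an explicit model version plus decoding parameters,
# checked into production config rather than relying on provider defaults.
GENERATION_CONFIG = {
    "model": "example-llm-2024-06-01",  # hypothetical pinned version string
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
}

def record_trace(prompt: str, output: str, latency_s: float) -> dict:
    """Attach the exact model version and decoding parameters to every trace
    so A/B comparisons after a vendor update stay apples to apples."""
    trace = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "prompt": prompt,
        "output": output,
        "latency_s": latency_s,
        **GENERATION_CONFIG,
    }
    print(json.dumps(trace))  # stand-in for shipping to the observability backend
    return trace
```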
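The log-sampling caveat above implies keeping tail events at full fidelity while sampling only routine traffic. One possible shape for such a sampler, with illustrative field names assumed for this sketch:

```python
import random

BASE_SAMPLE_RATE = 0.01  # 1 percent of routine traffic

def keep_log(event: dict) -> bool:
    """Always retain slow, refused, or flagged requests; sample the rest."""
    if event.get("latency_s", 0.0) > 2.0:      # tail latency
        return True
    if event.get("refused") or event.get("flagged_injection"):
        return True
    return random.random() < BASE_SAMPLE_RATE
```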
💡 Key Takeaways
Maintain golden datasets by scenario (short question-answer, multi-turn, safety) and run them nightly in CI plus after every model or prompt change, combining LLM-as-a-judge scoring with periodic human review
Shadow deployments on 1 to 10 percent of traffic detect silent failures (refusal spikes, verbosity changes, hallucinations) without user impact before the canary rollout begins
Canary rollouts gradually shift 1 percent, 5 percent, then 10 percent of traffic, with automatic rollback when p95 latency exceeds 2 seconds, refusal rate increases by more than 2 percentage points, groundedness drops below 0.75, or cost rises more than 30 percent
Pin model versions explicitly and record decoding parameters (temperature, top-k, top-p) with each trace; vendor updates often cause semantic style shifts, refusal changes, or cost blowouts that require A/B validation
Over-aggressive log sampling (1 percent) removes the tail events needed to debug rare hallucinations; high-cardinality labels force dropping dimensions and hide cohort-specific regressions such as non-English prompt degradation
📌 Examples
Meta content moderation: a vendor LLM update changed the refusal rate from 2 percent to 8 percent; shadow deployment and A/B comparison caught it before full rollout, preventing user-facing failures
Netflix recommendation descriptions: shadow mode on 5 percent of traffic detected a 15 percent increase in hallucination rate from a stale retrieval index, triggering a rollback and index refresh before the canary stage
Uber customer support: pinned the model version with temperature=0.7 and top_p=0.9 in the production config; a new provider default of temperature=1.0 caused verbose responses costing 40 percent more tokens until the config override was deployed
Airbnb search assistant: a canary rollout to 10 percent triggered automatic rollback when p95 latency spiked from 1.8 seconds to 3.2 seconds due to longer context in the new prompt template; reverted within 5 minutes