
Continuous Evaluation and Safe Rollout for LLMs

Why Continuous Evaluation Matters

LLMs are probabilistic and nondeterministic, so continuous evaluation and controlled rollouts are essential: you cannot set it and forget it. Production teams maintain golden datasets segmented by scenario: short Q&A, multi-turn conversation, safety edge cases, and domain-specific tasks. These datasets run nightly in CI pipelines and after every model, prompt-template, or provider change. Evaluation combines automated LLM-as-a-judge scoring with periodic human review of sampled outputs.
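The nightly evaluation loop described above can be sketched as follows. This is a minimal illustration, not a real pipeline: the dataset, the model call, and the judge are all stubs, and the scenario names and score floor are my own assumptions.

```python
# Sketch of a nightly golden-dataset evaluation run. A real pipeline would
# call the model under test and an LLM judge over the network; both are
# stubbed here so the control flow is visible.
from statistics import mean

# Golden dataset segmented by scenario, as described above (illustrative cases).
GOLDEN = {
    "short_qa": [{"prompt": "What is the capital of France?", "reference": "Paris"}],
    "multi_turn": [{"prompt": "(multi-turn transcript)", "reference": "(expected reply)"}],
    "safety": [{"prompt": "How do I pick a lock?", "reference": "refuse"}],
}

def model_under_test(prompt: str) -> str:
    # Stand-in for the real model call.
    return "Paris" if "France" in prompt else "stub answer"

def llm_judge(output: str, reference: str) -> float:
    # Stand-in for an LLM-as-a-judge call; returns a 0..1 score.
    return 1.0 if reference.lower() in output.lower() else 0.5

def run_eval(dataset: dict) -> dict:
    """Score every scenario segment and return per-scenario means."""
    return {
        scenario: mean(
            llm_judge(model_under_test(c["prompt"]), c["reference"]) for c in cases
        )
        for scenario, cases in dataset.items()
    }

scores = run_eval(GOLDEN)
# Gate the CI job: fail the build if any scenario drops below its floor.
failed = [s for s, v in scores.items() if v < 0.4]
```

In CI this would run nightly and on every model, prompt, or provider change, with `failed` turned into a non-zero exit code; periodic human review of sampled outputs then backstops the automated judge.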

Shadow and Canary Deployments

Shadow mode routes 1 to 10 percent of live traffic to the new model without serving its results to users, logging outputs for offline comparison. This catches silent failures such as refusal-rate spikes, verbosity changes, or hallucination increases that only surface under real traffic diversity. Canary rollouts then gradually shift traffic to the new model (1 percent, then 5 percent, then 10 percent), with automatic rollback triggered when p95 latency exceeds 2 seconds, refusal rate rises more than 2 percentage points, groundedness score drops below 0.75, or cost per request rises more than 30 percent.
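The four rollback conditions above can be expressed as a single gate. The threshold values come from the text; the function and field names are illustrative assumptions.

```python
# Minimal sketch of the automatic-rollback gate for a canary rollout.
from dataclasses import dataclass

@dataclass
class CanaryMetrics:
    p95_latency_s: float       # seconds
    refusal_rate: float        # fraction, e.g. 0.03 = 3 percent
    groundedness: float        # 0..1 judge score
    cost_per_request: float    # dollars

def should_rollback(canary: CanaryMetrics, baseline: CanaryMetrics) -> list[str]:
    """Return the list of tripped rollback conditions (empty means keep rolling out)."""
    reasons = []
    if canary.p95_latency_s > 2.0:
        reasons.append("p95 latency > 2s")
    if canary.refusal_rate - baseline.refusal_rate > 0.02:
        reasons.append("refusal rate up > 2 pp")
    if canary.groundedness < 0.75:
        reasons.append("groundedness < 0.75")
    if canary.cost_per_request > baseline.cost_per_request * 1.30:
        reasons.append("cost up > 30%")
    return reasons

baseline = CanaryMetrics(p95_latency_s=1.4, refusal_rate=0.02,
                         groundedness=0.85, cost_per_request=0.010)
canary = CanaryMetrics(p95_latency_s=2.3, refusal_rate=0.03,
                       groundedness=0.80, cost_per_request=0.011)
tripped = should_rollback(canary, baseline)  # only the latency condition trips
```

Returning the list of reasons, rather than a bare boolean, makes the rollback decision auditable in incident reviews.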

Vendor Model Update Risks

When a provider ships version x+1, teams often observe semantic style shifts, changed refusal rates, tokenization differences that cause context overflow, or cost changes from longer default outputs. The mitigation is to pin model versions explicitly in production, record decoding parameters (temperature, top-k, top-p) with each trace, and run A/B tests with matched prompts before promoting. Gate promotion on equal-or-better latency, cost, refusal rate, and groundedness at p95.
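A minimal sketch of pinning and trace recording, assuming a generic provider: the model identifier, config fields, and trace schema here are hypothetical, not any specific vendor's API.

```python
# Pin the model version explicitly and attach decoding parameters to every
# trace, so an A/B diff after a vendor update can control for both.
import json
import time

PRODUCTION_CONFIG = {
    "model": "provider-model-2024-06-01",  # pinned version, never "latest" (illustrative id)
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
}

def record_trace(prompt: str, output: str, config: dict) -> str:
    """Serialize one request trace with the exact decoding parameters used."""
    trace = {
        "ts": time.time(),
        "prompt": prompt,
        "output": output,
        **config,  # model version + temperature/top-k/top-p travel with the trace
    }
    return json.dumps(trace)

line = record_trace("hello", "hi there", PRODUCTION_CONFIG)
```

Because every trace carries the pinned version and sampling settings, a post-update regression can be attributed to the model change rather than to a silently drifted default.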

Observability Blind Spots

Over-aggressive log sampling (for example, keeping only 1 percent of events) discards exactly the tail events needed to debug rare hallucinations or prompt-injection attempts. High-cardinality labels explode metrics storage, forcing teams to drop dimensions and thereby hide cohort-specific regressions. Security blind spots emerge when input/output classification is missing, letting toxic or PII-bearing outputs slip through.
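One common remedy, sketched under assumed anomaly criteria (the field names and 2-second cutoff are illustrative), is tail-preserving sampling: keep every anomalous event in full and apply the sample rate only to routine traffic.

```python
# Tail-preserving log sampling: anomalous events always survive, routine
# events are downsampled, so rare failures are still debuggable.
import random

def should_keep(event: dict, sample_rate: float = 0.01) -> bool:
    # Always retain the tail events needed for debugging.
    if event.get("refused") or event.get("injection_flag") or event.get("latency_s", 0) > 2.0:
        return True
    # Routine traffic is sampled down.
    return random.random() < sample_rate

events = [
    {"latency_s": 0.4},        # routine: subject to sampling
    {"latency_s": 3.1},        # slow: always kept
    {"refused": True},         # refusal: always kept
]
# sample_rate=0.0 makes the example deterministic: only tail events survive.
kept = [e for e in events if should_keep(e, sample_rate=0.0)]
```

This keeps storage costs close to a flat 1 percent sample while guaranteeing that refusal spikes and injection attempts remain visible.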

💡 Key Takeaways
- Maintain golden datasets by scenario (short Q&A, multi-turn, safety) and run them nightly in CI and after every model or prompt change, combining LLM-as-a-judge scoring with periodic human review
- Shadow deployments on 1 to 10 percent of traffic detect silent failures (refusal spikes, verbosity changes, hallucinations) without user impact before a canary rollout begins
- Canary rollouts gradually shift 1, 5, then 10 percent of traffic, with automatic rollback when p95 latency exceeds 2 seconds, refusal rate rises more than 2 percentage points, groundedness drops below 0.75, or cost rises more than 30 percent
- Pin model versions explicitly and record decoding parameters (temperature, top-k, top-p) with each trace; vendor updates often cause semantic style shifts, refusal changes, or cost blowouts that require A/B validation
- Over-aggressive log sampling (1 percent) discards the tail events needed to debug rare hallucinations; high-cardinality labels force dropping dimensions and hide cohort-specific regressions like non-English prompt degradation
📌 Interview Tips
1. Meta content moderation: a vendor LLM update changed the refusal rate from 2 percent to 8 percent; a shadow deployment with A/B comparison caught it before full rollout, preventing user-facing failures
2. Netflix recommendation descriptions: shadow mode on 5 percent of traffic detected a 15 percent increase in hallucination rate caused by a stale retrieval index, triggering a rollback and index refresh before the canary stage
3. Uber customer support: the pinned production config used temperature=0.7 and top_p=0.9; a new provider default of temperature=1.0 caused verbose responses costing 40 percent more tokens until a config override was deployed
4. Airbnb search assistant: a canary rollout to 10 percent triggered automatic rollback when p95 latency spiked from 1.8 to 3.2 seconds due to longer context in a new prompt template; reverted within 5 minutes