
Detecting Model Drift: Data, Concept, and Semantic Shifts

Three Types of Drift

Drift is the degradation of model behavior when live data distributions, input-output mappings, or external context change over time. It manifests in three layers. Data drift occurs when feature distributions shift; it is detected via PSI, KL divergence, or KS tests on features and embeddings. A PSI above 0.2 to 0.3 warrants investigation, while a PSI above 0.4 signals high risk. Concept drift means the underlying input-output relationship changes, even if input distributions stay stable. Semantic drift in LLMs covers style shifts, refusal-rate changes, hallucinations, and compliance failures, and is often triggered by vendor model updates.
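A minimal PSI check for a single feature can be sketched as below. The quantile binning and the `1e-6` smoothing floor are implementation choices, not from the text; the 0.2-0.3 and 0.4 thresholds are the ones quoted above.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a live window.

    Bin edges come from the baseline's quantiles; proportions are floored
    at 1e-6 to avoid log(0). Rule of thumb from the text:
    PSI > 0.2-0.3 -> investigate, PSI > 0.4 -> high risk.
    """
    # Quantile bin edges from the baseline distribution
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover the full real line

    e_counts = np.histogram(expected, bins=edges)[0].astype(float)
    a_counts = np.histogram(actual, bins=edges)[0].astype(float)

    # Convert counts to proportions with a small floor
    e_pct = np.clip(e_counts / e_counts.sum(), 1e-6, None)
    a_pct = np.clip(a_counts / a_counts.sum(), 1e-6, None)

    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

Because edges are taken from baseline quantiles, a sample drawn from the same distribution lands near-uniformly in all bins and yields a PSI close to zero, while a mean shift concentrates mass in a few bins and inflates the score.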

Layered Detection Approach

Production drift detection uses a layered approach combining statistical signals, model-side signals, semantic alignment checks, and product metrics. Monitor perplexity or loss on a fixed, versioned evaluation set; rising values beyond a control band indicate model or data drift. For LLMs, track embedding similarity between outputs and retrieved sources to measure groundedness. Product metrics like CTR, CSAT, deflection rates, and re-ask rates provide business-level signals.
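The control-band idea above can be sketched as a simple mean-plus-k-sigma check over historical eval runs. The `k=3.0` band width and the `control_band_alert` helper name are assumptions for illustration; the text only says "beyond a control band".

```python
import numpy as np

def control_band_alert(history, current, k=3.0):
    """Flag drift when a metric (e.g. perplexity on a fixed, versioned
    eval set) rises above a mean + k*sigma control band built from
    past runs of the SAME eval set.

    history: sequence of past metric values; current: latest value.
    Returns (alert, upper_bound). k=3.0 is an assumed band width.
    """
    mu, sigma = np.mean(history), np.std(history)
    upper = mu + k * sigma
    return current > upper, upper
```

Keeping the eval set fixed and versioned matters here: if the eval data itself changes between runs, the band no longer isolates model or data drift from benchmark drift.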

Sensitivity vs Alert Fatigue

Low PSI thresholds catch issues early but page teams on benign seasonal shifts. Use multi-window detectors that compare short-term windows (1 day) to medium-term baselines (7 to 14 days) and long-term seasonal patterns (same weekday last month). Composite triggers requiring both statistical drift (PSI greater than 0.3) and quality degradation (perplexity up 10 percent) reduce noise.
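A composite trigger of this kind reduces to an AND over the two conditions. The thresholds (PSI > 0.3, perplexity up 10 percent) follow the text; window construction is left to the caller, and the function name is illustrative.

```python
def composite_drift_alert(psi_short, baseline_ppl, current_ppl,
                          psi_threshold=0.3, ppl_rise=0.10):
    """Composite trigger: page only when BOTH statistical drift
    (short-window PSI vs. the 7-14 day baseline) AND quality
    degradation (perplexity up 10% over baseline) are present.
    """
    statistical = psi_short > psi_threshold
    quality = current_ppl > baseline_ppl * (1 + ppl_rise)
    return statistical and quality
```

Requiring both signals means a benign seasonal shift (high PSI, stable perplexity) or a noisy eval run (perplexity spike, stable features) does not page anyone on its own.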

Silent Drift Failures

Stale retrieval indexes or null inflation in feature pipelines cause groundedness loss; the LLM fills the gaps with hallucinations that pass validation but fail fact checking. Vendor model updates change tokenization, refusal policies, or decoding defaults, causing cost or style shifts that break the downstream user experience. Overly aggressive log sampling removes the tail events needed to debug drift.
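A null-inflation guard can be sketched as a per-feature check against a baseline null rate. The 2x ratio and 5-point absolute-rise thresholds are assumptions for illustration, not from the text.

```python
def null_inflation_alert(baseline_null_rate, window_null_rate,
                         max_ratio=2.0, min_abs=0.05):
    """Silent-failure guard: flag a feature whose null rate in the live
    window has inflated relative to its baseline.

    Requires BOTH an absolute rise (>= 5 points, assumed) and a
    relative rise (>= 2x baseline, assumed) so already-sparse features
    don't trip on noise.
    """
    rose_abs = window_null_rate - baseline_null_rate >= min_abs
    rose_rel = window_null_rate >= baseline_null_rate * max_ratio
    return rose_abs and rose_rel
```

Running a check like this per feature, upstream of the model, catches pipeline breakage before the LLM papers over the missing values with plausible-looking output.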

💡 Key Takeaways
- Data drift is detected via PSI (Population Stability Index), KL divergence, or KS tests: PSI above 0.2 to 0.3 warrants investigation; PSI above 0.4 signals high risk requiring immediate action
- Concept drift manifests as rising perplexity or loss on a fixed, versioned evaluation set, indicating the input-output mapping has changed even if feature distributions remain stable
- Semantic drift in LLMs includes style shifts, refusal-rate changes, and hallucination spikes, often caused by vendor model updates that alter tokenization or decoding defaults
- Layered detection combines statistical signals (PSI, KL), model signals (perplexity), semantic checks (embedding similarity, groundedness), and product metrics (CTR, CSAT, deflection rate) to reduce false positives
- Use multi-window detectors with seasonality-aware baselines: compare a 1-day short-term window against a 7-to-14-day medium-term baseline and the same weekday last month to filter benign seasonal shifts and reduce alert fatigue
📌 Interview Tips
1. Airbnb pricing model: PSI monitoring on booking features flagged a 0.35 PSI spike during the holiday season; combined with a stable conversion rate, this indicated a seasonal shift, not model failure
2. Uber ETA prediction: null inflation in the traffic feature pipeline caused a 15 percent accuracy drop, detected via rising Mean Absolute Error (MAE) before PSI triggered, requiring feature schema validation
3. Meta content moderation: a vendor LLM update changed the refusal rate from 2 percent to 8 percent, caught by a shadow deployment showing a style shift before full rollout, requiring prompt template adjustment
4. Netflix recommendation: embedding similarity between generated descriptions and source metadata dropped from 0.82 to 0.68 after an index refresh, signaling stale retrieval causing hallucinations