
Production Failure Modes and Edge Case Handling

UPSTREAM PIPELINE FAILURES

ML features depend on upstream data pipelines. When an upstream job fails silently (producing wrong data without raising errors), the features built on it become corrupted. The model sees garbage inputs and produces garbage predictions.

Detection: Monitor upstream job completion and data freshness. Set per-feature staleness expectations: a feature that should update hourly but has not updated in 3 hours signals a pipeline failure.
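
As a minimal sketch, freshness can be checked against per-feature update expectations. The feature names, intervals, and grace factor below are hypothetical; in practice the last-update timestamps would come from your feature store or pipeline metadata:

```python
import time

# Hypothetical per-feature freshness SLAs (expected update interval, seconds).
FRESHNESS_SLAS = {
    "user_activity_count": 3600,  # should update hourly
    "account_balance": 900,       # should update every 15 minutes
}

def stale_features(last_update_ts: dict[str, float],
                   grace_factor: float = 3.0) -> list[str]:
    """Return features whose age exceeds grace_factor x the expected
    interval (e.g. an hourly feature not updated in 3 hours)."""
    now = time.time()
    return [
        name
        for name, interval in FRESHNESS_SLAS.items()
        if now - last_update_ts.get(name, 0.0) > grace_factor * interval
    ]
```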

Response: Fall back to cached or default values. Alert data engineering. Block model serving if critical features are unavailable.
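
A fallback policy might look like the sketch below: serve the cached last-known-good value, then a neutral default, and refuse to serve at all when a critical feature has neither. The feature names and defaults are illustrative assumptions:

```python
# Hypothetical fallback policy for stale or missing features.
CRITICAL = {"account_balance"}            # no acceptable substitute
DEFAULTS = {"user_activity_count": 0.0}   # safe neutral default

class CriticalFeatureUnavailable(RuntimeError):
    """Raised to block serving rather than emit a garbage prediction."""

def resolve(name, fresh, cache):
    if name in fresh:
        return fresh[name]                # live value
    if name in cache:
        return cache[name]                # last known good value
    if name in DEFAULTS:
        return DEFAULTS[name]             # neutral default
    if name in CRITICAL:
        raise CriticalFeatureUnavailable(name)
    return None
```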

SCHEMA EVOLUTION ISSUES

Upstream data schemas change: new columns are added, columns are renamed, types change. If ML pipelines are not updated in step, they may read the wrong columns or fail to parse the data.

Detection: Validate schema at ingestion. Check column names, types, and cardinality against expectations. Fail fast when schema violations occur.
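
As an illustration, a fail-fast check at ingestion might compare incoming columns, dtypes, and cardinality against a recorded expectation. The schema below is a hypothetical example, not a specific registry's API:

```python
import pandas as pd

# Hypothetical expected schema: column -> (pandas dtype, max cardinality).
EXPECTED = {
    "user_id": ("int64", None),
    "country": ("object", 300),
    "session_length_s": ("float64", None),
}

def validate_schema(df: pd.DataFrame) -> None:
    """Fail fast at ingestion on missing/extra columns, wrong dtypes,
    or cardinality beyond expectations."""
    missing = set(EXPECTED) - set(df.columns)
    extra = set(df.columns) - set(EXPECTED)
    if missing or extra:
        raise ValueError(f"schema mismatch: missing={missing}, extra={extra}")
    for col, (dtype, max_card) in EXPECTED.items():
        if str(df[col].dtype) != dtype:
            raise TypeError(f"{col}: expected {dtype}, got {df[col].dtype}")
        if max_card is not None and df[col].nunique() > max_card:
            raise ValueError(f"{col}: cardinality {df[col].nunique()} exceeds {max_card}")
```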

Prevention: Use schema registries. Require backward compatibility for schema changes. Version schemas and feature definitions together.

CARDINALITY EXPLOSION

Categorical features can take on values in production that never appeared in training. A country feature trained on 50 countries suddenly sees traffic from 200. Unknown categories cause prediction errors or silent degradation.

Detection: Monitor cardinality over time. Alert when new categories appear. Track percentage of requests with unknown categories.

Response: Map unknown categories to an OTHER bucket. Retrain periodically to incorporate new categories. For critical features, fail requests with unknown categories.
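
One way to implement this response is an encoder that maps unseen values to the OTHER bucket while tracking the unknown-rate for alerting. The class and vocabulary below are illustrative, not a specific library's API:

```python
class CategoryEncoder:
    """Map categories unseen at training time to an OTHER bucket and
    track the fraction of unknown values so it can be alerted on."""

    def __init__(self, known: set[str], other: str = "OTHER"):
        self.known = known
        self.other = other
        self.total = 0
        self.unknown = 0

    def encode(self, value: str) -> str:
        self.total += 1
        if value in self.known:
            return value
        self.unknown += 1          # e.g. a country not in the training set
        return self.other

    @property
    def unknown_rate(self) -> float:
        return self.unknown / self.total if self.total else 0.0

# Usage: an encoder trained on known countries sees a new one in production.
enc = CategoryEncoder(known={"US", "DE", "JP"})  # truncated vocabulary
assert enc.encode("US") == "US"
assert enc.encode("XK") == "OTHER"
```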

GRADUAL DEGRADATION

Not all failures are sudden. Data quality can degrade gradually: null rates increasing 0.5% per week, feature means drifting slowly. By the time anyone notices, significant damage is done.
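
Catching this requires trend analysis rather than static point-in-time thresholds. A minimal sketch, assuming weekly null-rate samples and a hypothetical alert threshold on the fitted slope:

```python
import statistics

def weekly_trend(samples: list[float]) -> float:
    """Least-squares slope per week; a small but persistent positive
    slope flags gradual degradation long before any single week's
    value would trip a static threshold."""
    n = len(samples)
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(samples)
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(samples))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den if den else 0.0

# Null rate creeping up roughly 0.5% per week.
rates = [0.010, 0.015, 0.021, 0.025, 0.031, 0.034]
if weekly_trend(rates) > 0.003:  # hypothetical alert threshold
    print("ALERT: null rate trending upward")
```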

⚠️ Key Trade-off: Comprehensive monitoring catches more issues but costs more and produces more alerts. Focus monitoring intensity on features with highest business impact.
💡 Key Takeaways
Upstream failures: monitor job completion and feature freshness; fall back to cached values; block serving if critical
Schema evolution: validate at ingestion against expectations; use schema registries; version together with features
Cardinality explosion: new categories in production cause errors; map unknown to OTHER bucket; retrain to incorporate
📌 Interview Tips
1. Explain how upstream pipeline failures propagate silently to model predictions.
2. Describe handling unknown categorical values: OTHER bucket, alerting, periodic retraining.