Production Failure Modes and Edge Case Handling
UPSTREAM PIPELINE FAILURES
ML features depend on upstream data pipelines. When an upstream job fails silently (produces wrong data without errors), features become corrupted. The ML model sees garbage and produces garbage predictions.
Detection: Monitor upstream job completion and data freshness. Set explicit staleness thresholds per feature. A feature that should update hourly but has not updated in three hours indicates a pipeline failure.
Response: Fall back to cached or default values. Alert data engineering. Block model serving if critical features are unavailable.
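The detection-and-fallback logic above can be sketched as a small lookup wrapper. The feature names, thresholds, and the store's record shape are illustrative assumptions, not a real API:

```python
import time

# Hypothetical per-feature staleness thresholds (seconds) and fallback defaults.
MAX_AGE = {"user_spend_1h": 2 * 3600, "session_count": 3600}
DEFAULTS = {"user_spend_1h": 0.0, "session_count": 0}

def resolve_feature(name, store, now=None):
    """Return (value, is_stale). Serve the default when the feature is
    missing or older than its threshold, and flag it for alerting."""
    now = time.time() if now is None else now
    record = store.get(name)  # assumed shape: {"value": ..., "updated_at": epoch_seconds}
    if record is None or now - record["updated_at"] > MAX_AGE[name]:
        return DEFAULTS[name], True
    return record["value"], False
```

The returned flag lets the serving layer decide per feature whether to proceed with the default or block the request entirely when the feature is critical.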
SCHEMA EVOLUTION ISSUES
Upstream data schemas change: new columns are added, columns are renamed, types change. If ML pipelines are not updated in step, they may silently read the wrong columns or fail to parse records.
Detection: Validate schema at ingestion. Check column names, types, and cardinality against expectations. Fail fast when schema violations occur.
Prevention: Use schema registries. Require backward compatibility for schema changes. Version schemas and feature definitions together.
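A minimal fail-fast schema check at ingestion might look like the following sketch. The expected schema is a hypothetical inline dict here; in practice it would come from a schema registry, and the column names are assumptions:

```python
import pandas as pd

# Hypothetical expected schema; a real setup would load this from a registry.
EXPECTED = {"user_id": "int64", "country": "object", "amount": "float64"}

def validate_schema(df, expected=EXPECTED):
    """Raise on missing columns or dtype mismatches before features are built."""
    errors = []
    for col, dtype in expected.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            errors.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    if errors:
        raise ValueError("schema violation: " + "; ".join(errors))
```

Raising immediately, rather than coercing types or dropping columns, keeps a renamed or retyped column from silently corrupting features downstream.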
CARDINALITY EXPLOSION
Categorical features can have new values appear in production that were not in training. A country feature trained on 50 countries suddenly sees traffic from 200 countries. Unknown categories cause prediction errors or silent degradation.
Detection: Monitor cardinality over time. Alert when new categories appear. Track percentage of requests with unknown categories.
Response: Map unknown categories to an OTHER bucket. Retrain periodically to incorporate new categories. For critical features, fail requests with unknown categories.
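The OTHER-bucket response, with the monitoring counter it needs, can be sketched as below. The vocabulary and function name are hypothetical, standing in for whatever encoding the model was trained with:

```python
# Hypothetical vocabulary fixed at training time; unseen values map to OTHER.
KNOWN_COUNTRIES = {"US", "DE", "JP"}

def encode_country(value, known=KNOWN_COUNTRIES, counters=None):
    """Return the category if seen in training, else OTHER, while counting
    unknowns so the unknown-rate can be monitored and alerted on."""
    if value in known:
        return value
    if counters is not None:
        counters["unknown"] = counters.get("unknown", 0) + 1
    return "OTHER"
```

For this to work, the OTHER bucket must also exist at training time (e.g., by remapping rare tail categories to it), so the model has learned something about it.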
GRADUAL DEGRADATION
Not all failures are sudden. Data quality can degrade gradually: null rates increasing 0.5% per week, feature means drifting slowly. Because each individual change is small, threshold alerts tuned for sudden failures never fire, and by the time anyone notices, significant damage is done. Detecting this requires monitoring trends over time, not just point-in-time values.
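One simple way to catch a slow creep like the null-rate example above is to fit a least-squares slope to the weekly series and alert on the trend rather than the level. The threshold is an assumed policy value, not a standard:

```python
def null_rate_trend(rates):
    """Per-week slope of a null-rate series via simple least squares."""
    n = len(rates)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(rates) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, rates))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

def drifting(rates, max_slope=0.002):
    """Flag a sustained upward trend; 0.002 (~0.2 pp/week) is an assumed policy."""
    return null_rate_trend(rates) > max_slope
```

A flat series with noisy spikes stays below the slope threshold, while a steady 0.5%-per-week creep trips it within a few weeks, even though no single week looks alarming.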