Failure Modes: Label Bias, Seasonality, and Slice Degradation
LABEL BIAS
Labels themselves can be biased, which makes performance metrics misleading: if human labelers are biased, or if the labeling process is inconsistent, measured accuracy does not reflect true accuracy.
Example: Fraud labels come from investigations. Investigators prioritize high-value transactions. Low-value fraud is under-investigated and under-labeled. Model accuracy on low-value transactions appears high but may be low in reality.
Detection: Track labeling patterns across segments. Are some segments labeled more completely? Compare label rates to expected rates from domain knowledge.
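The label-rate comparison can be sketched as follows. This is a minimal illustration, not a production check: the segment names, counts, and the expected investigation rate are all hypothetical.

```python
# Hypothetical (labeled, total) transaction counts per value band.
counts = {
    "low":  (120, 6000),
    "mid":  (400, 3000),
    "high": (900, 1000),
}

# Domain-knowledge estimate of what fraction of transactions
# should receive an investigation label (illustrative value).
EXPECTED_RATE = 0.10

def under_labeled(counts, expected, slack=0.5):
    """Flag segments whose observed label rate falls far below the
    expected rate, suggesting labels there understate true fraud."""
    return [seg for seg, (labeled, total) in counts.items()
            if labeled / total < expected * slack]

# Low-value transactions are labeled at 2% vs. ~10% expected,
# so they are flagged as under-investigated.
flagged = under_labeled(counts, EXPECTED_RATE)
```

Here `slack` controls how far below expectation a segment must fall before it is flagged, trading false alarms against sensitivity.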
Mitigation: Use stratified evaluation. Sample transactions for manual review regardless of model prediction. This provides unbiased ground truth.
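A stratified review sample can be drawn as below. The field names (`value_band`, `model_score`) and the per-stratum budget are illustrative assumptions; the point is that every stratum gets the same review budget regardless of the model's prediction.

```python
import random

random.seed(0)

# Hypothetical transaction stream with a value band and model score.
transactions = [
    {"id": i,
     "value_band": random.choice(["low", "mid", "high"]),
     "model_score": random.random()}
    for i in range(10_000)
]

def sample_for_review(txns, per_stratum=50, key="value_band"):
    """Draw a fixed-size random sample from each stratum, ignoring the
    model prediction, so manual review yields ground truth that is
    unbiased across strata."""
    strata = {}
    for t in txns:
        strata.setdefault(t[key], []).append(t)
    sample = []
    for members in strata.values():
        sample.extend(random.sample(members, min(per_stratum, len(members))))
    return sample

# Low-value transactions get the same review budget as high-value ones,
# counteracting the investigators' prioritization bias.
review_queue = sample_for_review(transactions)
```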
SEASONALITY EFFECTS
Performance naturally varies by season. Holiday shopping patterns differ from normal patterns. A model performing well in January may struggle in December due to seasonal shift, not degradation.
Detection: Compare current metrics to same-period-last-year, not just recent average. Use seasonal decomposition to separate trend from seasonality.
Response: Do not alert on expected seasonal variation. Set seasonally-adjusted thresholds. Retrain with recent seasonal data before high-stakes periods.
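A year-over-year comparison with a seasonal tolerance can be sketched as follows. The metric history and tolerance value are hypothetical; the idea is to alert only when the current metric falls well below the same period last year, not below a recent average.

```python
# Hypothetical monthly accuracy history, keyed by (year, month).
monthly_accuracy = {
    ("2023", "Nov"): 0.91, ("2023", "Dec"): 0.84,
    ("2024", "Nov"): 0.90,
}

def seasonal_alert(current, year, month, history, tolerance=0.03):
    """Alert only if the metric is below the same-month-last-year
    baseline by more than the tolerance, so ordinary seasonal dips
    do not page anyone."""
    baseline = history.get((str(int(year) - 1), month))
    if baseline is None:
        return False  # no seasonal baseline; fall back to other checks
    return current < baseline - tolerance

# December accuracy of 0.83 is close to last December's 0.84: no alert.
ok = seasonal_alert(0.83, "2024", "Dec", monthly_accuracy)
# But 0.78 is a genuine degradation beyond the seasonal pattern.
bad = seasonal_alert(0.78, "2024", "Dec", monthly_accuracy)
```

In practice the baseline could also come from a seasonal decomposition rather than a single prior-year point, but the threshold logic stays the same.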
SLICE DEGRADATION
Aggregate metrics may be stable while specific segments degrade significantly. A model maintaining 90% overall accuracy might drop to 60% accuracy for a specific user segment representing 5% of traffic.
Detection: Track metrics per segment. Define critical segments (high-value users, key product categories, important geographies). Set per-segment thresholds.
Response: Investigate segment-specific issues. Underperforming segments may need segment-specific models or additional training data.
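Per-segment tracking can be sketched as below. The segment names, counts, and thresholds are illustrative; the example shows how a healthy aggregate metric can hide a degraded minority segment.

```python
# Hypothetical labeled predictions: a large healthy segment and a
# small segment that has degraded to 60% accuracy.
predictions = (
    [{"segment": "core",    "correct": True}]  * 900 +
    [{"segment": "core",    "correct": False}] * 100 +   # 90% accurate
    [{"segment": "new_geo", "correct": True}]  * 30  +
    [{"segment": "new_geo", "correct": False}] * 20      # 60% accurate
)

def segment_accuracy(preds):
    """Compute accuracy per segment."""
    totals, hits = {}, {}
    for p in preds:
        s = p["segment"]
        totals[s] = totals.get(s, 0) + 1
        hits[s] = hits.get(s, 0) + p["correct"]
    return {s: hits[s] / totals[s] for s in totals}

def degraded_segments(preds, thresholds):
    """Flag each critical segment whose accuracy is below its own
    threshold, even when the aggregate metric still looks healthy."""
    acc = segment_accuracy(preds)
    return [s for s, t in thresholds.items() if acc.get(s, 0.0) < t]

overall = sum(p["correct"] for p in predictions) / len(predictions)
# Aggregate accuracy is ~88.6%, but the small segment is flagged.
flags = degraded_segments(predictions, {"core": 0.85, "new_geo": 0.80})
```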