Failure Modes and Edge Cases in Model Explanations
Explanation Instability
LIME explanations can change under small input perturbations. Change one feature by 0.1% and the top features may reorder completely. This undermines trust: "Why did income matter more for my application but credit score for my neighbor?" SHAP is more stable but not immune. Solution: report confidence intervals. If the report reads "income importance: 0.3 ± 0.2," users understand the uncertainty. The cost is running the explainer 3-5 times per instance instead of once.
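A minimal sketch of the confidence-interval approach. The `explain` function here is a hypothetical stand-in for a stochastic explainer call (LIME or SHAP with sampling); only the wrapper pattern, running the explainer several times and reporting mean ± std per feature, is the point.

```python
import statistics
import random

def explain(instance, seed):
    """Hypothetical stand-in for a LIME/SHAP call: returns a
    feature -> importance mapping. Real explainers are stochastic,
    which we simulate here with seeded Gaussian noise."""
    rng = random.Random(seed)
    base = {"income": 0.30, "credit_score": 0.25, "age": 0.10}
    return {f: v + rng.gauss(0, 0.05) for f, v in base.items()}

def explain_with_ci(instance, rounds=5):
    """Run the explainer several times and report mean +/- std per
    feature, so users see 'income: 0.30 +/- 0.05' instead of a
    single unstable number."""
    runs = [explain(instance, seed=s) for s in range(rounds)]
    return {
        f: (statistics.mean(r[f] for r in runs),
            statistics.stdev(r[f] for r in runs))
        for f in runs[0]
    }

summary = explain_with_ci(instance=None, rounds=5)
for feat, (mean, std) in summary.items():
    print(f"{feat}: {mean:.2f} ± {std:.2f}")
```

A wide interval relative to the mean (as in "0.3 ± 0.2") is itself the signal that an individual ranking should not be trusted.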
Feature Correlation Problems
SHAP and LIME assume feature independence. If income and education are highly correlated (say 0.8), the attribution split between them becomes arbitrary. Perturbing or removing one such feature in isolation is misleading because, in real data, its correlated partners would move with it. Detection: flag pairs above 0.7 correlation as unreliable for individual attribution. Mitigation: group correlated features and explain group importance instead.
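The detection-and-grouping step can be sketched as follows. The feature data is a toy example (hypothetical values where education tracks income closely); the pattern is flagging high-|r| pairs, then merging them into groups with union-find so importance is reported per group.

```python
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def correlated_groups(columns, threshold=0.7):
    """Flag feature pairs with |r| >= threshold and merge them into
    groups (union-find), so attribution is reported per group rather
    than per unreliable individual feature."""
    names = list(columns)
    parent = {f: f for f in names}

    def find(f):
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    flagged = []
    for a, b in combinations(names, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 2)))
            parent[find(a)] = find(b)

    groups = {}
    for f in names:
        groups.setdefault(find(f), []).append(f)
    return flagged, list(groups.values())

# Toy columns (hypothetical): education rises almost linearly with income.
cols = {
    "income":    [30, 45, 60, 80, 100],
    "education": [12, 14, 16, 18, 20],
    "age":       [50, 23, 61, 30, 44],
}
flagged, groups = correlated_groups(cols, threshold=0.7)
```

With these toy values, income and education land in one group and age stays alone, so the explanation would report a single "income/education" importance.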
Adversarial Explanations
Explanations can be manipulated. Attackers can craft inputs that produce misleading explanations while preserving the prediction: the model makes biased decisions, but the explanations hide the bias by attributing it to innocuous features. Detection: compare explanations for protected vs unprotected groups. If explanations differ dramatically while predictions are similar, investigate.
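A sketch of the group-comparison check, under assumptions: the attribution dicts and the 0.3 audit threshold are hypothetical, and L1 distance between mean attribution vectors stands in for whatever divergence measure an auditor actually chooses.

```python
def mean_attribution(explanations):
    """Average per-feature attribution over a list of
    feature -> importance dicts."""
    n = len(explanations)
    return {f: sum(e[f] for e in explanations) / n
            for f in explanations[0]}

def explanation_divergence(group_a, group_b):
    """L1 distance between the two groups' mean attribution
    vectors. A large value alongside similar prediction rates
    is a red flag for masked bias."""
    ma, mb = mean_attribution(group_a), mean_attribution(group_b)
    return sum(abs(ma[f] - mb[f]) for f in ma)

# Toy attributions (hypothetical): same features, weight shifted
# toward an innocuous-looking proxy for one group.
protected   = [{"income": 0.1, "zip_code": 0.5},
               {"income": 0.2, "zip_code": 0.4}]
unprotected = [{"income": 0.5, "zip_code": 0.1},
               {"income": 0.4, "zip_code": 0.2}]

gap = explanation_divergence(protected, unprotected)
needs_audit = gap > 0.3  # threshold chosen by the auditor (hypothetical)
```

The check is deliberately cheap: it does not prove manipulation, it only routes suspicious cases to human review.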
Out of Distribution Inputs
Explanations are unreliable outside the training distribution. The model extrapolates unpredictably, and LIME/SHAP outputs become meaningless. A model trained on incomes in the tens of thousands produces nonsense for an income in the millions. Detection: flag inputs far from the training centroid. Warn, or refuse to explain entirely.
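The centroid-distance flag can be sketched like this. The training rows, features (income, credit score), and the threshold of 3.0 are illustrative assumptions; the pattern is a per-feature z-scored Euclidean distance from the training centroid, with explanations refused or caveated above the threshold.

```python
def centroid_and_scale(rows):
    """Per-feature mean and std of the training data."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    std = [(sum((r[j] - mean[j]) ** 2 for r in rows) / n) ** 0.5 or 1.0
           for j in range(d)]
    return mean, std

def ood_score(x, mean, std):
    """Normalized Euclidean distance from the training centroid;
    large values mean the explainer would be extrapolating."""
    return sum(((xi - m) / s) ** 2
               for xi, m, s in zip(x, mean, std)) ** 0.5

# Toy training data (hypothetical): [income, credit_score] rows.
train = [[40_000, 650], [55_000, 700], [70_000, 720], [85_000, 760]]
mean, std = centroid_and_scale(train)

in_dist = ood_score([60_000, 710], mean, std)       # near the centroid
far_out = ood_score([10_000_000, 710], mean, std)   # millions: way out
THRESHOLD = 3.0  # illustrative; tune on held-out data
```

An input like the multi-million-dollar income scores hundreds of normalized deviations out, so the system should warn or decline to explain rather than emit a confident-looking attribution.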