Failure Modes: Proxy Leakage and Feedback Loops
Proxy Variable Leakage
You removed race from features, but zip code predicts race with 85% accuracy. You removed gender, but first name predicts gender with 95% accuracy. Models find proxies.

Detection: Train a classifier to predict the protected attribute from the model's features. If AUC exceeds 0.7, significant proxy information exists. If it exceeds 0.9, the features are nearly as informative as the protected attribute itself.

Mitigation: Remove high-correlation proxies, accepting that this may cost accuracy. Adversarial debiasing actively penalizes models that leak protected-attribute information. Feature importance analysis shows which features drive group differences.
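The detection step can be sketched on synthetic data: fit a small logistic regression to predict the protected attribute from the remaining features and measure AUC. The data-generating process and all variable names here are hypothetical, and the numbers (proxy strength, learning rate) are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# Hypothetical setup: a "zip_signal" feature correlates with the
# protected attribute; "income" does not.
protected = rng.integers(0, 2, n)
zip_signal = protected + rng.normal(0, 0.6, n)   # proxy feature
income = rng.normal(50, 10, n)                   # unrelated feature
X = np.column_stack([zip_signal, income])

def auc(scores, labels):
    # Probability a random positive scores above a random negative,
    # counting ties as half.
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

# Logistic regression via gradient descent, predicting the protected
# attribute from the model's features.
Xs = (X - X.mean(0)) / X.std(0)
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(Xs @ w + b)))
    w -= 0.5 * (Xs.T @ (p - protected)) / n
    b -= 0.5 * (p - protected).mean()

score = auc(Xs @ w + b, protected)
print(f"proxy AUC: {score:.2f}")  # well above the 0.7 threshold here
```

The same probing classifier can be rerun after removing or transforming candidate proxies to confirm the AUC actually drops.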
Feedback Loops
Biased predictions create biased outcomes that become biased training data. If a hiring model rejects Group B candidates at higher rates, Group B accumulates fewer success cases in future training data, making the model even more biased against Group B. Over 5 retraining cycles, an initial 5% bias can amplify to 25%.

Detection: Track fairness metrics across model versions. If disparity increases with each retrain, a feedback loop is operating.

Mitigation: Holdout exploration reserves 5-10% of decisions for random assignment regardless of model prediction, injecting unbiased outcomes into future training data. Counterfactual logging estimates what would have happened under different decisions.
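A toy simulation makes the amplification and the holdout mitigation concrete. The multiplicative dynamics below are illustrative assumptions, not an empirical model; the amplification factor is chosen so that 5% grows to roughly 25% over 5 retrains, matching the figures above:

```python
def simulate(cycles=5, explore=0.0, amplification=1.38, bias0=0.05):
    """Toy feedback-loop model: disparity grows multiplicatively each
    retrain; a random-holdout fraction `explore` supplies unbiased
    labels that dampen the loop. Dynamics are assumed, not measured."""
    bias = bias0
    history = [bias]
    for _ in range(cycles):
        bias = (1 - explore) * bias * amplification
        history.append(bias)
    return history

no_explore = simulate()                  # ends near 25% disparity
with_explore = simulate(explore=0.10)    # 10% random holdout dampens it
print(f"after 5 retrains: {no_explore[-1]:.0%} without holdout, "
      f"{with_explore[-1]:.0%} with 10% holdout")
```

Tracking the `history` list per model version is exactly the detection signal described above: a monotonically rising disparity curve across retrains.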
Label Bias
Ground truth labels themselves may be biased. Performance reviews used as labels reflect reviewer bias. Fraud labels may be based on investigations that targeted certain groups. If you optimize for biased labels, you encode that bias into the model.

Detection: Audit the label generation process. If humans assigned labels, check inter-rater agreement broken down by annotator demographics.

Mitigation: Use multiple annotators with diversity requirements. Calibrate against objective outcomes where available. Accept that some bias may be unfixable without changing the labeling process entirely.
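The inter-rater check can be sketched with Cohen's kappa computed separately per group of labeled subjects. The synthetic annotators below are hypothetical: both have the same base accuracy, but one systematically downgrades Group 1, so agreement collapses exactly where the bias operates:

```python
import numpy as np

def cohens_kappa(a, b):
    """Agreement between two label vectors, corrected for chance."""
    a, b = np.asarray(a), np.asarray(b)
    p_observed = (a == b).mean()
    labels = np.unique(np.concatenate([a, b]))
    p_chance = sum((a == k).mean() * (b == k).mean() for k in labels)
    return (p_observed - p_chance) / (1 - p_chance)

rng = np.random.default_rng(1)
n = 4000
group = rng.integers(0, 2, n)      # demographic group of labeled subject
quality = rng.integers(0, 2, n)    # hypothetical true label

# Annotator 1: 90% accurate, no group-dependent behavior.
ann1 = np.where(rng.random(n) < 0.9, quality, 1 - quality)
# Annotator 2: same base accuracy, but downgrades Group 1 members
# to label 0 an extra 30% of the time.
ann2 = np.where(rng.random(n) < 0.9, quality, 1 - quality)
ann2 = np.where((group == 1) & (rng.random(n) < 0.3), 0, ann2)

kappa_g0 = cohens_kappa(ann1[group == 0], ann2[group == 0])
kappa_g1 = cohens_kappa(ann1[group == 1], ann2[group == 1])
print(f"kappa Group 0: {kappa_g0:.2f}, kappa Group 1: {kappa_g1:.2f}")
```

A large per-group gap in kappa flags annotator bias even when overall agreement looks acceptable; the same split can be applied across annotator demographics rather than subject demographics.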