Fairness Metrics Failure Modes and Edge Cases
Small sample sizes are the most common failure mode. If a cohort has only 50 positives, flipping just five predictions swings True Positive Rate (TPR) by 0.10, say from 0.80 to 0.70, falsely triggering alerts. Set minimum cell counts of at least 200 positives and 200 negatives before computing equalized odds. Use Wilson confidence intervals or bootstrap resampling to quantify uncertainty, and configure alerting to consider confidence intervals, not just point estimates. A TPR gap of 0.08 with a confidence interval of plus or minus 0.10 is not actionable.
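A minimal sketch of that CI-aware check, assuming illustrative cohort counts; the function names, the 200-positive floor, and the 0.05 gap threshold are placeholders you would tune for your own alerting policy:

```python
import math

def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (e.g., a cohort's TPR)."""
    if total == 0:
        return 0.0, 1.0
    p = successes / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half, center + half

def tpr_gap_actionable(tp_a, pos_a, tp_b, pos_b, min_positives=200, gap_threshold=0.05):
    """Alert only when both cohorts meet the minimum cell size and the
    worst-case gap between their TPR intervals still exceeds the threshold."""
    if pos_a < min_positives or pos_b < min_positives:
        return False  # too few positives: estimates are too noisy to act on
    lo_a, hi_a = wilson_interval(tp_a, pos_a)
    lo_b, hi_b = wilson_interval(tp_b, pos_b)
    worst_case_gap = max(lo_a - hi_b, lo_b - hi_a)  # gap remaining after the intervals overlap
    return worst_case_gap > gap_threshold

# 50 positives per cohort: a 0.10 point-estimate gap (0.80 vs 0.70) is not actionable.
print(tpr_gap_actionable(tp_a=40, pos_a=50, tp_b=35, pos_b=50))  # False
```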
Label bias undermines equalized odds. If ground truth is biased, such as historical arrest data reflecting disparate policing intensity rather than true crime rates, equalized odds will faithfully reproduce that bias. The model learns to equalize error rates on biased labels, which does not remove harm. You need counterfactual labels, outcome audits, or external validity checks to detect and correct label bias. Label delay complicates monitoring. Default labels in credit arrive 60 to 90 days late. Teams use surrogate labels like early delinquency signals, which can drift from final outcomes.
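One way to catch surrogate drift before it distorts the audit is to compare surrogate labels against final outcomes once they mature, per cohort. The sketch below assumes hypothetical column names (early_delinquent, defaulted) and an illustrative 5 percent disagreement threshold:

```python
import pandas as pd

def surrogate_drift_report(df: pd.DataFrame, group_col: str = "cohort",
                           surrogate_col: str = "early_delinquent",
                           final_col: str = "defaulted",
                           max_disagreement: float = 0.05) -> pd.DataFrame:
    """For records old enough to have a final outcome, report how often the
    surrogate label disagrees with it, per cohort."""
    matured = df[df[final_col].notna()].copy()
    matured["disagrees"] = (matured[surrogate_col] != matured[final_col]).astype(int)
    report = (matured.groupby(group_col)["disagrees"]
              .agg(n="count", disagreement_rate="mean")
              .reset_index())
    report["flag"] = report["disagreement_rate"] > max_disagreement
    return report
```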
Sensitive attribute coverage is often incomplete. Self-reported gender or race might be missing for 20 to 50 percent of users. Imputation or proxy inference (inferring attributes from names or zip codes) introduces legal and ethical risk. Some jurisdictions restrict the use of sensitive attributes for decision making, even for fairness auditing. Per-group thresholds to achieve equalized odds may be prohibited. Always document allowed use cases and obtain legal approval.
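A sketch of a coverage gate, assuming a hypothetical 70 percent floor that legal and governance teams would set (not a legal standard): report missingness for declared attributes and refuse to impute.

```python
import pandas as pd

def audit_coverage(df: pd.DataFrame, attribute_col: str, min_coverage: float = 0.70) -> dict:
    """Report declared-attribute coverage and whether the fairness audit should
    proceed. No imputation or proxy inference is performed here."""
    coverage = df[attribute_col].notna().mean()
    return {
        "attribute": attribute_col,
        "coverage": round(float(coverage), 3),
        "proceed": coverage >= min_coverage,
        "note": "Results reflect self-declared attributes only; document the missing rate.",
    }

# Example: gate the gender audit on declared data before computing any group metrics.
# status = audit_coverage(users_df, "self_reported_gender")
```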
Fairness gerrymandering occurs when metrics pass for single attributes but fail for intersections. A model might satisfy demographic parity for gender and race separately, but violate it for intersections like Black women or Asian men. Intersectional cohorts often have smaller sample sizes, increasing variance and privacy risk. Compute intersectional slices where counts permit, using k-anonymity safeguards (do not report metrics for cohorts under 50 users). Feedback loops are common: if a recommender downranks a cohort due to lower engagement, future engagement drops further, reinforcing disparity. Finally, adversarial gaming can emerge when thresholds differ by group. Applicants may misreport attributes to gain advantage, requiring audits that compare declared versus inferred attributes under strict privacy governance.
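The intersectional slicing described above might look like the sketch below, with assumed column names (y_true, y_pred) and the 50-user and 200-count floors from this section: cells under the privacy floor are suppressed entirely, and cells above it but below the statistical floor are reported without rates.

```python
import pandas as pd

def intersectional_rates(df: pd.DataFrame, group_cols: list[str],
                         label_col: str = "y_true", pred_col: str = "y_pred",
                         k_min: int = 50, min_cell: int = 200) -> pd.DataFrame:
    """TPR/FPR per intersectional cohort with k-anonymity suppression."""
    rows = []
    for keys, g in df.groupby(group_cols):
        if len(g) < k_min:
            continue  # suppress entirely: never report cohorts under the privacy floor
        keys = keys if isinstance(keys, tuple) else (keys,)
        pos, neg = g[g[label_col] == 1], g[g[label_col] == 0]
        rows.append({
            **dict(zip(group_cols, keys)),
            "n": len(g),
            # Leave rates empty when the cell is below the statistical minimum.
            "tpr": (pos[pred_col] == 1).mean() if len(pos) >= min_cell else None,
            "fpr": (neg[pred_col] == 1).mean() if len(neg) >= min_cell else None,
        })
    return pd.DataFrame(rows)

# Example: slice by race x gender rather than each attribute alone.
# report = intersectional_rates(scored_df, ["race", "gender"])
```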
💡 Key Takeaways
• Small cohorts (under 200 positives or negatives) produce noisy TPR and FPR estimates. A few errors can swing metrics by 0.10, causing false alerts. Use confidence intervals and minimum cell size requirements
• Label bias breaks equalized odds. If training labels are biased (disparate policing, biased historical decisions), enforcing equal TPR faithfully reproduces that bias without removing harm
• Sensitive attribute coverage is incomplete (20 to 50% missing). Imputation or proxy inference introduces legal risk. Some jurisdictions prohibit using attributes even for fairness audits
• Fairness gerrymandering: Metrics pass for single attributes but fail for intersections (race by gender). Always compute intersectional slices where sample size permits (at least 50 users)
• Feedback loops amplify disparity. Downranking a cohort reduces future engagement and label quality for that cohort, reinforcing bias over time. Monitor engagement trends per cohort
• Adversarial gaming: Users misreport attributes when thresholds differ by group. Build audits comparing declared versus inferred attributes with strict privacy controls
📌 Examples
Credit model with 50 positives per cohort: flipping just 5 predictions swings TPR by 0.10 (roughly 0.75 to 0.85), triggering false alerts. Solution: Require at least 200 positives before computing TPR
Hiring model trained on historical data: Equalizes TPR across gender, but historical labels reflect biased promotion decisions. Solution: Audit outcomes versus qualifications using external benchmarks
Recommender with 30% missing race data: Imputes race from zip code, introduces proxy discrimination. Solution: Document missing data rate, use only declared attributes, obtain legal approval
Fraud model passes parity for race and age separately but fails for young Black users: Intersectional cohort has only 80 samples, above the 50-user reporting floor but below the statistical minimum. Solution: Report the slice but flag its rates as inconclusive until it reaches the 200-count minimum cell size