Fairness Metrics: Group, Individual, and Calibration Parity
Fairness is not a single metric but a collection of incompatible mathematical definitions. The most fundamental split is between group fairness, which compares aggregate statistics across protected groups, and individual fairness, which requires similar treatment for similar individuals. No single model can simultaneously optimize all fairness definitions, forcing engineers to make explicit tradeoffs based on legal requirements and business context.
Group fairness includes several key metrics. Statistical parity requires equal positive prediction rates across groups, for example a 30% approval rate for both Group A and Group B. The disparate impact ratio is the ratio of these rates, with the four-fifths rule setting 0.8 as a common legal threshold. Equal opportunity requires equal True Positive Rates (TPRs) across groups, ensuring that qualified members of each group have equal chances of a positive outcome. Equalized odds goes further, requiring both equal TPRs and equal False Positive Rates (FPRs). A credit model might show a 75% TPR for Group A but 68% for Group B, a 7 percentage point gap that would violate equal opportunity.
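As a concrete illustration, the sketch below (NumPy only, with illustrative names such as `group_fairness_metrics`) computes the per-group positive rates, TPRs, and FPRs behind these definitions, then derives the disparate impact ratio and the TPR/FPR gaps used for the equal opportunity and equalized odds checks. It assumes every group has both positive and negative labels.

```python
import numpy as np

def group_fairness_metrics(y_true, y_pred, group):
    """Per-group rates plus the aggregate gaps/ratios discussed above.

    y_true, y_pred: binary arrays (0/1); group: array of group labels.
    Assumes each group contains both positive and negative examples.
    """
    per_group = {}
    for g in np.unique(group):
        mask = group == g
        yt, yp = y_true[mask], y_pred[mask]
        per_group[g] = {
            "positive_rate": yp.mean(),      # statistical parity input
            "tpr": yp[yt == 1].mean(),       # equal opportunity input
            "fpr": yp[yt == 0].mean(),       # second half of equalized odds
        }
    rates = [m["positive_rate"] for m in per_group.values()]
    tprs = [m["tpr"] for m in per_group.values()]
    fprs = [m["fpr"] for m in per_group.values()]
    return {
        "per_group": per_group,
        "disparate_impact_ratio": min(rates) / max(rates),  # four-fifths rule: flag if < 0.8
        "tpr_gap": max(tprs) - min(tprs),                   # e.g. 0.75 - 0.68 = 0.07 in the text
        "fpr_gap": max(fprs) - min(fprs),
    }
```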
Calibration parity demands that predictions mean the same thing across groups: if the model assigns a 70% probability, then 70% of individuals with that score should experience the positive outcome, regardless of group membership. Research has proven that calibration parity and equalized odds cannot both be satisfied when base rates differ across groups, so the two constraints are jointly impossible to meet. A criminal justice risk model calibrated around a 30% recidivism rate cannot also maintain equal FPRs when actual recidivism rates differ across demographic groups.
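A minimal way to check calibration parity, assuming binary outcomes and probability scores, is to bin scores per group and compare the mean predicted probability to the observed outcome rate in each bin. The helper below is an illustrative sketch, not a reference implementation.

```python
import numpy as np

def calibration_by_group(y_true, y_score, group, n_bins=10):
    """Per-group calibration curve: (mean predicted score, observed rate) per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    curves = {}
    for g in np.unique(group):
        mask = group == g
        yt, ys = y_true[mask], y_score[mask]
        bin_idx = np.clip(np.digitize(ys, edges) - 1, 0, n_bins - 1)
        curve = []
        for b in range(n_bins):
            in_bin = bin_idx == b
            if in_bin.any():
                # Parity requires predicted ~= observed in every bin, for every group.
                curve.append((ys[in_bin].mean(), yt[in_bin].mean()))
        curves[g] = curve
    return curves
```

Calibration parity holds when the per-group curves roughly coincide: a bin centered near 0.7 should show about a 70% observed rate for every group.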
Production systems at major platforms typically track 5 to 8 fairness metrics simultaneously. Google's credit model monitoring includes the disparate impact ratio with a 0.8 to 1.25 acceptable range, TPR gaps under 5 percentage points, and calibration curves per demographic. Meta's ad ranking systems monitor exposure fairness, tracking impression distributions across creator demographics alongside click-through rate. These systems run daily batch jobs over 100 million to 1 billion predictions, compute confidence intervals with a stratified bootstrap, and alert when metrics exceed thresholds for two consecutive 24-hour windows.
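The sketch below illustrates the two monitoring ideas mentioned above, stratified bootstrap confidence intervals and the two-consecutive-window alert rule, using the 0.8 to 1.25 band from the text. It is a simplified stand-in for a production batch job, not any particular company's pipeline.

```python
import numpy as np

def bootstrap_di_ratio(y_pred, group, n_boot=1000, alpha=0.05, seed=0):
    """Stratified bootstrap CI for the disparate impact ratio.

    Resampling within each group keeps small groups from being washed out.
    """
    rng = np.random.default_rng(seed)
    groups = np.unique(group)
    ratios = []
    for _ in range(n_boot):
        rates = []
        for g in groups:
            idx = np.flatnonzero(group == g)
            resampled = rng.choice(idx, size=idx.size, replace=True)
            rates.append(y_pred[resampled].mean())
        ratios.append(min(rates) / max(rates))
    lo, hi = np.quantile(ratios, [alpha / 2, 1 - alpha / 2])
    return lo, hi

def should_alert(window_cis, lower=0.8, upper=1.25):
    """Fire only when the CI sits outside the acceptable band for two consecutive windows."""
    breached = [hi < lower or lo > upper for lo, hi in window_cis]
    return any(a and b for a, b in zip(breached, breached[1:]))
```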
💡 Key Takeaways
• Incompatibility theorem: Calibration parity and equalized odds cannot both hold when base rates differ across groups, forcing explicit tradeoff choices
• Four-fifths rule: A disparate impact ratio below 0.8 triggers legal scrutiny in employment and lending; a common acceptable range is 0.8 to 1.25
• Production targets: Google credit models enforce TPR gaps under 5 percentage points with 95% confidence, blocking promotion if violated
• Individual versus group: Group constraints can cause similar individuals to receive different outcomes, but individual fairness is hard to measure at the scale of 1 billion predictions per day
• Metric explosion: Monitoring all intersections such as race by sex by age by region multiplies cohorts from 5 to 500+, requiring hierarchical testing and false discovery rate control (see the sketch after this list)
• Base rate sensitivity: A 10 percentage point difference in base rates between groups makes equalized odds 3 to 5 percentage points more costly in overall accuracy
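For the intersectional-cohort problem flagged above, one standard form of false discovery rate control is the Benjamini-Hochberg procedure. The sketch below assumes you already have one p-value per cohort, for example from a two-proportion test on each cohort's TPR gap.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Flag cohorts while controlling the false discovery rate at level q.

    p_values: one p-value per intersectional cohort (e.g. race x sex x age x region).
    Returns a boolean mask aligned with p_values.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m   # BH step-up thresholds q*k/m
    passing = p[order] <= thresholds
    flagged = np.zeros(m, dtype=bool)
    if passing.any():
        k = np.flatnonzero(passing).max()      # largest rank whose p-value passes
        flagged[order[: k + 1]] = True         # reject all hypotheses up to that rank
    return flagged
```

With 500+ cohorts, testing each at a fixed 0.05 threshold would flag many cohorts by chance alone; the step-up procedure keeps the expected share of false alarms among flagged cohorts near q.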
📌 Examples
Credit risk model at 10,000 QPS tracks disparate impact ratio, TPR gap, and calibration curves per demographic, with daily batch jobs completing in under 15 minutes for 100M predictions
Meta ad ranking monitors exposure fairness across creator demographics, accepting a 1 to 3% short-term CTR reduction to equalize impression distribution and improve ecosystem health
Justice risk assessment tools calibrated for 30% recidivism cannot maintain equal FPRs when actual recidivism differs by 15 percentage points across groups