Privacy & Fairness in ML › Fairness Metrics (Demographic Parity, Equalized Odds) · Medium · ⏱️ ~2 min

What is Equalized Odds?

Equalized odds is a fairness constraint that requires both the True Positive Rate (TPR) and the False Positive Rate (FPR) to be equal across sensitive groups. Unlike demographic parity, it conditions on the true label: P(Ŷ = 1 | Y = 1, A = a) = P(Ŷ = 1 | Y = 1, A = b) for qualified applicants, and P(Ŷ = 1 | Y = 0, A = a) = P(Ŷ = 1 | Y = 0, A = b) for unqualified ones. This ensures equal error rates conditional on actual qualification.

The metric requires ground-truth labels, which creates operational complexity. In credit scoring, default labels arrive 30 to 90 days after decisions, so teams cannot compute equalized odds in real time. Instead, tools such as Google's TensorFlow Model Analysis and Microsoft's Fairlearn compute TPR and FPR gaps offline in nightly batch jobs that join delayed labels to logged predictions. These jobs process millions of predictions in minutes on distributed compute, slicing by gender, age, region, and intersectional cohorts.

A relaxed variant called equal opportunity enforces only equal TPR, ignoring FPR. This is useful when false negatives are more costly than false positives, such as missing qualified candidates in hiring or denying credit to creditworthy borrowers. Amazon SageMaker Clarify surfaces equal opportunity TPR gaps in model cards required for launch reviews, targeting gaps under 5 percentage points.

The core tradeoff is between equalized odds and calibration. When base rates differ across groups, you cannot achieve both perfect calibration and equalized odds simultaneously. Enforcing equalized odds typically requires different decision thresholds per group or randomized decisions, which can be difficult to explain and may violate regulations in some domains like lending.
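The definitions above reduce to counting confusion-matrix cells per group and comparing rates. The sketch below is a minimal, library-free illustration (function and variable names are my own, not from Fairlearn or TensorFlow Model Analysis):

```python
from collections import defaultdict

def equalized_odds_gaps(y_true, y_pred, groups):
    """Per-group TPR/FPR plus the max TPR and FPR gaps across groups.

    y_true, y_pred: iterables of 0/1 true labels and binary predictions.
    groups: iterable of sensitive-group identifiers (e.g. "A", "B").
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for y, yhat, g in zip(y_true, y_pred, groups):
        c = counts[g]
        if y == 1:
            c["tp" if yhat == 1 else "fn"] += 1   # conditioned on Y = 1
        else:
            c["fp" if yhat == 1 else "tn"] += 1   # conditioned on Y = 0
    rates = {}
    for g, c in counts.items():
        tpr = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else float("nan")
        fpr = c["fp"] / (c["fp"] + c["tn"]) if c["fp"] + c["tn"] else float("nan")
        rates[g] = (tpr, fpr)
    tprs = [r[0] for r in rates.values()]
    fprs = [r[1] for r in rates.values()]
    # Equalized odds holds (approximately) when BOTH gaps are near zero.
    return rates, max(tprs) - min(tprs), max(fprs) - min(fprs)
```

In a production batch job this loop would run over logged predictions joined to delayed labels, but the rate arithmetic is the same.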
💡 Key Takeaways
Equalized odds requires equal TPR and FPR across groups conditional on true labels, ensuring equal error rates for qualified and unqualified subpopulations
Requires delayed labels (30 to 90 days in credit systems), making real-time monitoring impossible. Teams compute weekly or monthly with batch jobs joining labels to logged predictions
Equal opportunity is a relaxed variant enforcing only equal TPR, useful when false negatives are more harmful. Amazon targets TPR gaps under 5 percentage points for model launch
Conflicts with calibration when base rates differ. You cannot have both perfect calibration and equalized odds, forcing a choice between consistent risk thresholds and equal error rates
Enforcement typically requires per-group decision thresholds or randomized decisions, which may be legally prohibited in regulated domains like lending or insurance
Small cohorts cause noisy estimates. Require at least 200 positives and 200 negatives per group before trusting TPR and FPR measurements, using confidence intervals for alerting
📌 Examples
Microsoft hiring system: Evaluates 2 million applicants per year, computes TPR and FPR for 40 to 80 cohorts (region × gender × seniority) with minimum 200 samples per cell
Google fraud detection: TensorFlow Model Analysis slices by gender and age, surfaces TPR gaps in model cards for launch review, blocks promotion if gaps exceed thresholds
Meta credit underwriting: Nightly batch job joins 90 day default labels to 5,000 hourly decisions, computes equalized odds gaps, triggers threshold reoptimization if gaps exceed 0.05
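The threshold reoptimization step in the last example can be sketched as a per-group threshold search that equalizes TPR (the equal-opportunity relaxation). This is a simplified illustration with made-up function names, not the pipeline of any named company:

```python
import math

def pick_threshold_for_tpr(scores, labels, target_tpr):
    """Highest score threshold whose TPR >= target_tpr for one group.

    scores: model scores; labels: 0/1 ground truth. Predict positive
    when score >= threshold.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1),
                       reverse=True)
    if not positives:
        return 0.0
    # Admit the top-k positive scores so that k / len(positives) >= target.
    k = max(1, math.ceil(target_tpr * len(positives)))
    return positives[k - 1]

def per_group_thresholds(scores, labels, groups, target_tpr=0.8):
    """One threshold per sensitive group, all hitting the same TPR target."""
    by_group = {}
    for s, y, g in zip(scores, labels, groups):
        ss, ys = by_group.setdefault(g, ([], []))
        ss.append(s)
        ys.append(y)
    return {g: pick_threshold_for_tpr(ss, ys, target_tpr)
            for g, (ss, ys) in by_group.items()}
```

As the article notes, deploying different thresholds per group equalizes error rates at the cost of calibration, and may be prohibited in some regulated domains.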