Implementing Fairness Metrics in Production ML Pipelines
Where to Compute Fairness Metrics
Compute at multiple stages. During training: monitor metrics on the validation set, broken out by group; if the demographic parity ratio drops below 0.8 (the four-fifths rule used in US disparate-impact analysis), the model may face legal scrutiny. Before deployment: compute on a held-out test set. In production: continuously monitor live predictions, since a model that was fair at deployment can drift as the population changes. Typical cadence: daily checks, weekly deep dives into subgroups.
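A minimal sketch of the validation-stage check, assuming binary predictions and a single protected attribute; `demographic_parity_ratio` is a hypothetical helper, not a library call:

```python
import numpy as np

def demographic_parity_ratio(y_pred, groups):
    """Ratio of the lowest to the highest positive-prediction rate across groups.

    A ratio below 0.8 breaches the four-fifths rule of thumb.
    """
    y_pred = np.asarray(y_pred)
    groups = np.asarray(groups)
    rates = {g: y_pred[groups == g].mean() for g in np.unique(groups)}
    return min(rates.values()) / max(rates.values()), rates

# Illustrative validation-set check (toy data)
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
groups = np.array(["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"])
ratio, rates = demographic_parity_ratio(y_pred, groups)
if ratio < 0.8:
    print(f"Parity ratio {ratio:.2f} below 0.8 -- investigate before deploying")
```

The same function can run unchanged at all three stages; only the data source differs.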
Implementation Architecture
Store predictions together with protected attributes in a separate auditing table, off the main prediction path. This isolates sensitive data behind stricter access controls. Compute metrics asynchronously in batch rather than in real time. Pipeline: predictions flow to Kafka, a consumer writes them to the audit table with encrypted demographics, and a nightly batch job computes metrics for the dashboard. Alerts fire when the demographic parity ratio falls below 0.8 or the equalized odds difference exceeds 0.1.
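The nightly batch step and its two alert thresholds can be sketched as follows. The function name and array-based interface are assumptions; a real job would read yesterday's rows from the audit table rather than take arrays directly:

```python
import numpy as np

# Alert thresholds from the pipeline description
DP_RATIO_MIN = 0.8
EO_DIFF_MAX = 0.1

def _rate(y_pred, mask):
    """Mean prediction over a boolean mask; NaN if the mask selects nothing."""
    return y_pred[mask].mean() if mask.any() else float("nan")

def nightly_fairness_check(y_true, y_pred, groups):
    """Batch-job sketch: compute fairness metrics and return any alert strings."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    gs = np.unique(groups)

    # Demographic parity ratio: min/max positive-prediction rate across groups
    pos_rates = [y_pred[groups == g].mean() for g in gs]
    dp_ratio = min(pos_rates) / max(pos_rates)

    # Equalized odds difference: largest gap in TPR or FPR across groups
    tprs = [_rate(y_pred, (groups == g) & (y_true == 1)) for g in gs]
    fprs = [_rate(y_pred, (groups == g) & (y_true == 0)) for g in gs]
    eo_diff = max(max(tprs) - min(tprs), max(fprs) - min(fprs))

    alerts = []
    if dp_ratio < DP_RATIO_MIN:
        alerts.append(f"demographic parity ratio {dp_ratio:.2f} < {DP_RATIO_MIN}")
    if eo_diff > EO_DIFF_MAX:
        alerts.append(f"equalized odds difference {eo_diff:.2f} > {EO_DIFF_MAX}")
    return alerts
```

Returning alert strings rather than raising keeps the job idempotent; the caller decides whether to page someone or just annotate the dashboard.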
Sample Size Requirements
Metrics require sufficient samples per group. With 1,000 predictions but only 50 in the minority group, the error bars are huge: an observed 5% false positive rate from 50 samples has a confidence interval spanning roughly 1% to 15%. You need at least 200-300 samples per group to detect meaningful differences. For rare groups, aggregate across time windows or use Bayesian methods that remain informative at small sample sizes.
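One way to see how wide those error bars are is a Wilson score interval, which behaves better than the normal approximation at small n; the 3-of-50 numbers below are illustrative:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial rate; stays sane for small n."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return center - half, center + half

# Illustrative: 3 false positives observed among 50 minority-group negatives
lo, hi = wilson_interval(3, 50)  # observed FPR = 6%
print(f"FPR 95% CI: {lo:.1%} to {hi:.1%}")
```

The interval runs from about 2% to about 16%, so a between-group FPR gap of a few points is indistinguishable from noise at this sample size.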
Handling Missing Demographics
Often you lack demographic labels entirely. Options: proxy variables (zip code correlates with race, but this is legally and ethically problematic), voluntary self-reporting (biased toward engaged users), or probabilistic inference (BISG, Bayesian Improved Surname Geocoding, estimates race from surname and geography at roughly 70-90% accuracy). Remember that missing data is not missing at random: users who opt out of self-reporting may differ systematically from those who do not.
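When BISG-style inference yields group membership probabilities rather than hard labels, one option is to weight each prediction by those probabilities instead of forcing a hard assignment. This sketch assumes a row-stochastic probability matrix; the function name is hypothetical:

```python
import numpy as np

def weighted_positive_rates(y_pred, group_probs):
    """Positive-prediction rate per group under probabilistic membership.

    group_probs: (n_samples, n_groups) matrix, rows summing to 1
    (e.g. BISG output). Each prediction contributes to every group,
    weighted by its estimated membership probability.
    """
    y_pred = np.asarray(y_pred, dtype=float)
    P = np.asarray(group_probs, dtype=float)
    return (P * y_pred[:, None]).sum(axis=0) / P.sum(axis=0)

# Illustrative BISG-style probabilities for 4 predictions, 2 groups
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
y_pred = np.array([1, 1, 0, 1])
rates = weighted_positive_rates(y_pred, probs)
```

This avoids discarding uncertain rows, but inherits any bias in the inferred probabilities, so treat the resulting metrics as estimates rather than audited ground truth.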