Core Concept
A model registry is both a versioned store and a metadata database. When training completes, it captures not just the model binary but its full lineage: code commit, feature definitions, data snapshot IDs, hyperparameters, random seeds, hardware fingerprints, metrics, and fairness checks.
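A minimal sketch of what one registry entry might hold, written as a Python dataclass; the class and field names are illustrative, not a real registry API:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ModelLineage:
    """One registry entry: the model binary plus its full provenance."""
    model_version: int
    code_commit: str             # git SHA of the training code
    feature_def_version: str     # version of the feature transform definitions
    data_snapshot_ids: tuple     # immutable IDs of the training data snapshots
    hyperparameters: dict = field(default_factory=dict)
    random_seed: int = 0
    hardware_fingerprint: str = ""   # e.g. GPU model + driver version
    metrics: dict = field(default_factory=dict)         # validation metrics
    fairness_checks: dict = field(default_factory=dict)

# Example entry (values are made up for illustration).
entry = ModelLineage(
    model_version=127,
    code_commit="f3a9c",
    feature_def_version="v12",
    data_snapshot_ids=("snap-2024-01-10", "snap-2024-01-24"),
    hyperparameters={"learning_rate": 0.001, "batch_size": 2048},
    random_seed=42,
    metrics={"auc": 0.912},
)
```

Freezing the dataclass mirrors the key registry property: once recorded, lineage is immutable.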
WHY LINEAGE MATTERS
Models are non-deterministic artifacts. Floating-point differences across hardware, multi-threaded randomness in data loading, and unpinned dependencies can yield different weights from identical code and data. Without captured environment fingerprints and seeds, you cannot reproduce a model to debug production issues or roll back reliably.
💡 Example: If a canary shows 3% CTR drop, you need to know exactly which data snapshot, feature transform version, and training environment produced both candidate and baseline to isolate root cause.
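The two ingredients above, seed pinning and an environment fingerprint, can be sketched with the standard library alone; the hashed fields and the pinned package shown are assumptions (real pipelines would also seed NumPy, PyTorch, etc. and hash a full lockfile):

```python
import hashlib
import platform
import random
import sys

def environment_fingerprint(pinned_deps: dict) -> str:
    """Hash the facts needed to reproduce a run's software environment.

    pinned_deps maps package name -> exact version, assumed to come
    from a lockfile; the package used below is illustrative."""
    parts = [
        platform.system(),
        platform.machine(),          # CPU architecture
        sys.version.split()[0],      # Python interpreter version
    ] + sorted(f"{k}=={v}" for k, v in pinned_deps.items())
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:16]

def seed_everything(seed: int) -> None:
    """Pin the stdlib RNG; a real pipeline also seeds NumPy, PyTorch, etc."""
    random.seed(seed)

seed_everything(42)
first_draw = random.random()
seed_everything(42)
second_draw = random.random()    # identical: same seed, same stream
fingerprint = environment_fingerprint({"numpy": "1.26.4"})
```

Storing the fingerprint alongside the seed in the registry is what makes "retrain the exact same model" a meaningful request later.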
AUTOMATED PROMOTION GATES
The registry enforces promotion policies. A candidate must beat a persisted baseline by pre-agreed margins before it becomes eligible for deployment:
• Statistical: AUC improvement ≥0.5 points, calibration slope within 0.02 on top 5 traffic slices
• Fairness: Maximum precision difference of 2% across protected groups
• Operational: Model size under 500MB (mobile), inference latency under 100ms p99
PRODUCTION IMPLEMENTATION
ML pipelines formalize this with validator components that compare candidate metrics against thresholds and promote only when validation passes. Central experiment registries log every training run, A/B test result, and deployment event, enabling fast root-cause analysis when online metrics shift and supporting compliance audits that require proof of data provenance.
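A toy sketch of the event log such a registry might keep, using an in-memory SQLite table; the schema and event names are illustrative, not any real registry's interface:

```python
import sqlite3

# Toy event log for a central experiment registry (schema illustrative).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (ts TEXT, model_version INTEGER, event TEXT, detail TEXT)"
)

def log_event(ts: str, version: int, event: str, detail: str = "") -> None:
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?, ?)", (ts, version, event, detail)
    )

# Events mirroring the fraud-model example later in this section.
log_event("2024-01-26", 127, "canary_deploy", "5% traffic")
log_event("2024-01-27", 127, "rollback", "p99 latency spike; reverted to v126")

# Root-cause query: reconstruct what happened to version 127, in order.
history = conn.execute(
    "SELECT event, detail FROM events WHERE model_version = ? ORDER BY ts", (127,)
).fetchall()
```

Keying every event by model version is what lets an on-call engineer go from "online metric shifted" to "which deploy caused it" with one query.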
✓ Exhaustive lineage includes the training code hash, data snapshot IDs, feature definitions, hyperparameters, random seeds, hardware fingerprints, and metrics computed on validation slices, for full reproducibility
✓ Promotion gates enforce automated policies before deployment: AUC improvement greater than 0.5 points, calibration slope within 0.02, fairness delta under 2 percent across protected groups, model size under 500MB for mobile
✓ Non-determinism from floating-point hardware differences, multi-threaded randomness, and unpinned dependencies makes environment fingerprints essential for reproducing or rolling back models reliably
✓ Offline metrics can lie: a candidate with 0.85 precision at k=10 on replay logs may degrade to 0.78 in production due to feature freshness lag or distribution shift, requiring online validation
✓ Model size and format matter for deployment: a 2GB TensorFlow SavedModel won't fit on mobile, while a quantized 200MB version runs on-device with a 2 percent accuracy drop but 10x faster inference
✓ Centralized registries like Uber's Michelangelo and Google's Model Registry power dashboards showing training history, currently deployed versions per service, and A/B test results tied to each model version
1. TFX ModelValidator component: compares candidate AUC, precision, and recall against baseline thresholds and slice metrics, promotes only if all gates pass, and stores decision provenance in the ML Metadata store
2. Uber fraud model lineage: model version 127 trained on commit f3a9c with a data snapshot covering 2024 Jan 10 to Jan 24, hyperparams learning_rate 0.001 and batch_size 2048, deployed to a 5% canary on Jan 26, rolled back to v126 on Jan 27 due to a p99 latency spike
3. Netflix recommendation registry: tracks which model versions serve which UI surfaces (homepage, search, post-play), offline replay metrics (precision at 10, NDCG at 20), online A/B test results (CTR lift, watch time), and data snapshot timestamps for compliance audits