Model Registry and Lineage Capture
A model registry is the single source of truth for all trained model artifacts, serving as both a versioned storage system and a metadata database that captures exhaustive lineage for reproducibility and compliance. When a training job completes, it produces not just a model binary but a complete package including training code commit hash, feature definitions, data snapshot IDs (often URIs pointing to immutable partitions), hyperparameters, random seeds, hardware fingerprints like GPU type and driver version, training metrics, validation metrics, calibration plots, and fairness checks computed per traffic slice.
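To make this concrete, here is a minimal sketch of what such a lineage record might look like when attached to a registered model version; the field names are illustrative, not any particular registry's schema.

```python
# Minimal sketch of a lineage record attached to a registered model version.
# Field names are illustrative, not any specific registry's schema.
from dataclasses import dataclass


@dataclass
class ModelLineage:
    model_uri: str                 # versioned location of the model binary
    code_commit: str               # training code commit hash
    data_snapshot_ids: list[str]   # URIs of immutable input data partitions
    feature_defs_version: str      # pinned feature transform version
    hyperparameters: dict          # e.g. {"learning_rate": 1e-3, "batch_size": 2048}
    random_seed: int
    hardware: dict                 # GPU type, driver version, CUDA/cuDNN versions
    training_metrics: dict
    validation_metrics: dict       # including per-slice calibration and fairness checks
```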
This lineage is critical because models are non-deterministic artifacts. Floating point differences across hardware, multi-threaded randomness in data loading, and unpinned dependencies can yield different weights even from identical code and data. Without captured environment fingerprints and seeds, you cannot reproduce a model to debug a production issue or roll back to a known-good state. For example, if a canary shows a 3 percent drop in click-through rate, you need to know exactly which data snapshot, feature transform version, and training environment produced both the candidate and the baseline to isolate the root cause.
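A sketch of the seed pinning and environment fingerprinting this implies, assuming a PyTorch stack (the same idea applies to TensorFlow or JAX):

```python
import platform
import random

import numpy as np
import torch


def pin_seeds(seed: int = 42) -> None:
    """Pin every RNG the training job touches; record `seed` in the lineage."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU and all CUDA devices
    torch.use_deterministic_algorithms(True)  # fail loudly on non-deterministic kernels


def environment_fingerprint() -> dict:
    """Capture the hardware/software fingerprint stored alongside the model."""
    return {
        "python": platform.python_version(),
        "torch": torch.__version__,
        "cuda": torch.version.cuda,
        "cudnn": torch.backends.cudnn.version(),
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
    }
```

Even with pinned seeds, deterministic-algorithm mode is what surfaces the remaining non-determinism: it raises an error on CUDA kernels that have no deterministic implementation instead of silently producing different weights.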
The registry enforces promotion policies through automated gates. A candidate model must beat a persisted baseline by pre-agreed margins, for example an Area Under the Curve (AUC) improvement of at least 0.5 points and a calibration slope within 0.02 of the ideal 1.0 on the top 5 traffic slices, before it becomes eligible for deployment. The policy can also include fairness constraints, like a maximum precision difference of 2 percent across protected groups, and operational constraints, like model size under 500 megabytes for mobile deployment or inference latency under 100 milliseconds at p99 on reference hardware.
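A sketch of how such a gate might be encoded, with the thresholds above hard-coded for illustration (in practice they would live in versioned policy config, and the metric dictionaries would come from the registry):

```python
def passes_promotion_gates(candidate: dict, baseline: dict) -> bool:
    """Return True only if the candidate clears every pre-agreed gate."""
    # Quality gate: beat the persisted baseline by >= 0.5 AUC points.
    if candidate["auc"] - baseline["auc"] < 0.005:
        return False
    # Calibration gate: slope within 0.02 of the ideal 1.0 on every top slice.
    if any(abs(s - 1.0) > 0.02 for s in candidate["slice_calibration_slopes"].values()):
        return False
    # Fairness gate: at most a 2 point precision gap across protected groups.
    precisions = candidate["group_precisions"].values()
    if max(precisions) - min(precisions) > 0.02:
        return False
    # Operational gates: artifact size and p99 latency on reference hardware.
    if candidate["model_size_mb"] > 500 or candidate["p99_latency_ms"] > 100:
        return False
    return True
```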
Google's TensorFlow Extended (TFX) pipelines formalize this with components like ModelValidator (since folded into the Evaluator component) that compare candidate metrics against thresholds, and Pusher, which only promotes if validation passes. Uber's Michelangelo stores lineage in a relational schema tied to each model version and uses it to power dashboards showing training job history, offline metrics over time, and the currently deployed versions per service. Meta's FBLearner Flow and internal platforms log every training run, A/B test result, and deployment event into a central experiment registry, enabling fast root-cause analysis when online metrics shift and supporting compliance audits that require proof of data provenance and bias checks.
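As a sketch of what that TFX flow looks like, assuming TFX 1.x where the validation checks live in the Evaluator component, which "blesses" a candidate that Pusher then promotes (the upstream `example_gen`, `trainer`, and `model_resolver` pipeline nodes are assumed, and the push path is hypothetical):

```python
import tensorflow_model_analysis as tfma
from tfx import v1 as tfx

# Bless the candidate only if its AUC beats the baseline by >= 0.005.
eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key="label")],
    slicing_specs=[tfma.SlicingSpec()],  # add per-slice specs for the top traffic slices
    metrics_specs=[tfma.MetricsSpec(metrics=[
        tfma.MetricConfig(
            class_name="AUC",
            threshold=tfma.MetricThreshold(
                change_threshold=tfma.GenericChangeThreshold(
                    direction=tfma.MetricDirection.HIGHER_IS_BETTER,
                    absolute={"value": 0.005},
                ),
            ),
        ),
    ])],
)

evaluator = tfx.components.Evaluator(
    examples=example_gen.outputs["examples"],        # upstream node, assumed
    model=trainer.outputs["model"],                  # candidate model
    baseline_model=model_resolver.outputs["model"],  # last blessed model, assumed
    eval_config=eval_config,
)

# Pusher promotes only if Evaluator emitted a "blessing" artifact.
pusher = tfx.components.Pusher(
    model=trainer.outputs["model"],
    model_blessing=evaluator.outputs["blessing"],
    push_destination=tfx.proto.PushDestination(
        filesystem=tfx.proto.PushDestination.Filesystem(
            base_directory="/serving/models/ranker",  # hypothetical path
        ),
    ),
)
```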
💡 Key Takeaways
• Exhaustive lineage includes training code hash, data snapshot IDs, feature definitions, hyperparameters, random seeds, hardware fingerprints, and computed metrics on validation slices for full reproducibility
• Promotion gates enforce automated policies before deployment: AUC improvement greater than 0.5 points, calibration slope within 0.02, fairness delta under 2 percent across protected groups, model size under 500 megabytes for mobile
• Non-determinism from floating point hardware differences, multi-threaded randomness, and unpinned dependencies makes environment fingerprints essential to reproduce or roll back models reliably
• Offline metrics can lie: a candidate with 0.85 precision at 10 on replay logs may degrade to 0.78 in production due to feature freshness lag or distribution shift, requiring online validation
• Model size and format matter for deployment: a 2 gigabyte TensorFlow SavedModel won't fit on mobile, while a quantized 200 megabyte version runs on device with a 2 percent accuracy drop but 10x faster inference (see the quantization sketch after this list)
• Centralized registries like Uber's Michelangelo and Google's Model Registry power dashboards showing training history, currently deployed versions per service, and A/B test results tied to each model version
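As an illustration of the size takeaway above, a minimal post-training quantization sketch using TensorFlow Lite (the SavedModel path is hypothetical); weight quantization of this kind typically shrinks the artifact several-fold at a small accuracy cost:

```python
import tensorflow as tf

# Post-training quantization of a SavedModel for on-device serving.
converter = tf.lite.TFLiteConverter.from_saved_model("/models/ranker/v127")  # hypothetical path
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables weight quantization
tflite_model = converter.convert()

with open("ranker_v127_quant.tflite", "wb") as f:
    f.write(tflite_model)
```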
📌 Examples
TFX ModelValidator component: compares candidate AUC, precision, and recall against baseline thresholds and slice metrics, blesses the candidate only if all gates pass, and stores the decision provenance in the ML Metadata (MLMD) store
Uber fraud model lineage: model version 127 trained on commit f3a9c with a data snapshot covering 2024 Jan 10 to Jan 24, hyperparams learning_rate 0.001 and batch_size 2048, deployed to a 5% canary on Jan 26, rolled back to v126 on Jan 27 due to a p99 latency spike (see the registry-query sketch after these examples)
Netflix recommendation registry: tracks which model versions serve which UI surfaces (homepage, search, post-play), offline replay metrics (precision at 10, NDCG at 20), online A/B test results (CTR lift, watch time), and data snapshot timestamps for compliance audits
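To show the shape of the registry interaction behind the Uber-style rollback above, a sketch using MLflow's model registry as a stand-in, assuming MLflow 2.x (the model name and tag keys are hypothetical; in-house registries like Michelangelo's expose analogous lookups):

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Pull the candidate and baseline versions to diff their captured lineage.
candidate = client.get_model_version(name="fraud-ranker", version="127")
baseline = client.get_model_version(name="fraud-ranker", version="126")
print(candidate.tags.get("code_commit"), candidate.tags.get("data_snapshot"))
print(baseline.tags.get("code_commit"), baseline.tags.get("data_snapshot"))

# Roll back by re-pointing the serving alias at the known-good version.
client.set_registered_model_alias(name="fraud-ranker", alias="prod", version="126")
```

Keeping the serving layer pointed at an alias rather than a pinned version number is what makes the rollback a one-line metadata change instead of a redeploy.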