What is a Model Registry and Why Production ML Needs It
A model registry is the control plane for machine learning models in production. Think of it as version control for models, but with far richer metadata: it assigns a unique identity to every trained model artifact and tracks its complete lineage, including training data snapshots, code versions, hyperparameters, evaluation metrics, approval status, and deployment bindings. Unlike a shared folder of model files, the registry is the authoritative source of truth that answers critical questions: What model version should serve production traffic right now? How was this model trained, and on what data? What versions exist, and how do they compare on business metrics?
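To make that lineage concrete, here is a minimal sketch of the kind of record a registry might keep per version. The schema and field names are illustrative assumptions, not any particular product's format:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ModelVersion:
    """One immutable registry entry capturing identity and lineage."""
    name: str                    # model group, e.g. "fraud-scorer" (hypothetical)
    version: int                 # monotonically increasing within the group
    content_hash: str            # sha256 of the serialized artifact
    artifact_uri: str            # location of the artifact in object storage
    training_data_snapshot: str  # pointer to the exact data snapshot used
    code_version: str            # git commit of the training code
    hyperparameters: dict = field(default_factory=dict)
    eval_metrics: dict = field(default_factory=dict)  # held-out evaluation results
    stage: str = "registered"    # registered -> candidate -> staging -> production
```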
The registry bridges the gap between training pipelines and serving systems. When a training run completes, it writes a 300 MB to 3 GB artifact to object storage and registers a new model version with a content hash, training metadata, and evaluation results on held-out data. An automated evaluator compares the new version against baselines on business metrics and safety constraints, such as limits on calibration shift and fairness rules. If it passes, the version becomes a candidate for staging. A deployment pipeline watches registry events and orchestrates a progressive rollout, starting at 5% traffic for canary analysis, then 50%, then 100%. During this rollout, serving systems resolve which model to load at startup through a cached lookup, not on every request.
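The registration-and-promotion handoff can be sketched in a few lines. The `registry` client below, with its `put`, `get`, and `set_stage` methods, is a hypothetical interface, and the guardrail thresholds are illustrative:

```python
import hashlib

def register_version(registry, name, artifact_bytes, metadata):
    """Register a new version identified by the content hash of its artifact."""
    content_hash = hashlib.sha256(artifact_bytes).hexdigest()
    return registry.put(
        name=name,
        content_hash=content_hash,
        metadata=metadata,       # data snapshot, code version, hyperparameters
        stage="registered",
    )

def maybe_promote(registry, name, version, baseline):
    """Gate promotion to staging on business metrics and safety constraints."""
    cand = registry.get(name, version)["eval_metrics"]
    base = registry.get(name, baseline)["eval_metrics"]
    passes = (
        cand["auc"] >= base["auc"] - 0.002          # no meaningful regression
        and abs(cand["calibration_shift"]) < 0.01   # calibration guardrail
        and cand["fairness_gap"] <= base["fairness_gap"]
    )
    if passes:
        # The stage change emits the registry event the deployment
        # pipeline watches to begin the 5% canary rollout.
        registry.set_stage(name, version, "staging")
    return passes
```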
At scale, companies like Uber, Airbnb, and Meta rely on centralized model registries to coordinate hundreds of models and thousands of versions. Uber's Michelangelo stores models, metadata, and approval status with deploy-time bindings to prevent model drift. Airbnb's Bighead standardized how teams register models and attach metrics, enabling automated promotion when checks pass. Meta's FBLearner Flow ties models to training data snapshots and code versions for traceability across thousands of models. These systems handle hundreds of writes per day and thousands of reads per minute during deploy windows while maintaining p95 metadata lookup latency below 10 milliseconds.
Without a registry, teams face model-code skew, where services load the wrong model version; inconsistent rollbacks; poor audit trails for compliance; and dangerous race conditions during deployments. The registry adds process overhead and requires metadata discipline, but for any organization running multiple models or requiring reproducibility, it becomes essential infrastructure.
💡 Key Takeaways
• Registry assigns a unique identity to each model version using a content hash and tracks complete lineage, including training data, code, hyperparameters, and evaluation metrics
• Serves as a control plane answering what model to use now, how it was trained, and how versions compare on business metrics
• Production systems resolve the model version at startup with a cached lookup under 10ms p95, not on every inference request, to avoid latency impact (see the sketch after this list)
• Enables safe progressive rollouts starting at 5% traffic for canary analysis, automatically promoting to 100% if guardrails pass
• Prevents model-code skew by binding model versions to application releases and enforcing compatibility checks during promotion
• At scale, handles hundreds of model groups, thousands of versions, hundreds of writes per day, and thousands of reads per minute during deploys
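The cached-lookup pattern in the third takeaway can be shown directly. A minimal sketch, assuming a hypothetical registry client with a `get_stage_binding` method; the 60-second TTL is an illustrative choice:

```python
import time

class CachedResolver:
    """Resolve the production model binding once per TTL, not per request."""

    def __init__(self, registry, ttl_seconds=60):
        self.registry = registry
        self.ttl = ttl_seconds
        self._cache = {}  # model name -> (resolved version, fetch timestamp)

    def resolve(self, name):
        hit = self._cache.get(name)
        if hit and time.monotonic() - hit[1] < self.ttl:
            return hit[0]  # cache hit: no registry round trip on the hot path
        # Cache miss or expired entry: one metadata lookup against the registry.
        version = self.registry.get_stage_binding(name, stage="production")
        self._cache[name] = (version, time.monotonic())
        return version
```

A serving process calls `resolve` at startup and on periodic refresh, so a slow registry lookup never sits on the inference path.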
📌 Examples
Uber Michelangelo: Central model catalog stores models with approval status and deploy-time bindings to prevent model drift across hundreds of models
Airbnb Bighead: Model repository standardizes registration, attaches offline and online metrics, automates promotion when evaluation checks pass
Meta FBLearner Flow: Integrates evaluation, approval, and deployment with registry layer tying models to training data snapshots for traceability
Netflix: Model catalogs integrate with canary systems so model upgrades follow same progressive rollout as service binaries with automated rollback