
Production Model Registry Architecture and Scale Requirements

A production registry separates the control plane from the data plane for reliability. The control plane handles metadata operations such as registration, promotion, and approval, using a strongly consistent store for critical writes such as stage transitions. Read queries are served through read replicas or a cache to achieve p95 latency under 10 milliseconds. The data plane stores model binaries in object storage with versioning enabled and replicates artifacts to every region where serving runs. Large models of 500 MB to 5 GB are stored in chunks, with checksums and signed manifests, for efficient transfer. Regional caches are warmed ahead of promotion to keep p95 artifact load times under 5 seconds and avoid cold start spikes.

Serving systems resolve model versions out of band from the inference request path. At process startup, a service queries the registry for the approved model version in its deployment binding, caches the result with a time to live (TTL) of 30 to 300 seconds, and prefetches the artifact from a nearby cache or regional bucket. A 1 GB model can take 10 to 60 seconds to download and 5 to 60 seconds to warm up in memory, so coordinated rollouts use background loading and atomic flips. During progressive exposure at 5%, 25%, 50%, and 100%, the new model preloads in a background slot while the old model continues serving. Once the new model is warm, the service flips atomically and keeps the old model resident for a grace period of 10 to 30 minutes, which enables instant rollback without another download.

At scale, the registry must handle hundreds of model groups, thousands of versions, and bursts of hundreds of writes per hour during retraining waves. Design targets include p95 metadata read latency under 10 milliseconds, write latency under 50 milliseconds, and artifact throughput of 10 Gbps per region to support parallel rollouts across hundreds of instances.
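The startup resolution, TTL cache, and background-slot flip described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `VersionResolver`, `ModelSlots`, the `fetch` and `load` callables, and the injected `clock` are all hypothetical names introduced here, and a real service would run `swap_to` on a background thread.

```python
import threading
import time
from typing import Any, Callable, Optional, Tuple


class VersionResolver:
    """Caches the approved version for a deployment binding with a TTL."""

    def __init__(self, fetch: Callable[[], str], ttl_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch      # hypothetical registry lookup (assumption)
        self._ttl_s = ttl_s      # the text suggests 30 to 300 seconds
        self._clock = clock
        self._value: Optional[str] = None
        self._expires_at = 0.0

    def approved_version(self) -> str:
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            # Only a cache miss or expiry touches the registry.
            self._value = self._fetch()
            self._expires_at = now + self._ttl_s
        return self._value


class ModelSlots:
    """Active slot serves traffic; the previous slot stays resident for rollback."""

    def __init__(self, load: Callable[[str], Any], grace_s: float = 1200.0,
                 clock: Callable[[], float] = time.monotonic):
        self._load = load        # hypothetical download + warm-up step (assumption)
        self._grace_s = grace_s  # the text suggests 10 to 30 minutes
        self._clock = clock
        self._lock = threading.Lock()
        self._active: Optional[Tuple[str, Any]] = None
        self._previous: Optional[Tuple[str, Any]] = None
        self._previous_expires = 0.0

    def swap_to(self, version: str) -> None:
        # Slow work happens outside the lock so serving is never blocked.
        model = self._load(version)
        with self._lock:
            # The flip itself is a pointer swap under a short lock.
            self._previous = self._active
            self._previous_expires = self._clock() + self._grace_s
            self._active = (version, model)

    def rollback(self) -> bool:
        """Flip back to the resident old model if the grace period has not expired."""
        with self._lock:
            if self._previous is None or self._clock() >= self._previous_expires:
                return False
            self._active, self._previous = self._previous, None
            return True

    def evict_expired(self) -> None:
        """Release the old model once its grace period has passed."""
        with self._lock:
            if self._previous is not None and self._clock() >= self._previous_expires:
                self._previous = None

    def active_version(self) -> Optional[str]:
        with self._lock:
            return self._active[0] if self._active else None
```

Passing the clock in as a parameter is a design choice made for this sketch: it lets the TTL and grace-period behavior be exercised without real waiting.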
Registry events publish to a durable queue so deployment pipelines can watch for approved versions and trigger orchestration. Services subscribe to these events to invalidate caches and prefetch new artifacts asynchronously. Safety mechanisms include artifact signing and verification on load, per-model access control, encryption at rest and in transit, and full audit logging of registration, promotion, and access events. Optimistic locking with version fields prevents concurrent promotions from causing inconsistent states. Disaster recovery tests restore the registry from backups and rehydrate caches in a secondary region, ensuring a recovery time objective (RTO) under 15 minutes.
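As a rough illustration of optimistic locking with a version field, the sketch below rejects a promotion when the record has changed since the caller read it. The `InMemoryRegistry` class, the `row_version` field, and `ConflictError` are hypothetical names for this example; a real control plane would typically express the same check as a conditional update in its strongly consistent store.

```python
import threading
from dataclasses import dataclass


class ConflictError(Exception):
    """Raised when the record changed since the caller last read it."""


@dataclass(frozen=True)
class ModelRecord:
    name: str
    production_version: str
    row_version: int  # incremented on every successful write


class InMemoryRegistry:
    """Stand-in for a strongly consistent metadata store (assumption)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._records = {}

    def create(self, name: str, production_version: str) -> ModelRecord:
        with self._lock:
            record = ModelRecord(name, production_version, row_version=1)
            self._records[name] = record
            return record

    def get(self, name: str) -> ModelRecord:
        with self._lock:
            return self._records[name]

    def promote(self, name: str, new_version: str,
                expected_row_version: int) -> ModelRecord:
        # Compare-and-set: the write succeeds only if nobody else wrote first.
        with self._lock:
            current = self._records[name]
            if current.row_version != expected_row_version:
                raise ConflictError(
                    f"{name}: expected row_version {expected_row_version}, "
                    f"found {current.row_version}"
                )
            updated = ModelRecord(name, new_version, current.row_version + 1)
            self._records[name] = updated
            return updated
```

A caller that receives `ConflictError` would re-read the record and decide whether its promotion still makes sense, rather than silently overwriting the competing change.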
💡 Key Takeaways
Control plane uses a strongly consistent store for promotions and serves reads from replicas or a cache at p95 under 10 ms; data plane replicates artifacts across regions with p95 load under 5 seconds
Serving systems resolve the model version at startup and cache it with a 30 to 300 second TTL, never querying the registry on the inference request path
A 1 GB model can take 10 to 60 seconds to download and 5 to 60 seconds to warm, so rollouts preload in a background slot and flip atomically
Progressive exposure proceeds at 5%, 25%, 50%, and 100%; after the flip, the old model stays resident for a 10 to 30 minute grace period, enabling instant rollback
Scale targets include hundreds of model groups, thousands of versions, hundreds of writes per hour, 10 Gbps artifact throughput per region
Safety requires artifact signing, per model access control, encryption, audit logging, optimistic locking, and disaster recovery under 15 minutes RTO
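The chunked storage with checksums and a signed manifest, verified on load, might look something like the sketch below. It is only an illustration under stated assumptions: the manifest layout and the function names are hypothetical, and an HMAC with a shared key stands in for whatever signature scheme a real registry uses (commonly an asymmetric one).

```python
import hashlib
import hmac
import json


def _chunk_digests(data: bytes, chunk_size: int) -> list:
    """SHA-256 digest of each fixed-size chunk, in order."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def _sign(body: dict, key: bytes) -> str:
    # Canonical JSON so the same body always yields the same signature.
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def build_manifest(data: bytes, chunk_size: int, key: bytes) -> dict:
    """Hypothetical manifest: chunk digests plus a signature over them."""
    body = {
        "chunk_size": chunk_size,
        "size": len(data),
        "chunks": _chunk_digests(data, chunk_size),
    }
    return {"body": body, "signature": _sign(body, key)}


def verify_artifact(data: bytes, manifest: dict, key: bytes) -> bool:
    """Check the manifest signature first, then every chunk checksum."""
    body = manifest["body"]
    if not hmac.compare_digest(_sign(body, key), manifest["signature"]):
        return False
    if len(data) != body["size"]:
        return False
    return _chunk_digests(data, body["chunk_size"]) == body["chunks"]
```

Per-chunk digests also let a downloader re-fetch only the chunk that failed verification instead of the whole artifact.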
📌 Examples
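Artifact download: 1 GB model takes about 15 seconds at roughly 530 Mbps effective per-instance throughput (a share of the 10 Gbps regional budget; a dedicated 10 Gbps link would move 1 GB in under a second), then 40 seconds to warm in memory, total 55 seconds before serving traffic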
Rollout coordination: Service prefetches new model sha256:b4e3... in background while serving sha256:a3f2..., flips pointer when ready, keeps old model loaded for 20 minutes
Scale example: 500 models, 5000 total versions, 200 writes per day during retraining, 3000 reads per minute during deploy window across 1000 service instances
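The numbers in these examples can be sanity-checked with a little arithmetic. The helper names below are made up for this sketch, and it assumes decimal gigabytes (8 gigabits per gigabyte) and no protocol overhead. Under those assumptions, a dedicated 10 Gbps link would move 1 GB in 0.8 seconds, so a 15 second download corresponds to about 0.53 Gbps of effective per-instance throughput.

```python
def transfer_seconds(size_gb: float, throughput_gbps: float) -> float:
    """Seconds to move size_gb gigabytes at throughput_gbps gigabits per second."""
    return size_gb * 8 / throughput_gbps


def effective_gbps(size_gb: float, seconds: float) -> float:
    """Throughput implied by moving size_gb gigabytes in the given time."""
    return size_gb * 8 / seconds


def fleet_transfer_seconds(size_gb: float, instances: int,
                           regional_gbps: float) -> float:
    """Lower bound when every instance pulls its own copy through one regional budget."""
    return transfer_seconds(size_gb * instances, regional_gbps)


def registry_reads_per_minute(instances: int, ttl_s: float) -> float:
    """Steady-state metadata reads if each instance refreshes once per TTL."""
    return instances * 60 / ttl_s


if __name__ == "__main__":
    print(transfer_seconds(1, 10))               # dedicated 10 Gbps link
    print(effective_gbps(1, 15))                 # implied by a 15 second download
    print(fleet_transfer_seconds(1, 200, 10))    # 200 instances sharing 10 Gbps
    print(registry_reads_per_minute(1000, 30))   # 1000 instances, 30 second TTL
```

The fleet figure is the reason regional caches and staged exposure matter: 200 instances each pulling 1 GB through a shared 10 Gbps budget need at least 160 seconds in aggregate, even though a single download is fast.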