
Production Model Registry Architecture and Scale Requirements

Architecture Principle
A production registry separates the control plane (metadata) from the data plane (artifacts). This isolation keeps metadata queries fast even during large artifact transfers.

CONTROL PLANE

Handles registration, promotion, and approval. Critical writes (stage transitions) go to a strongly consistent store; reads are served from replicas or a cache to keep p95 latency under 10ms. Typical load: hundreds of models, thousands of versions, and bursts of hundreds of writes per hour during retraining.
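A minimal sketch of the control plane's critical write: validating a stage transition before it is committed. The stage names (`None`, `Staging`, `Production`, `Archived`), the transition table, and the rule that promotion to `Production` needs at least one recorded approval are all hypothetical; a real registry would persist this state in its strongly consistent store rather than in memory.

```python
from dataclasses import dataclass, field

# Hypothetical stage graph; each registry defines its own.
ALLOWED_TRANSITIONS = {
    "None": {"Staging", "Archived"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),
}

@dataclass
class ModelVersion:
    name: str
    version: int
    stage: str = "None"
    approvals: set = field(default_factory=set)  # approver identities

def transition(mv: ModelVersion, target: str, required_approvals: int = 1) -> None:
    """Validate and apply a stage transition (the write the control plane guards)."""
    if target not in ALLOWED_TRANSITIONS[mv.stage]:
        raise ValueError(f"illegal transition {mv.stage} -> {target}")
    # Promotion to Production is gated on recorded approvals.
    if target == "Production" and len(mv.approvals) < required_approvals:
        raise PermissionError("promotion to Production requires approval")
    mv.stage = target
```

Keeping the allowed moves in one table makes illegal jumps (for example `Archived` back to `Production`) fail loudly instead of depending on caller discipline.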

DATA PLANE

Model binaries live in versioned object storage, replicated across serving regions. Large models (500MB-5GB) are stored in chunks with checksums. Regional caches are warmed ahead of promotion to keep artifact load time under 5 seconds.
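A minimal sketch of chunked storage with checksums, using SHA-256 from Python's standard `hashlib`. The manifest layout and the 64 MiB default chunk size are assumptions for illustration, not a specific object store's format.

```python
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB; an assumed value, tune per store

def chunk_with_checksums(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split an artifact into chunks and record a SHA-256 digest per chunk."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    manifest = {
        "chunk_size": chunk_size,
        "chunks": [hashlib.sha256(c).hexdigest() for c in chunks],
        "artifact_sha256": hashlib.sha256(data).hexdigest(),
    }
    return chunks, manifest

def reassemble(chunks, manifest) -> bytes:
    """Verify every chunk against the manifest, then the whole-artifact digest."""
    if len(chunks) != len(manifest["chunks"]):
        raise ValueError("chunk count mismatch")
    for i, (chunk, expected) in enumerate(zip(chunks, manifest["chunks"])):
        if hashlib.sha256(chunk).hexdigest() != expected:
            raise ValueError(f"checksum mismatch in chunk {i}")
    data = b"".join(chunks)
    if hashlib.sha256(data).hexdigest() != manifest["artifact_sha256"]:
        raise ValueError("artifact checksum mismatch")
    return data
```

Per-chunk digests let a downloader retry only the corrupted chunk instead of the whole multi-gigabyte file.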

💡 Insight: Version resolution happens out-of-band from inference. Services cache the resolved version, prefetch artifacts, and flip atomically, so the inference hot path never queries the registry.
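A minimal sketch of out-of-band version resolution with a TTL cache. `fetch` is a hypothetical callable standing in for a registry client; a background loop would call `refresh_if_stale()`, while request handlers only read `current()`, which never touches the registry. The 60-second default sits inside the 30 to 300 second TTL range mentioned in the takeaways.

```python
import time
from typing import Callable, Optional

class CachedVersionResolver:
    """Resolve which version to serve off the hot path, with a TTL cache."""

    def __init__(self, fetch: Callable[[str], str], model: str, ttl_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch          # hypothetical registry lookup
        self._model = model
        self._ttl_s = ttl_s
        self._clock = clock
        self._version: Optional[str] = None
        self._fetched_at = float("-inf")

    def refresh_if_stale(self) -> bool:
        """Background-loop hook; returns True if the resolved version changed."""
        now = self._clock()
        if now - self._fetched_at < self._ttl_s:
            return False             # cache still fresh, no registry call
        try:
            latest = self._fetch(self._model)
        except Exception:
            return False             # registry unavailable: keep last known version
        self._fetched_at = now
        changed = latest != self._version
        self._version = latest
        return changed

    def current(self) -> Optional[str]:
        """Hot-path read: returns the cached version without calling the registry."""
        return self._version
```

If the registry is down, serving continues on the last resolved version, which is the point of keeping resolution off the request path.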

PROGRESSIVE ROLLOUT

At each rollout stage (5%, 25%, 50%, 100%), the new model preloads in a background slot while the old model keeps serving. Once the new model is warm, traffic flips to it atomically. The old model stays resident for 10-30 minutes so rollback is instant and needs no re-download.
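A minimal per-instance sketch of the two-slot pattern: load and warm the candidate outside the lock, swap a single pointer under the lock, and keep the previous model for a grace period so rollback is another pointer swap. The `load` callable, the injected clock, and the 20-minute default grace period are assumptions for illustration.

```python
import threading
import time
from typing import Any, Callable, Optional

class TwoSlotServer:
    """Active slot serves traffic; the previous model stays resident for rollback."""

    def __init__(self, grace_s: float = 20 * 60, clock: Callable[[], float] = time.monotonic):
        self._lock = threading.Lock()
        self._grace_s = grace_s
        self._clock = clock
        self.active: Optional[tuple] = None     # (version, model)
        self.previous: Optional[tuple] = None
        self._previous_since = 0.0

    def promote(self, version: str, load: Callable[[str], Any]) -> None:
        model = load(version)                   # slow download + warm-up, no lock held
        with self._lock:                        # the flip is a single pointer swap
            self.previous, self.active = self.active, (version, model)
            self._previous_since = self._clock()

    def rollback(self) -> bool:
        with self._lock:
            if self.previous is None:
                return False                    # grace period expired or nothing to roll back to
            self.active, self.previous = self.previous, None
            return True

    def evict_expired(self) -> None:
        """Drop the old model once its grace period has elapsed."""
        with self._lock:
            if (self.previous is not None
                    and self._clock() - self._previous_since >= self._grace_s):
                self.previous = None
```

Holding the lock only for the swap means requests never wait on a multi-second load; the memory cost is two resident models during the grace window.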

SCALE TARGETS

Design for p95 metadata reads under 10ms, p95 writes under 50ms, and 10 Gbps of artifact throughput per region to support parallel rollouts. Registry events are published to a durable queue so pipelines can trigger on approved versions.
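A minimal sketch of the event flow, with Python's in-memory `queue.Queue` standing in for a durable queue (it is not durable; it only shows the shape of the pattern). The event fields and the `"approved"` stage value are hypothetical. Because durable queues typically deliver at least once, the consumer deduplicates on an event id before triggering.

```python
import json
import queue
import uuid

events = queue.Queue()  # in-memory stand-in for a durable queue

def publish_stage_change(model: str, version: int, stage: str) -> None:
    """Registry side: emit an event after the stage change commits."""
    events.put(json.dumps({"id": str(uuid.uuid4()), "model": model,
                           "version": version, "stage": stage}))

def drain_and_trigger(trigger, seen_ids: set) -> int:
    """Pipeline side: trigger once per approved version, ignoring redeliveries."""
    triggered = 0
    while True:
        try:
            raw = events.get_nowait()
        except queue.Empty:
            return triggered
        event = json.loads(raw)
        if event["id"] in seen_ids:      # duplicate delivery
            continue
        seen_ids.add(event["id"])
        if event["stage"] == "approved":
            trigger(event["model"], event["version"])
            triggered += 1
```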

⚠️ Safety: Artifact signing, per-model access control, encryption at rest and in transit, and full audit logging. Optimistic locking prevents conflicts when two promotions of the same model race.
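A minimal sketch of optimistic locking for promotions: each record carries a revision number, and a write succeeds only if the revision is unchanged since the writer read it. The `MetadataStore` class is a toy; its `threading.Lock` stands in for the atomic compare-and-set a strongly consistent store would provide, and the key format is hypothetical.

```python
import threading
from dataclasses import dataclass

class ConflictError(Exception):
    """Raised when another writer changed the record since it was read."""

@dataclass
class Record:
    stage: str
    revision: int

class MetadataStore:
    def __init__(self):
        self._lock = threading.Lock()   # stands in for the store's own atomicity
        self._rows = {}

    def put(self, key: str, stage: str) -> None:
        with self._lock:
            self._rows[key] = Record(stage, 0)

    def get(self, key: str) -> Record:
        with self._lock:
            r = self._rows[key]
            return Record(r.stage, r.revision)   # return a snapshot copy

    def compare_and_set(self, key: str, expected_revision: int, new_stage: str) -> None:
        with self._lock:
            current = self._rows[key]
            if current.revision != expected_revision:
                raise ConflictError(f"{key}: expected rev {expected_revision}, "
                                    f"found {current.revision}")
            self._rows[key] = Record(new_stage, current.revision + 1)

def promote(store: MetadataStore, key: str, new_stage: str) -> None:
    snapshot = store.get(key)   # read stage and revision
    # Approval checks would run here against `snapshot`.
    store.compare_and_set(key, snapshot.revision, new_stage)
```

The losing writer gets a `ConflictError` and can re-read and retry, rather than silently overwriting a promotion it never saw.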
💡 Key Takeaways
The control plane uses a strongly consistent store for promotions with p95 reads under 10ms; the data plane replicates artifacts across regions with p95 load under 5 seconds
Serving systems resolve the model version at startup and cache it with a 30 to 300 second TTL, never querying the registry on the inference request path
Large models take 10 to 60 seconds to download and 5 to 60 seconds to warm, so rollouts preload them in a background slot and flip atomically
Progressive exposure at 5%, 25%, 50%, and 100% keeps the old model resident for a 10 to 30 minute grace period, enabling instant rollback
Scale targets include hundreds of model groups, thousands of versions, hundreds of writes per hour, and 10 Gbps of artifact throughput per region
Safety requires artifact signing, per-model access control, encryption, audit logging, optimistic locking, and disaster recovery with an RTO under 15 minutes
📌 Interview Tips
1. Artifact download: a 1 GB model takes about 15 seconds to download when many instances share a region's 10 Gbps link (a dedicated 10 Gbps link would move it in under a second), plus 40 seconds to warm in memory, for about 55 seconds before it serves traffic
2. Rollout coordination: the service prefetches the new model sha256:b4e3... in the background while serving sha256:a3f2..., flips the pointer when the new model is ready, and keeps the old model loaded for 20 minutes
3. Scale example: 500 models, 5000 total versions, 200 writes per day during retraining, and 3000 reads per minute during a deploy window across 1000 service instances
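The download figures above can be sanity-checked with simple arithmetic. A minimal sketch, assuming decimal units (1 GB = 8 Gb), a link shared equally among concurrent pulls, and no protocol overhead or cache hits; `transfer_seconds` is a hypothetical helper, and real transfers will be slower than these ideal numbers.

```python
def transfer_seconds(size_gb: float, link_gbps: float, concurrent_pulls: int = 1) -> float:
    """Ideal transfer time when `concurrent_pulls` instances share one link equally.

    Ignores protocol overhead, throttling, and cache hits.
    """
    size_gigabits = size_gb * 8                  # 1 GB = 8 Gb (decimal units)
    per_pull_gbps = link_gbps / concurrent_pulls
    return size_gigabits / per_pull_gbps
```

At full line rate a 1 GB model needs 0.8 seconds, so roughly 15 seconds per instance corresponds to about 20 instances sharing the 10 Gbps link. If all 1000 instances pulled the same 1 GB model through one region's 10 Gbps at once, the ideal total would be about 800 seconds, one reason to spread pulls across regions and warm regional caches ahead of promotion.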