ML Infrastructure & MLOpsModel RegistryHard⏱️ ~3 min

Model Registry Failure Modes and Mitigation Strategies

Critical Failure Mode
Model-code skew is the most dangerous failure: serving code expects features the model was not trained on, causing silent accuracy degradation or hard crashes.

MODEL-CODE SKEW

Service upgrades feature extraction (adds fields, changes preprocessing) but the registry still points to an older model. Example: fraud service adds geolocation features in v2.0 but loads a model trained without them—precision drops from 0.89 to 0.72. Mitigation: store a model signature (feature names, types, preprocessing version). Block promotion if schema mismatches.

STALE POINTERS

Eventual consistency in metadata or cache causes two systems to read different versions. One canary loads v1.24 while others remain on v1.23, invalidating A/B tests. Mitigation: optimistic locking with version fields, event-driven propagation instead of polling, short cache TTLs (30-60s) during rollouts.

⚠️ Warning: Never bind model versions to stages alone. Bind to application releases for strongest guarantees—both code and model deploy together.

REGISTRY OUTAGE

Training finishes but registry control plane is down—registration stalls, deployments partially update. Mitigation: serving continues using last resolved version (hot path never queries registry). Registration writes to durable queue with retry logic. Write-ahead logs ensure no metadata loss.

COLD STARTS

New model promoted but artifacts not replicated to all regions. Distant instances fail to download within 60s timeout and crash-loop. Mitigation: prestage artifacts in regional caches before flipping pointer. Delay promotion until health checks confirm replicas ready. Fallback to cached older version if download fails.

💡 Key Takeaways
Model code skew occurs when service upgrades feature schema but loads old model, causing silent accuracy degradation from 0.89 to 0.72 precision
Stale pointers from eventual consistency cause different instances to load different versions, invalidating A/B test results and user experience
Registry outage stalls deployments if control plane is on critical path, mitigation caches last resolved version and uses durable queues for events
Artifact unavailability in distant regions causes instances to crash loop when download exceeds 60 second startup timeout
Metric gaming with offline evaluation looks good due to leakage but online Key Performance Indicators (KPIs) degrade, requires statistically sound canary tests
Rollback mismanagement leaves bad model serving if instances do not reload or cache TTL is too long, need forced invalidate signal
📌 Interview Tips
1Model code skew: Fraud detection service adds merchant_category feature, model trained without it receives null values, false positive rate jumps 15%
2Stale pointer: Eventual consistency lag of 5 seconds causes 20 out of 200 instances to load v1.24 while others stay on v1.23 during canary window
3Artifact unavailability: New 1.5 GB model promoted but not replicated to Asia Pacific region, 50 instances fail to start within 60 second timeout
4Rollback failure: Registry pointer flipped back to v1.23 but 300 second cache TTL means instances continue serving bad v1.24 for 4 more minutes
← Back to Model Registry Overview
Model Registry Failure Modes and Mitigation Strategies | Model Registry - System Overflow