ML Model Optimization • Model Caching (Embedding Cache, Result Cache)
Failure Modes: Cache Stampede, Embedding Drift, and False Positives
Production caching systems fail in predictable ways. Understanding these failure modes and their mitigations separates reliable systems from those that amplify problems under load.
Cache stampede occurs when a viral prompt or a cold cache causes many concurrent requests to miss simultaneously and hit the backend. Without request coalescing, a spike from 100 to 10,000 requests per second can overwhelm model capacity. The solution is single-flight deduplication: when the first request misses, subsequent identical requests wait for its result rather than all invoking the model. This requires a lightweight lock or promise per cache key. Alternatively, serve stale entries with a short grace period while one request refreshes the value in the background.
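A minimal asyncio sketch of single-flight deduplication; the `compute` callback standing in for the model call is an assumption:

```python
import asyncio

class SingleFlightCache:
    """In-process cache with single-flight deduplication: concurrent
    misses on the same key share one backend call instead of stampeding."""

    def __init__(self, compute):
        self._compute = compute                  # async fn: key -> value (assumed)
        self._values = {}                        # completed results
        self._inflight = {}                      # key -> shared Future

    async def get(self, key):
        if key in self._values:                  # fast path: cache hit
            return self._values[key]
        if key in self._inflight:                # another request is computing;
            return await self._inflight[key]     # wait for its result
        fut = asyncio.get_running_loop().create_future()
        self._inflight[key] = fut                # claim the key before awaiting
        try:
            value = await self._compute(key)
            self._values[key] = value
            fut.set_result(value)                # wake every waiting request
            return value
        except Exception as exc:
            fut.set_exception(exc)               # propagate failure to waiters
            raise
        finally:
            del self._inflight[key]
```

Under a stampede, 10,000 concurrent `get()` calls for the same key produce exactly one `compute` invocation. The serve-stale variant returns the expired value immediately and schedules the refresh as a background task instead of awaiting it.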
Embedding drift happens when the embedding model or preprocessing changes but the cache is not versioned. Vectors from the old 768-dimensional model do not align with queries from the new 1536-dimensional model. Semantic cache hit rates collapse, and false positive rates spike because the mismatched geometry produces meaningless similarity scores that are sometimes spuriously high. The mitigation is mandatory namespacing by model version and preprocessing version. After an upgrade, old entries become inaccessible and the cache warms naturally. Some teams run dual caches during migration, gradually shifting traffic as the new cache fills.
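A sketch of the key scheme; the version identifiers and the `emb:` prefix are illustrative:

```python
import hashlib

# Illustrative version identifiers; bump either one on any upgrade.
EMBED_MODEL = "text-embedding-3-large"
PREPROC_VERSION = "v2"

def embedding_cache_key(text: str) -> str:
    """Namespace every entry by model and preprocessing version so vectors
    from incompatible embedding spaces can never be compared."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"emb:{EMBED_MODEL}:{PREPROC_VERSION}:{digest}"
```

A dual-cache migration reads under both the old and new prefixes but writes only the new one, so traffic shifts naturally as the new namespace fills.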
Semantic false positives are the hardest to detect. Short prompts like "hi" or "thanks" score high similarity across unrelated contexts. A question about the return policy for shoes might match a cached answer about jacket returns at 0.82 similarity, serving wrong information. Mitigate with minimum prompt length checks (require 5 to 10 words), metadata alignment (same product category, same tenant, same locale), and a lightweight verifier model that checks whether the cached answer actually addresses the new prompt. Track verifier disagreements as your false positive metric.
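A sketch of that gating logic; the `verify` callable stands in for the lightweight verifier model, and the threshold values are assumptions to tune:

```python
from dataclasses import dataclass
from typing import Callable

MIN_WORDS = 5          # below this, similarity scores are unreliable
SIM_THRESHOLD = 0.92   # tune per model; the 0.82 shoe/jacket match above should fail

@dataclass
class CacheHit:
    answer: str
    tenant: str
    locale: str
    category: str
    similarity: float

def accept_semantic_hit(prompt: str, hit: CacheHit, tenant: str, locale: str,
                        category: str,
                        verify: Callable[[str, str], bool]) -> bool:
    """Apply every gate before serving a semantic-cache hit."""
    if len(prompt.split()) < MIN_WORDS:
        return False                     # short prompt: fall through to the model
    if hit.similarity < SIM_THRESHOLD:
        return False
    if (hit.tenant, hit.locale, hit.category) != (tenant, locale, category):
        return False                     # metadata must align exactly
    # Lightweight verifier: does the cached answer address this prompt?
    # Log rejections here; they are your measured false positive rate.
    return verify(prompt, hit.answer)
```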
Personal data leakage is catastrophic in multi-tenant systems. If the tenant identifier is missing from the cache key, one customer can see another's data. Always partition caches by tenant, encrypt at rest, and use access controls so cross-tenant reads are structurally impossible. For regulated workloads, consider per-tenant cache instances rather than shared infrastructure.
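One way to make the partition structural rather than a convention, sketched over a dict-like store; the backend and key scheme are assumptions:

```python
class TenantCache:
    """A cache handle bound to one tenant at construction time, so a
    cross-tenant read is structurally impossible, not merely discouraged."""

    def __init__(self, store, tenant_id: str):
        self._store = store                      # any dict-like backend (assumed)
        self._prefix = f"tenant:{tenant_id}:"    # partition baked into every key

    def get(self, key: str):
        return self._store.get(self._prefix + key)

    def set(self, key: str, value) -> None:
        self._store[self._prefix + key] = value
```

Request handlers receive a `TenantCache` resolved from the authenticated tenant; no code path ever holds a raw, unpartitioned store.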
💡 Key Takeaways
•Cache stampede on viral prompts or a cold start can cause a 100x backend load spike. Implement single-flight deduplication so one request computes while the others wait, or serve stale entries with background refresh.
•Embedding drift after model upgrades destroys cache utility. A change from 768 to 1536 dimensions or different preprocessing causes a geometry mismatch, collapsing hit rates and spiking false positives. Always namespace cache entries by model version.
•Semantic false positives are highest on short, ambiguous prompts like "hi" or "help". Require a minimum of 5 to 10 words per prompt, enforce metadata alignment (tenant, locale, product category), and run a lightweight verifier to check answer relevance.
•Personal data leakage occurs when the tenant identifier is omitted from the cache key: one customer receives another's cached data. Always partition by tenant, enforce access control in the cache layer, encrypt at rest, and audit for cross-tenant access.
•Stale or unsafe answers can be amplified by result caches. A single hallucination or policy violation gets served repeatedly once cached. Put validators in the write path and require human review for long time-to-live (TTL) entries in sensitive domains (see the sketch after this list).
•Low hit rates on long-tail traffic make global caches ineffective. If 90 percent of prompts are unique, focus on caching embeddings and retrieval results rather than full model responses. Use intent clustering for the head queries.
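For the write-path validators mentioned above, a minimal sketch, assuming a cache client that exposes `set(key, value, ttl=...)` and validator callables that return True when an answer passes:

```python
REVIEW_TTL_CAP = 3600      # seconds; longer-lived entries need human sign-off (assumed policy)

def cache_answer(cache, key, answer, ttl, validators, sensitive=False):
    """Gate the cache write path: a bad answer is rejected once here
    instead of being replayed from cache thousands of times."""
    for passes in validators:            # e.g. policy filter, groundedness check
        if not passes(answer):
            return False                 # never cache an answer that fails a check
    if sensitive and ttl > REVIEW_TTL_CAP:
        ttl = REVIEW_TTL_CAP             # cap TTL until a human approves the entry
    cache.set(key, answer, ttl=ttl)      # assumed client API: set(key, value, ttl=...)
    return True
```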
📌 Examples
A chat system experiences cache stampede during a product launch. A new FAQ prompt receives 8,000 concurrent requests, all of which miss the cache and overload the model cluster. Adding single-flight deduplication limits the backend to one request while the other 7,999 wait, resolving in 2 seconds instead of timing out.
After upgrading from text-embedding-ada-002 to text-embedding-3-large, a RAG system without version namespacing sees its semantic cache hit rate drop from 22 to 4 percent and its false positive rate jump to 15 percent. Adding the model version to cache keys isolates old and new entries, restoring performance.
A multi-tenant support bot accidentally omits the tenant identifier from the cache key. Customer A's question about premium features returns Customer B's cached answer, which contains account details. The post-incident audit adds mandatory tenant partitioning and access control.