RAG Failure Modes and Production Challenges
Retrieval Quality Failures
Poor embeddings cause semantic mismatches. If your embedding model was trained on general web text but you are searching medical literature with specialized terminology, semantically similar medical concepts may map to distant vectors. A query about "myocardial infarction" might miss documents about "heart attack" if the model does not understand they are synonyms.

Wrong chunking granularity breaks context. Imagine a document that says "Product X is safe for users over 18" chunked into "Product X is safe" (chunk 1) and "for users over 18" (chunk 2). Retrieving only chunk 1 leads to dangerous misinformation. This happens frequently with 200 to 300 token chunks where critical caveats fall into adjacent chunks that do not get retrieved.

Query formulation issues compound the problem. User queries are often vague, ambiguous, or use different vocabulary than documents. "How do I ship code?" might mean deployment pipelines, code review process, or version control workflows. Without query understanding or expansion, retrieval returns irrelevant results.
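One common mitigation for the chunking failure is to overlap adjacent chunks so that a caveat split across a boundary still appears intact in at least one chunk. A minimal sketch, assuming a pre-tokenized document; the chunk and overlap sizes are illustrative, not a recommendation for any particular model:

```python
from typing import List

def chunk_with_overlap(tokens: List[str], chunk_size: int = 250, overlap: int = 50) -> List[List[str]]:
    """Split a token list into fixed-size windows that overlap, so a sentence
    straddling a boundary ("...is safe / for users over 18") still appears
    whole in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Usage with naive whitespace "tokens", purely for illustration.
doc = "Product X is safe for users over 18 when used as directed daily".split()
for c in chunk_with_overlap(doc, chunk_size=8, overlap=3):
    print(" ".join(c))
```

With overlap, the full caveat "Product X is safe for users over 18" survives in a single chunk instead of being split across two.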
Context Overflow and Truncation
A 32,000 token context limit seems generous until you account for system prompts (500 to 1000 tokens), conversation history (2000 to 5000 tokens for multi-turn chats), and instructions (500 tokens). That leaves roughly 24,000 tokens for retrieved documents once you also reserve headroom for the model's answer. If you naively insert 40 chunks of 1000 tokens each, you exceed the limit. Most systems silently truncate, keeping the first N chunks. This means later retrieved documents, which might contain critical information, get dropped, and the LLM generates answers citing wrong sections or missing important caveats because it never saw the relevant context.
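Rather than letting prompt assembly truncate silently, you can budget tokens explicitly and log what gets dropped. A minimal sketch; the budget figures and the chars-per-token heuristic below are illustrative assumptions, not recommendations:

```python
from typing import Callable, List, Tuple

def pack_chunks(chunks: List[str], budget: int,
                count_tokens: Callable[[str], int]) -> Tuple[List[str], List[str]]:
    """Greedily keep the highest-ranked chunks that fit the token budget and
    report what was dropped, instead of silently truncating the prompt."""
    kept, dropped, used = [], [], 0
    for chunk in chunks:  # assumed to arrive sorted by retrieval score
        cost = count_tokens(chunk)
        if used + cost <= budget:
            kept.append(chunk)
            used += cost
        else:
            dropped.append(chunk)
    return kept, dropped

# Rough budget: 32k context minus system prompt, history, instructions,
# and headroom reserved for the model's answer (all figures illustrative).
budget = 32_000 - 1_000 - 5_000 - 500 - 1_500

retrieved_chunks = ["chunk about deployment pipelines ...", "chunk about code review ..."]
# len(s) // 4 is a crude chars-per-token heuristic; swap in your model's tokenizer.
kept, dropped = pack_chunks(retrieved_chunks, budget, count_tokens=lambda s: len(s) // 4)
if dropped:
    print(f"warning: {len(dropped)} chunks did not fit the context window")
```

Surfacing the dropped chunks makes the failure observable: you can alert on it, re-rank more aggressively, or compress chunks before generation instead of losing them without a trace.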
Security and Multi-Tenancy
Vector indexes leak information in subtle ways. If your index does not enforce access control lists (ACLs) at retrieval time, a user in Sales might retrieve embeddings for confidential Engineering documents. Even without returning raw text, the embedding proximity itself reveals information: "your query is very similar to this document you cannot access" tells the user something. Production systems at companies like Microsoft apply per-document ACL filters during vector search, filtering out results the user cannot access before re-ranking. This requires the vector database to support efficient metadata filtering without scanning all vectors. For strong guarantees, some organizations maintain physically isolated indexes per tenant, trading increased storage and operational cost for security.
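A minimal in-memory sketch of the filter-before-rank idea; real deployments push the ACL filter into the vector database's metadata filtering, and the document structure and group names here are assumptions for illustration:

```python
import numpy as np

# Each document carries the set of groups allowed to read it.
docs = [
    {"id": "eng-roadmap", "groups": {"engineering"}, "vec": np.array([0.9, 0.1])},
    {"id": "sales-playbook", "groups": {"sales"}, "vec": np.array([0.2, 0.8])},
]

def retrieve(query_vec: np.ndarray, user_groups: set, top_k: int = 5):
    # Filter first, so inaccessible documents never influence ranking and
    # never leak "your query is close to something you can't see" signals.
    visible = [d for d in docs if d["groups"] & user_groups]
    scored = sorted(
        visible,
        key=lambda d: float(query_vec @ d["vec"]) /
                      (np.linalg.norm(query_vec) * np.linalg.norm(d["vec"])),
        reverse=True,
    )
    return scored[:top_k]

print([d["id"] for d in retrieve(np.array([0.8, 0.2]), user_groups={"sales"})])
```

The key design choice is that the ACL check happens before similarity ranking, not as a post-filter on results already returned to the application layer.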
Staleness and Embedding Drift
If ingestion runs nightly, a critical policy change at 10 AM is invisible to users until tomorrow. Real-time indexing reduces lag to minutes or seconds but increases resource usage by 3 to 5x and adds complexity around eventual consistency.
Upgrading embedding models invalidates existing indexes. Mixing embeddings from text-embedding-ada-002 and text-embedding-3-large in one index breaks vector similarity: distances are no longer comparable. The solution is versioned indexes and batch re-embedding, but during migration you see degraded relevance for days. With 200 million embeddings, re-embedding at 1000 vectors per second takes over 2 days.
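A minimal sketch of such a versioned migration, with in-memory stand-ins for the document source, the new embedding model, and both indexes (all assumptions, not a real API); the point is that old and new vectors never share one index, and query traffic stays on v1 until v2 is fully built:

```python
from typing import Dict, Iterable, List

index_v1: Dict[str, List[float]] = {"doc-1": [0.1, 0.2], "doc-2": [0.3, 0.4]}
index_v2: Dict[str, List[float]] = {}
active_index = index_v1  # queries read from here

def documents() -> Iterable[dict]:
    # Stand-in for streaming documents out of the source store.
    yield {"id": "doc-1", "text": "first document"}
    yield {"id": "doc-2", "text": "second document"}

def embed_v2(text: str) -> List[float]:
    # Stand-in for the new embedding model.
    return [float(len(text)), 0.0]

def migrate() -> None:
    global active_index
    for doc in documents():
        index_v2[doc["id"]] = embed_v2(doc["text"])  # write to the new index only
    active_index = index_v2  # cut query traffic over once v2 is complete

migrate()
print(active_index)

# Back-of-envelope: 200M vectors / 1,000 vectors per second ≈ 200,000 s ≈ 2.3 days,
# so plan to run both indexes in parallel for the entire migration window.
```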
Evaluation Challenges
RAG systems produce fluent but subtly wrong answers that are hard to catch. You need metrics for groundedness (does the answer stick to retrieved facts), citation correctness (do cited sources actually support the claims attributed to them), and coverage (did retrieval find all relevant sources). Human-labeled test sets with 500 to 1000 examples provide the gold standard, but automatic heuristics like checking for unsupported claims or hallucinated citations catch many issues in production.
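A minimal sketch of one such heuristic, flagging citations in an answer that point at documents the model was never shown; the [doc-id] marker convention is an illustrative assumption, not part of any particular RAG framework:

```python
import re
from typing import Dict, List

def hallucinated_citations(answer: str, retrieved: Dict[str, str]) -> List[str]:
    """Return cited ids that were never retrieved, e.g. a '[doc-7]' marker
    that points at nothing the model actually saw."""
    cited = re.findall(r"\[([\w-]+)\]", answer)
    return [doc_id for doc_id in cited if doc_id not in retrieved]

retrieved = {"doc-1": "Product X is safe for users over 18.",
             "doc-2": "Product X ships in blue and green."}
answer = "Product X is safe for all ages [doc-1][doc-7]."
print(hallucinated_citations(answer, retrieved))  # ['doc-7']
```

Checks like this run cheaply on every production response, while the human-labeled test set remains the benchmark for groundedness and coverage during offline evaluation.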