RAG Failure Modes and Production Challenges
The Fundamental Constraint:
RAG's generation quality is upper bounded by retrieval quality. Even the most capable LLM cannot produce accurate answers if the retrieval system returns irrelevant, incomplete, or incorrect documents. This dependency creates several critical failure modes that interviewers love to probe.
❗ Remember: If the right document is not retrieved, the system will hallucinate, fabricate connections between unrelated text, or confidently state "I don't know" when the answer exists in your corpus.
Security and Multi-Tenancy:
Vector indexes leak information in subtle ways. If your index does not enforce access control lists (ACLs) at retrieval time, a user in Sales might retrieve embeddings for confidential Engineering documents. Even without returning raw text, the embedding proximity itself reveals information: "your query is very similar to this document you cannot access" tells the user something.
Production systems at companies like Microsoft apply per-document ACL filters during vector search, filtering out results the user cannot access before re-ranking. This requires the vector database to support efficient metadata filtering without scanning all vectors. For strong guarantees, some organizations maintain physically isolated indexes per tenant, trading increased storage and operational cost for security.
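To make the idea concrete, here is a minimal sketch of ACL-aware retrieval over an in-memory index. The Chunk structure, group names, and brute-force cosine search are illustrative stand-ins for a real vector database with metadata filtering, not any particular product's API:

```python
# Minimal sketch: enforce per-chunk ACLs before similarity ranking.
from dataclasses import dataclass

import numpy as np


@dataclass
class Chunk:
    text: str
    embedding: np.ndarray
    allowed_groups: set  # ACL metadata stored alongside the vector


def acl_filtered_search(query_emb, chunks, user_groups, top_k=5):
    # 1. Apply the ACL predicate first, so unauthorized chunks never
    #    participate in ranking (and never leak via proximity).
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    if not visible:
        return []
    # 2. Rank only the visible chunks by cosine similarity.
    mat = np.stack([c.embedding for c in visible])
    sims = mat @ query_emb / (
        np.linalg.norm(mat, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    return [visible[i] for i in np.argsort(-sims)[:top_k]]


# A Sales user never sees Engineering-only chunks, even if those vectors
# are the query's nearest neighbors.
rng = np.random.default_rng(0)
index = [
    Chunk("Q4 sales playbook", rng.normal(size=8), {"sales"}),
    Chunk("Unannounced roadmap", rng.normal(size=8), {"engineering"}),
]
print([c.text for c in acl_filtered_search(rng.normal(size=8), index, {"sales"})])
```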
Staleness and Embedding Drift:
If ingestion runs nightly, a critical policy change made at 10 AM is invisible to users until the next day. Real-time indexing reduces that lag to minutes or seconds, but it increases resource usage by 3 to 5x and adds complexity around eventual consistency.
Upgrading embedding models invalidates existing indexes. Mixing embeddings from text-embedding-ada-002 and text-embedding-3-large in one index breaks vector similarity: distances are no longer comparable. The solution is versioned indexes and batch re-embedding, but during migration you see degraded relevance for days. With 200 million embeddings, re-embedding at 1,000 vectors per second takes over 2 days.
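A sketch of the versioning discipline under stated assumptions: the embed() stub and index names are hypothetical, and the arithmetic at the end simply restates the migration estimate above.

```python
# Sketch: versioned indexes keep query and document embeddings from the
# same model; the index routing and embed() stub here are illustrative.
import hashlib

EMBED_VERSIONS = {"v1": "text-embedding-ada-002", "v2": "text-embedding-3-large"}
ACTIVE_VERSION = "v1"  # flip to "v2" only after batch re-embedding completes


def embed(text: str, version: str) -> list[float]:
    # Stand-in for a real embedding call; deterministic per (text, version).
    digest = hashlib.sha256(f"{version}:{text}".encode()).digest()
    return [b / 255 for b in digest[:8]]


def search(query: str, version: str = ACTIVE_VERSION) -> str:
    # Route the query to the index built by the same embedding model;
    # mixing v1 and v2 vectors makes distances incomparable.
    vec = embed(query, version)
    return f"query index_{version} ({EMBED_VERSIONS[version]}) with a {len(vec)}-dim vector"


print(search("refund policy"))

# Back-of-the-envelope migration cost from the paragraph above:
total_vectors, per_second = 200_000_000, 1_000
print(f"re-embedding takes ~{total_vectors / per_second / 86_400:.1f} days")  # ≈ 2.3 days
```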
Retrieval Quality Failures:
Poor embeddings cause semantic mismatches. If your embedding model was trained on general web text but you are searching medical literature with specialized terminology, semantically similar medical concepts may map to distant vectors. A query about "myocardial infarction" might miss documents about "heart attack" if the model does not understand they are synonyms.
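One cheap check before committing to an embedding model is to probe how it places known domain synonym pairs. The sketch below assumes the sentence-transformers package is installed and uses all-MiniLM-L6-v2 purely as a stand-in general-purpose model; it makes no claim about what score any particular model will produce.

```python
# Sketch: probe whether a general-purpose model keeps domain synonyms close.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in general model

pairs = [
    ("myocardial infarction", "heart attack"),
    ("cerebrovascular accident", "stroke"),
]
for a, b in pairs:
    emb = model.encode([a, b], normalize_embeddings=True)
    sim = util.cos_sim(emb[0], emb[1]).item()
    # If domain synonyms score poorly, retrieval will miss documents that
    # use the other term; consider a domain-tuned model or hybrid
    # lexical + vector search.
    print(f"{a!r} vs {b!r}: cosine {sim:.2f}")
```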
Wrong chunking granularity breaks context. Imagine a document that says "Product X is safe for users over 18" chunked into "Product X is safe" (chunk 1) and "for users over 18" (chunk 2). Retrieving only chunk 1 leads to dangerous misinformation. This happens frequently with 200 to 300 token chunks where critical caveats fall into adjacent chunks that do not get retrieved.
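A common partial mitigation is sliding-window chunking with overlap, so a trailing caveat stays attached to the claim it qualifies. A rough sketch, using word counts as a stand-in for tokens:

```python
# Sketch: sliding-window chunking with overlap so caveats are not severed
# from the sentences they qualify. Word counts approximate token counts.
def chunk_with_overlap(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks


doc = "Product X is safe for users over 18 when used as directed. " * 40
for c in chunk_with_overlap(doc, chunk_size=30, overlap=10)[:2]:
    print(len(c.split()), "words:", c[:60], "...")
```

Sentence- or section-aware splitting helps further, since overlap alone only reduces the odds that a caveat lands in an unretrieved neighbor.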
Query formulation issues compound the problem. User queries are often vague, ambiguous, or use different vocabulary than documents. "How do I ship code?" might mean deployment pipelines, code review process, or version control workflows. Without query understanding or expansion, retrieval returns irrelevant results.
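Query rewriting or multi-query retrieval is one standard response: paraphrase the question into the corpus's vocabulary and merge the results. In this sketch, rewrite_query() is a hypothetical stand-in for an LLM call, and the toy keyword retriever exists only to make the example runnable.

```python
# Sketch: multi-query retrieval with de-duplication of merged results.
def rewrite_query(query: str) -> list[str]:
    # Hypothetical stand-in: in practice, prompt a small LLM for 2-3
    # paraphrases in the corpus's vocabulary.
    canned = {
        "How do I ship code?": [
            "How do I deploy a service to production?",
            "What is the code review and merge process?",
        ]
    }
    return [query] + canned.get(query, [])


def multi_query_retrieve(query: str, retrieve, top_k: int = 5) -> list[str]:
    # Retrieve for each rewrite, then de-duplicate while preserving order.
    seen, merged = set(), []
    for q in rewrite_query(query):
        for doc_id in retrieve(q, top_k):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged[:top_k]


# Toy keyword retriever, only here to make the example runnable.
corpus = {"d1": "deployment pipeline guide", "d2": "code review checklist"}
def toy_retrieve(q, k):
    terms = set(q.lower().replace("?", "").split())
    return [d for d, text in corpus.items() if terms & set(text.split())][:k]

print(multi_query_retrieve("How do I ship code?", toy_retrieve))
```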
Context Overflow and Truncation:
A 32,000-token context limit seems generous until you account for system prompts (500 to 1,000 tokens), conversation history (2,000 to 5,000 tokens for multi-turn chats), and instructions (roughly 500 tokens). Budget some headroom and that leaves roughly 24,000 tokens for retrieved documents. If you naively insert 40 chunks of 1,000 tokens each, you exceed the limit.
Most systems silently truncate, keeping the first N chunks. This means later retrieved documents, which might contain critical information, get dropped. The LLM generates answers citing wrong sections or missing important caveats because it never saw the relevant context.
Context budget example: 32K total limit − ~8K overhead (system prompt, history, instructions) = ~24K for retrieved documents.
Mitigations include smarter chunk selection with relevance thresholds, summarization of retrieved documents before insertion, and query-specific compression. Some systems use a smaller model to extract only the relevant sentences from each chunk rather than inserting entire chunks.
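Here is a sketch of budget-aware packing with a relevance threshold; the numbers mirror the budget example above, and word counts stand in for a real tokenizer such as tiktoken.

```python
# Sketch: pack retrieved chunks under an explicit token budget instead of
# silently truncating. Word counts approximate tokens; use the model's
# tokenizer (e.g. tiktoken) for exact budgets in production.
CONTEXT_LIMIT = 32_000
OVERHEAD = 8_000          # system prompt + history + instructions + margin
BUDGET = CONTEXT_LIMIT - OVERHEAD


def pack_chunks(ranked_chunks, budget=BUDGET, min_score=0.3):
    """ranked_chunks: [(score, text), ...] sorted by relevance, best first."""
    selected, used = [], 0
    for score, text in ranked_chunks:
        if score < min_score:       # relevance threshold: drop weak chunks
            break
        tokens = len(text.split())  # rough proxy for token count
        if used + tokens > budget:
            continue                # skip, or summarize/compress instead
        selected.append(text)
        used += tokens
    return selected, used


chunks = [(0.91, "policy text " * 400), (0.85, "caveat text " * 300), (0.2, "noise " * 100)]
docs, used = pack_chunks(chunks)
print(len(docs), "chunks,", used, "of", BUDGET, "token budget")
```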
Evaluation Challenges:
RAG systems produce fluent but subtly wrong answers that are hard to catch. You need metrics for groundedness (does the answer stick to retrieved facts?), citation correctness (do citations point to the right sources?), and coverage (did retrieval find all relevant sources?). Human-labeled test sets with 500 to 1,000 examples are the gold standard, but automatic heuristics such as checking for unsupported claims or hallucinated citations catch many issues in production.
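Two of those automatic heuristics can be sketched in a few lines. The citation format and lexical-overlap threshold below are illustrative choices; production systems usually layer an LLM-as-judge or NLI model on top.

```python
# Sketch: crude checks for citation correctness and groundedness.
import re


def check_citations(answer: str, retrieved_ids: set[str]) -> list[str]:
    # Flag citations like [doc7] that don't correspond to a retrieved chunk.
    cited = set(re.findall(r"\[(doc\d+)\]", answer))
    return sorted(cited - retrieved_ids)


def ungrounded_sentences(answer: str, retrieved_text: str, min_overlap: float = 0.3):
    # Flag sentences whose content words barely overlap the retrieved context.
    context_words = set(retrieved_text.lower().split())
    flagged = []
    for sent in re.split(r"(?<=[.!?])\s+", answer):
        words = [w for w in re.findall(r"[a-z]+", sent.lower()) if len(w) > 3]
        if not words:
            continue
        overlap = sum(w in context_words for w in words) / len(words)
        if overlap < min_overlap:
            flagged.append(sent)
    return flagged


retrieved = {"doc1": "Feature X requires approval for deployments over 1000 users."}
answer = "Feature X always requires approval [doc1]. It also supports autoscaling [doc7]."
print(check_citations(answer, set(retrieved)))                     # ['doc7']
print(ungrounded_sentences(answer, " ".join(retrieved.values())))  # flags the second sentence
```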
💡 Key Takeaways
✓ Retrieval quality strictly bounds generation quality: wrong chunking, poor embeddings, or bad query formulation cause hallucinations regardless of LLM capability
✓ Context overflow: with a 32K-token limit, only about 24K tokens remain after overhead, so smart chunk selection or summarization is needed to avoid silent truncation
✓ Multi-tenancy requires ACL enforcement at retrieval time; embeddings leak information even without returning raw text, necessitating per-tenant indexes for strong security guarantees
✓ Embedding model upgrades invalidate indexes: re-embedding 200 million vectors at 1,000 per second takes over 2 days, with degraded relevance during migration
✓ Evaluation needs human-labeled test sets (500 to 1,000 examples) plus automatic groundedness and citation-correctness checks to catch subtle errors
📌 Examples
1. Medical RAG system: the query "treatment for chest pain" retrieves generic wellness advice but misses the critical document on emergency cardiac protocols because the embedding model was trained on general text, not medical terminology, resulting in a dangerous answer
2. Enterprise assistant: a document stating "Feature X requires approval for deployments over 1000 users" is split across chunks; only the first chunk is retrieved, so the system tells the user "Feature X requires approval" without the threshold, causing an incorrect process
3. Multi-tenant SaaS: a Sales team member searches for "Q4 strategy" and vector search returns Engineering roadmap embeddings with high similarity, revealing confidential information about unannounced features even though no document text is returned