Failure Modes: When Chunking Breaks
One common mitigation is to store hierarchical chunk identifiers (e.g., document_id.section_2.subsection_a) and always retrieve adjacent chunks when a high-confidence match occurs. Some systems use a two-pass retrieval: first find the best chunk, then automatically pull the chunks immediately before and after it to capture context. This roughly triples retrieval cost but prevents silent failures.
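Here is a minimal sketch of that two-pass pattern. It assumes chunk IDs end in a positional index (e.g., doc42.sec2.c7), and `search` (an ANN query) and `lookup` (a cheap key-value fetch by chunk ID) are hypothetical stand-ins for your vector store's client:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str   # hierarchical ID, e.g. "doc42.sec2.c7"
    text: str
    score: float = 0.0

def neighbor_ids(chunk_id: str) -> list[str]:
    """IDs of the chunks immediately before and after, within the same section."""
    prefix, idx = chunk_id.rsplit(".c", 1)
    i = int(idx)
    return [f"{prefix}.c{i - 1}", f"{prefix}.c{i + 1}"]

def retrieve_with_neighbors(query_vec, search, lookup, threshold=0.85):
    """Two-pass retrieval: ANN search, then pull adjacent chunks for
    high-confidence hits via direct ID lookup (no second ANN query)."""
    hits = search(query_vec, top_k=10)           # pass 1: best chunks
    results = {h.chunk_id: h for h in hits}
    for h in hits:
        if h.score >= threshold:                 # pass 2: expand neighbors
            for nid in neighbor_ids(h.chunk_id):
                neighbor = lookup(nid)           # returns None at boundaries
                if neighbor is not None and nid not in results:
                    results[nid] = neighbor
    return list(results.values())
```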
Context Flooding: Death by a Thousand Chunks
The opposite failure occurs when retrieval is too aggressive or chunks are too large. Suppose your system retrieves 50 chunks of 1,000 tokens each, consuming 50k of your 128k budget. The model receives a flood of semi-relevant text and starts anchoring on spurious details buried deep in the context. This manifests as hallucinations where the model combines facts from unrelated chunks, or contradictions where the answer shifts depending on which chunk the model happened to focus on. Users report that the same question gives different answers on successive attempts. Latency also spikes: processing 50k extra tokens adds 200 to 400 ms to inference time. The fix is aggressive re-ranking and diversity filtering. After an initial retrieval of 100 to 200 candidates, use a cross-encoder or small LLM to score relevance, then apply maximal marginal relevance (MMR) to select a diverse set of 10 to 20 chunks. This keeps the context focused on distinct perspectives rather than repetitive near-duplicates.
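A minimal MMR selection sketch, assuming `relevance` holds the cross-encoder scores for the candidates and `cand_vecs` is a matrix of unit-normalized chunk embeddings (so a dot product is cosine similarity); `lam` trades relevance against diversity:

```python
import numpy as np

def mmr_select(cand_vecs, relevance, k=15, lam=0.7):
    """Maximal marginal relevance: greedily pick chunks that balance
    re-ranker relevance against similarity to already-selected chunks."""
    selected = []
    remaining = list(range(len(relevance)))
    while remaining and len(selected) < k:
        best, best_score = None, -np.inf
        for i in remaining:
            # Redundancy = max cosine similarity to anything already picked.
            redundancy = max((cand_vecs[i] @ cand_vecs[j] for j in selected),
                             default=0.0)
            score = lam * relevance[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected  # indices into the candidate list, in selection order
```

With `lam=1.0` this reduces to plain relevance ranking; lowering it penalizes near-duplicate chunks more heavily.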
Structured Data and Cross-Document References
Tables, code, and API documentation frequently depend on non-contiguous information. A code snippet might reference an import at line 1, use a function defined in a separate file, and require configuration from a third file. Simple contiguous chunking will either split these apart or create a massive chunk spanning multiple files. A concrete example: an API endpoint's documentation chunk describes its parameters, but the authentication scheme is documented in a separate security section. When a developer asks "how do I authenticate this endpoint?", retrieval returns the endpoint chunk but misses the auth chunk because the two are semantically distant. The model then hallucinates a plausible but incorrect authentication method. The solution is explicit linking during chunking. When parsing code or structured docs, create bidirectional links between related chunks (imports, function definitions, configuration references). At retrieval time, follow these links to pull in dependencies even when they do not match the query directly. This requires custom parsing per document type but prevents silent failures in technical documentation.
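A sketch of the link-following step, under the assumption that each chunk record carries a `links` list of related chunk IDs populated by a custom parser at indexing time; `search` and `lookup` are again hypothetical store clients:

```python
from collections import deque

def retrieve_with_links(query_vec, search, lookup, top_k=10, max_hops=1):
    """Retrieve by similarity, then expand along explicit chunk links
    (imports, definitions, referenced sections) via breadth-first search.
    `max_hops` bounds expansion so one hub chunk can't flood the context."""
    hits = search(query_vec, top_k=top_k)
    results = {h.chunk_id: h for h in hits}
    frontier = deque((h, 0) for h in hits)
    while frontier:
        chunk, depth = frontier.popleft()
        if depth >= max_hops:
            continue
        for linked_id in getattr(chunk, "links", []):
            if linked_id not in results:
                linked = lookup(linked_id)       # direct fetch, not ANN
                if linked is not None:
                    results[linked_id] = linked
                    frontier.append((linked, depth + 1))
    return list(results.values())
```

In the endpoint example above, the endpoint chunk would link to the security section, so the auth chunk rides along even though it never matches the query embedding.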
Scale-Induced Index Degradation
At billion-chunk scale, naive overlapping and semantic chunking can cause operational failures. Suppose you chunk 10 million documents into an average of 50 chunks each with 25 percent overlap: that is 625 million vectors. When the corpus grows 10x to 100 million documents, you have 6.25 billion vectors. Most vector databases use approximate nearest neighbor (ANN) search algorithms like HNSW or IVF that trade accuracy for speed, and as the index grows, recall degrades: you might retrieve the true top 20 chunks only 70 to 80 percent of the time instead of 95+ percent. This manifests as inconsistent answer quality, where the same question sometimes gets great answers and sometimes misses obviously relevant documents. The mitigation is either sharding the index by document type or access domain, or investing in more expensive exact search for the final re-ranking stage. You might use ANN to narrow 6 billion chunks to 1,000 candidates in 30 ms, then use exact search over those 1,000 to pick the final 20 in another 20 ms. This hybrid approach keeps total latency acceptable while maintaining high recall.
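A sketch of that two-stage hybrid, with `ann_index.search` and `fetch_vectors` as hypothetical stand-ins for your vector database's API, and embeddings assumed unit-normalized so a dot product equals cosine similarity:

```python
import numpy as np

def hybrid_search(query_vec, ann_index, fetch_vectors,
                  coarse_k=1000, final_k=20):
    """Stage 1: ANN narrows the full index to a small candidate pool (fast,
    lossy). Stage 2: exact brute-force scoring over that pool (cheap at
    ~1,000 vectors) recovers the precision ANN gave up."""
    candidate_ids = ann_index.search(query_vec, top_k=coarse_k)
    vecs = fetch_vectors(candidate_ids)          # (coarse_k, d) array
    scores = vecs @ query_vec                    # exact cosine similarities
    order = np.argsort(scores)[::-1][:final_k]   # best final_k, descending
    return [candidate_ids[i] for i in order], scores[order]
```

The exact stage costs a single matrix-vector product over 1,000 rows, which is negligible next to the ANN traversal, so the recall gain is nearly free.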