Failure Modes: When Chunking Breaks
One common mitigation is to store hierarchical chunk identifiers (e.g., document_id.section_2.subsection_a) and always retrieve adjacent chunks when a high-confidence match occurs. Some systems use a two-pass retrieval: first find the best chunk, then automatically pull the chunks immediately before and after it to capture context. This roughly triples retrieval cost but prevents silent failures.
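Here is a minimal sketch of that two-pass pattern. It assumes chunk IDs end in a positional index (e.g., doc42.sec2.c7), and `search` (an ANN query) and `lookup` (a cheap key-value fetch by chunk ID) are hypothetical stand-ins for your vector store's client:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str   # hierarchical ID, e.g. "doc42.sec2.c7"
    text: str
    score: float = 0.0

def neighbor_ids(chunk_id: str) -> list[str]:
    """IDs of the chunks immediately before and after, within the same section."""
    prefix, idx = chunk_id.rsplit(".c", 1)
    i = int(idx)
    return [f"{prefix}.c{i - 1}", f"{prefix}.c{i + 1}"]

def retrieve_with_neighbors(query_vec, search, lookup, threshold=0.85):
    """Two-pass retrieval: ANN search, then pull adjacent chunks for
    high-confidence hits via direct ID lookup (no second ANN query)."""
    hits = search(query_vec, top_k=10)           # pass 1: best chunks
    results = {h.chunk_id: h for h in hits}
    for h in hits:
        if h.score >= threshold:                 # pass 2: expand neighbors
            for nid in neighbor_ids(h.chunk_id):
                neighbor = lookup(nid)           # returns None at boundaries
                if neighbor is not None and nid not in results:
                    results[nid] = neighbor
    return list(results.values())
```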
Context Flooding: Death by a Thousand Chunks
The opposite failure occurs when retrieval is too aggressive or chunks are too large. Suppose your system retrieves 50 chunks of 1,000 tokens each, consuming 50k of your 128k budget. The model receives a flood of semi-relevant text and starts anchoring on spurious details buried deep in the context. This manifests as hallucinations where the model combines facts from unrelated chunks, or contradictions where the answer shifts depending on which chunk the model happened to focus on. Users report that the same question gives different answers on successive attempts. Latency also spikes: processing 50k extra tokens adds 200 to 400 ms to inference time. The fix is aggressive re-ranking and diversity filtering. After an initial retrieval of 100 to 200 candidates, use a cross-encoder or small LLM to score relevance, then apply maximal marginal relevance (MMR) to select a diverse set of 10 to 20 chunks. This keeps the context focused on distinct perspectives rather than repetitive near-duplicates.
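A minimal MMR selection sketch, assuming `relevance` holds the cross-encoder scores for the candidates and `cand_vecs` is a matrix of unit-normalized chunk embeddings (so a dot product is cosine similarity); `lam` trades relevance against diversity:

```python
import numpy as np

def mmr_select(cand_vecs, relevance, k=15, lam=0.7):
    """Maximal marginal relevance: greedily pick chunks that balance
    re-ranker relevance against similarity to already-selected chunks."""
    selected = []
    remaining = list(range(len(relevance)))
    while remaining and len(selected) < k:
        best, best_score = None, -np.inf
        for i in remaining:
            # Redundancy = max cosine similarity to anything already picked.
            redundancy = max((cand_vecs[i] @ cand_vecs[j] for j in selected),
                             default=0.0)
            score = lam * relevance[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected  # indices into the candidate list, in selection order
```

With `lam=1.0` this reduces to plain relevance ranking; lowering it penalizes near-duplicate chunks more heavily.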
Structured Data and Cross-Document References
Tables, code, and API documentation frequently depend on non-contiguous information. A code snippet might reference an import at line 1, use a function defined in a separate file, and require configuration from a third file. Simple contiguous chunking will either split these apart or create a massive chunk spanning multiple files. A concrete example: an API endpoint's documentation chunk describes its parameters, but the authentication scheme is documented in a separate security section. When a developer asks "how do I authenticate this endpoint?", retrieval returns the endpoint chunk but misses the auth chunk because the two are semantically distant. The model then hallucinates a plausible but incorrect authentication method. The solution is explicit linking during chunking. When parsing code or structured docs, create bidirectional links between related chunks (imports, function definitions, configuration references). At retrieval time, follow these links to pull in dependencies even when they do not match the query directly. This requires custom parsing per document type but prevents silent failures in technical documentation.
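A sketch of the link-following step, under the assumption that each chunk record carries a `links` list of related chunk IDs populated by a custom parser at indexing time; `search` and `lookup` are again hypothetical store clients:

```python
from collections import deque

def retrieve_with_links(query_vec, search, lookup, top_k=10, max_hops=1):
    """Retrieve by similarity, then expand along explicit chunk links
    (imports, definitions, referenced sections) via breadth-first search.
    `max_hops` bounds expansion so one hub chunk can't flood the context."""
    hits = search(query_vec, top_k=top_k)
    results = {h.chunk_id: h for h in hits}
    frontier = deque((h, 0) for h in hits)
    while frontier:
        chunk, depth = frontier.popleft()
        if depth >= max_hops:
            continue
        for linked_id in getattr(chunk, "links", []):
            if linked_id not in results:
                linked = lookup(linked_id)       # direct fetch, not ANN
                if linked is not None:
                    results[linked_id] = linked
                    frontier.append((linked, depth + 1))
    return list(results.values())
```

In the endpoint example above, the endpoint chunk would link to the security section, so the auth chunk rides along even though it never matches the query embedding.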
Scale-Induced Index Degradation
At billion-chunk scale, naive overlapping and semantic chunking can cause operational failures. Suppose you chunk 10 million documents into an average of 50 chunks each with 25 percent overlap: that is 625 million vectors. When the corpus grows 10x to 100 million documents, you have 6.25 billion vectors. Most vector databases use approximate nearest neighbor (ANN) search algorithms like HNSW or IVF that trade accuracy for speed, and as the index grows, recall degrades: you might retrieve the true top 20 chunks only 70 to 80 percent of the time instead of 95+ percent. This manifests as inconsistent answer quality, where the same question sometimes gets great answers and sometimes misses obviously relevant documents. The mitigation is either sharding the index by document type or access domain, or investing in more expensive exact search for the final re-ranking stage. You might use ANN to narrow 6 billion chunks to 1,000 candidates in 30 ms, then use exact search over those 1,000 to pick the final 20 in another 20 ms. This hybrid approach keeps total latency acceptable while maintaining high recall.
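A sketch of that two-stage hybrid, with `ann_index.search` and `fetch_vectors` as hypothetical stand-ins for your vector database's API, and embeddings assumed unit-normalized so a dot product equals cosine similarity:

```python
import numpy as np

def hybrid_search(query_vec, ann_index, fetch_vectors,
                  coarse_k=1000, final_k=20):
    """Stage 1: ANN narrows the full index to a small candidate pool (fast,
    lossy). Stage 2: exact brute-force scoring over that pool (cheap at
    ~1,000 vectors) recovers the precision ANN gave up."""
    candidate_ids = ann_index.search(query_vec, top_k=coarse_k)
    vecs = fetch_vectors(candidate_ids)          # (coarse_k, d) array
    scores = vecs @ query_vec                    # exact cosine similarities
    order = np.argsort(scores)[::-1][:final_k]   # best final_k, descending
    return [candidate_ids[i] for i in order], scores[order]
```

The exact stage costs a single matrix-vector product over 1,000 rows, which is negligible next to the ANN traversal, so the recall gain is nearly free.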