Advanced: Hierarchical Retrieval and Multi-Stage Context
Beyond Flat Chunking:
Flat chunking treats every document as a sequence of independent chunks, but many real-world corpora have rich structure: books have chapters and sections, codebases have directories and imports, legal documents have hierarchical clauses. Hierarchical retrieval exploits this structure to improve both precision and context completeness.
The core idea is to chunk at multiple granularities simultaneously. A design document might be chunked as: the full document (5,000 tokens), each major section (800 to 1,200 tokens), and individual paragraphs (200 to 400 tokens). All three levels are embedded and indexed. At query time, the system can retrieve at the appropriate level: paragraph chunks for precise facts, section chunks for broader context, or the full document when the query is exploratory.
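A minimal sketch of this ingestion step, assuming a simple in-memory representation; the `Chunk` type, the blank-line paragraph splitting, and the trailing `embed()`/index calls are illustrative assumptions, not a prescribed implementation:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    id: str
    level: str          # "document" | "section" | "paragraph"
    text: str
    parent_id: str | None = None
    embedding: list[float] | None = None

def chunk_hierarchically(doc_id: str, sections: dict[str, str]) -> list[Chunk]:
    """Chunk one document at three granularities, keeping parent/child links.

    `sections` maps section titles to section text; paragraph splitting here
    is a naive blank-line split, purely for illustration.
    """
    chunks = [Chunk(id=doc_id, level="document",
                    text="\n\n".join(sections.values()))]
    for i, (title, body) in enumerate(sections.items()):
        sec_id = f"{doc_id}/sec{i}"
        chunks.append(Chunk(id=sec_id, level="section", text=body, parent_id=doc_id))
        for j, para in enumerate(p for p in body.split("\n\n") if p.strip()):
            chunks.append(Chunk(id=f"{sec_id}/p{j}", level="paragraph",
                                text=para, parent_id=sec_id))
    return chunks

# Each chunk would then be embedded and written to a per-level index, e.g.:
# for c in chunks: c.embedding = embed(c.text); index[c.level].add(c)
```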
How Multi-Level Retrieval Works:
Suppose a user asks "What were the performance goals for the 2024 roadmap?" A single-level system retrieves paragraph chunks mentioning performance and 2024, but might miss the executive summary that contextualizes those goals. A hierarchical system retrieves both: the executive summary chunk (document level) and the specific performance metrics paragraph (paragraph level).
The implementation uses a two-pass retrieval strategy. The first pass retrieves candidates at all levels: perhaps 50 paragraph chunks, 20 section chunks, and 5 full-document chunks. The second pass re-ranks them together and selects a diverse set. The key is context budget allocation: you might spend 10k tokens on full-document chunks (broad context), 30k tokens on section chunks (detailed coverage), and 10k tokens on paragraph chunks (specific facts).
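A sketch of that two-pass flow, reusing the `Chunk` type from the earlier sketch; `search()`, `rerank()`, and `token_count()` are hypothetical helpers standing in for your vector index, cross-encoder, and tokenizer, and the candidate counts and budgets simply mirror the numbers above:

```python
# Assumed helpers: search(level, query, k) returns scored chunks from that
# level's index; rerank(query, chunks) reorders them jointly (e.g. a
# cross-encoder); token_count(text) measures context cost.
BUDGETS = {"document": 10_000, "section": 30_000, "paragraph": 10_000}
CANDIDATES = {"paragraph": 50, "section": 20, "document": 5}

def retrieve_hierarchical(query: str) -> list[Chunk]:
    # Pass 1: gather candidates at every granularity (can run in parallel).
    pool = []
    for level, k in CANDIDATES.items():
        pool.extend(search(level, query, k=k))

    # Pass 2: rerank jointly, then fill each level's token budget greedily.
    selected, spent = [], {level: 0 for level in BUDGETS}
    for chunk in rerank(query, pool):            # best-first order
        cost = token_count(chunk.text)
        if spent[chunk.level] + cost <= BUDGETS[chunk.level]:
            selected.append(chunk)
            spent[chunk.level] += cost
    return selected
```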
When Hierarchical Retrieval Wins:
This approach shines for complex questions that require both high-level understanding and specific details. For example, debugging questions in technical documentation: "Why is my API call failing with error 403?" benefits from both the high-level authentication architecture doc (to understand the auth flow) and the specific error code reference (to see what 403 means in this context).
The trade-off is operational complexity. You now maintain three indexes instead of one, tripling storage and ingestion cost. Retrieval latency increases slightly because you query multiple indexes, though parallel execution keeps the overhead to 10 to 20 ms. Re-ranking across different chunk sizes is also tricky: how do you compare the relevance of a 300-token paragraph to a 5,000-token full document? Most systems normalize scores by chunk size and apply learned weights.
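One way to make scores comparable across granularities, in the spirit of the normalization just described: dampen the raw score by chunk length and apply a per-level weight. Both the log-penalty form and the weight values below are illustrative assumptions, not a standard formula:

```python
import math

# Per-level weights (learned in practice, hand-picked here) compensating for
# the fact that long chunks match many queries superficially while short
# chunks match sharply.
LEVEL_WEIGHTS = {"paragraph": 1.0, "section": 0.9, "document": 0.75}

def normalized_score(raw_score: float, chunk_tokens: int, level: str) -> float:
    # Dampen the advantage of long chunks with a log-length penalty,
    # anchored at a 300-token paragraph as the reference size.
    length_penalty = 1.0 / (1.0 + math.log1p(chunk_tokens / 300))
    return raw_score * length_penalty * LEVEL_WEIGHTS[level]
```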
Parent Document Retrieval:
A simpler variant is parent document retrieval. You chunk and embed at fine granularity (for example, 200-token paragraphs for precise matching), but at retrieval time you return the entire parent document or section instead of just the matched chunk. This gives the model far more context than the matched paragraph alone.
For example, a 10 page incident report is chunked into 50 paragraphs. The query "What was the root cause?" matches paragraph 23. Instead of returning just that paragraph, the system returns the entire "Root Cause Analysis" section (2,000 tokens). The model gets all supporting details, timeline, and evidence that surround the specific sentence that matched.
The cost is token budget. If you return 10 parent documents at 2,000 tokens each, you consume 20k tokens. This only works when parent documents are reasonably sized (under 2,000 to 3,000 tokens) and queries are specific enough that you do not need many parents. For broad exploratory queries, parent retrieval wastes tokens on mostly irrelevant context.
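A sketch of the lookup step, assuming the `parent_id` links from the ingestion sketch, an assumed `chunks_by_id` store, and the same hypothetical `search()`/`token_count()` helpers. Deduplication matters because several matched paragraphs often share one parent:

```python
def retrieve_with_parents(query: str, k: int = 10,
                          max_parent_tokens: int = 3_000) -> list[Chunk]:
    """Match at paragraph granularity, return each match's parent section."""
    matches = search("paragraph", query, k=k)       # assumed helper
    results, seen = [], set()
    for m in matches:
        parent = chunks_by_id[m.parent_id]          # assumed id -> Chunk store
        if parent.id in seen:
            continue                                # many hits, one parent
        seen.add(parent.id)
        if token_count(parent.text) > max_parent_tokens:
            results.append(m)                       # parent too big: fall back
        else:
            results.append(parent)
    return results
```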
Recursive Retrieval for Code and APIs:
For codebases, a specialized form of hierarchical retrieval follows import and dependency graphs. When a user asks about a function, the system retrieves the function definition chunk, then recursively retrieves chunks for any functions or classes it imports, then retrieves configuration or data schemas those depend on.
This can quickly explode: a single function might transitively import 20 other files. The solution is depth limiting and relevance filtering. Set a maximum depth (for example, 2 levels of imports) and at each level only follow the top 3 to 5 most relevant dependencies based on embedding similarity to the original query. This keeps total chunks bounded while still providing critical context that naive flat retrieval would miss.
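A sketch of that depth-limited walk, where `deps()` (returning the chunks a code chunk imports) and `similarity()` (embedding similarity to the original query) are stand-ins for whatever static analysis and embedding model you actually use:

```python
def retrieve_recursive(query: str, seed: Chunk,
                       max_depth: int = 2, fanout: int = 3) -> list[Chunk]:
    """Follow the import graph from `seed`, keeping only the most
    query-relevant dependencies at each level."""
    collected, frontier, visited = [seed], [seed], {seed.id}
    for _ in range(max_depth):
        next_frontier = []
        for chunk in frontier:
            candidates = [d for d in deps(chunk) if d.id not in visited]
            # Relevance filter: keep the top-`fanout` dependencies by
            # similarity to the *original* query, not to the current chunk.
            candidates.sort(key=lambda d: similarity(query, d), reverse=True)
            for dep in candidates[:fanout]:
                visited.add(dep.id)
                collected.append(dep)
                next_frontier.append(dep)
        frontier = next_frontier
    return collected
```

With `max_depth=2` and `fanout=3`, the walk retrieves at most 1 + 3 + 9 = 13 chunks, keeping the total bounded even when a function transitively imports dozens of files.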
1. Ingestion: Chunk each document at three levels: full doc (~5k tokens), sections (~1k tokens), paragraphs (~300 tokens). Embed and index all levels with parent/child links.
2. Retrieval: Query all three indexes simultaneously. Retrieve 50 paragraphs, 20 sections, and 5 full docs in parallel (50 to 80 ms total).
3. Reranking: Score all candidates together. Use a cross-encoder to select the best 2 full docs, 8 sections, and 10 paragraphs based on relevance and diversity (see the sketch after this list).
4. Assembly: Arrange chunks hierarchically in context: full docs first (broad context), then sections, then paragraphs. Total ~50k tokens across all levels.
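A sketch of steps 3 and 4, assuming a hypothetical `cross_encoder_score(query, text)` function and the per-level quotas from step 3. Here "diversity" is enforced only via the per-level quotas; a fuller implementation might add something like MMR within each level:

```python
QUOTAS = {"document": 2, "section": 8, "paragraph": 10}
LEVEL_ORDER = ["document", "section", "paragraph"]   # broad context first

def rerank_and_assemble(query: str, candidates: list[Chunk]) -> str:
    # Step 3: score every candidate jointly with an (assumed) cross-encoder.
    scored = sorted(candidates,
                    key=lambda c: cross_encoder_score(query, c.text),
                    reverse=True)
    taken = {level: 0 for level in QUOTAS}
    kept = []
    for c in scored:
        if taken[c.level] < QUOTAS[c.level]:
            kept.append(c)
            taken[c.level] += 1
    # Step 4: assemble hierarchically: full docs, then sections, then paragraphs.
    kept.sort(key=lambda c: LEVEL_ORDER.index(c.level))
    return "\n\n---\n\n".join(c.text for c in kept)
```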
[Infographic] Context Allocation: Hierarchical vs Flat — 3 levels, 50k tokens of mixed granularity, roughly +15% answer quality versus flat retrieval.
✓ In Practice: Large companies like Google and Meta use hierarchical retrieval for internal documentation systems serving tens of thousands of engineers. The 10 to 15 percent improvement in answer quality for complex technical queries justifies the 2x to 3x increase in infrastructure cost, because engineer productivity gains far exceed the compute spend.
💡 Key Takeaways
✓ Hierarchical retrieval chunks at multiple granularities (full docs, sections, paragraphs) and retrieves at the appropriate level, improving answer quality by 10 to 15 percent for complex queries
✓ Multi-level context allocation spends tokens strategically: 10k on full documents for broad understanding, 30k on sections for details, 10k on paragraphs for specific facts
✓ Parent document retrieval embeds fine-grained chunks (200 tokens) for precision but returns entire parent sections (2,000 tokens) for context completeness
✓ Recursive retrieval for code follows import graphs with depth limiting (2 levels max) and relevance filtering (top 3 to 5 dependencies per level) to avoid context explosion
✓ Maintaining three index levels triples storage and ingestion cost, justified only when answer quality gains translate to measurable business value
📌 Examples
1. Technical documentation systems retrieve both high-level architecture docs (5k tokens) and specific error code references (300 tokens) for debugging queries, providing comprehensive context
2. Code Q&A systems retrieve a function definition (400 tokens), its imported utilities (800 tokens), and relevant config schemas (600 tokens), following the dependency graph with a depth limit of 2
3. Google and Meta internal assistants use hierarchical retrieval for 50k+ engineers, accepting 2x to 3x infrastructure cost for a 15 percent quality improvement on complex technical queries