Advanced: Hierarchical Retrieval and Multi-Stage Context
How Multi-Level Retrieval Works
Suppose a user asks "What were the performance goals for the 2024 roadmap?" A single-level system retrieves paragraph chunks mentioning performance and 2024, but might miss the executive summary that contextualizes those goals. A hierarchical system retrieves both: the executive summary chunk (document level) and the specific performance-metrics paragraph (paragraph level). The implementation uses a two-pass retrieval strategy. The first pass retrieves candidates at all levels: perhaps 50 paragraph chunks, 20 section chunks, and 5 full-document chunks. The second pass re-ranks them together and selects a diverse set. The key is context budget allocation: you might spend 10k tokens on full-document chunks (providing broad context), 30k tokens on section chunks (providing detailed coverage), and 10k tokens on paragraph chunks (providing specific facts).
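The second pass and the budget allocation can be sketched as below. This is a minimal illustration, not a full retriever: the `Chunk` dataclass, the level names, and the greedy fill-by-score policy are assumptions, with the 10k/30k/10k split taken from the example above.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    level: str      # "document", "section", or "paragraph"
    score: float    # relevance score from the first-pass retriever
    tokens: int

# Per-level token budgets from the example: 10k / 30k / 10k.
BUDGETS = {"document": 10_000, "section": 30_000, "paragraph": 10_000}

def allocate_context(candidates: list[Chunk]) -> list[Chunk]:
    """Second pass: rank all candidates together, then greedily fill
    each level's token budget in descending score order."""
    spent = {level: 0 for level in BUDGETS}
    selected = []
    for chunk in sorted(candidates, key=lambda c: c.score, reverse=True):
        if spent[chunk.level] + chunk.tokens <= BUDGETS[chunk.level]:
            selected.append(chunk)
            spent[chunk.level] += chunk.tokens
    return selected
```

A real system would re-score the pooled candidates with a cross-encoder before this step; the budget logic stays the same either way.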
When Hierarchical Retrieval Wins
This approach shines for complex questions that require both high-level understanding and specific details. For example, debugging questions in technical documentation: "Why is my API call failing with error 403?" benefits from both the high-level authentication architecture doc (to understand the auth flow) and the specific error-code reference (to see what 403 means in this context). The trade-off is operational complexity. You now maintain three indexes instead of one, tripling storage and ingestion cost. Retrieval latency increases slightly because you query multiple indexes, though parallel execution keeps the overhead to 10 to 20 ms. Re-ranking across different chunk sizes is also tricky: how do you compare the relevance of a 300-token paragraph to a 5,000-token full document? Most systems normalize scores by chunk size and apply learned weights.
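One way to make scores comparable across chunk sizes is a length penalty plus a per-level weight. The log-dampening formula and the weight values below are illustrative assumptions; in practice the weights would be fit on labeled relevance data.

```python
import math

# Hypothetical learned weights per level (would be fit on relevance labels).
LEVEL_WEIGHTS = {"paragraph": 1.0, "section": 0.9, "document": 0.75}

def normalized_score(raw_score: float, tokens: int, level: str) -> float:
    """Dampen the advantage long chunks gain from matching more terms
    by dividing by log token count, then apply the level weight."""
    return LEVEL_WEIGHTS[level] * raw_score / math.log2(tokens + 1)
```

With this normalization, a 300-token paragraph and a 5,000-token document that received the same raw score no longer tie: the shorter, more precise match ranks higher.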
Parent Document Retrieval
A simpler variant is parent document retrieval. You chunk and embed at fine granularity (for example, 200-token paragraphs for precise matching), but at retrieval time you return the entire parent document or section instead of just the matched chunk. This gives the model much more context than the specific paragraph that matched. For example, a 10-page incident report is chunked into 50 paragraphs. The query "What was the root cause?" matches paragraph 23. Instead of returning just that paragraph, the system returns the entire "Root Cause Analysis" section (2,000 tokens). The model gets all the supporting details, timeline, and evidence that surround the specific sentence that matched. The cost is token budget. If you return 10 parent documents at 2,000 tokens each, you consume 20k tokens. This only works when parent documents are reasonably sized (under 2,000 to 3,000 tokens) and queries are specific enough that you do not need many parents. For broad exploratory queries, parent retrieval wastes tokens on mostly irrelevant context.
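The chunk-to-parent lookup can be sketched as follows. The data shapes here (a chunk-to-parent mapping, parent records with token counts) are assumptions; libraries such as LangChain ship a packaged version of this pattern, but the core logic is just deduplication plus the size limits discussed above.

```python
def parent_document_retrieve(matched_chunk_ids, chunk_to_parent, parents,
                             max_parent_tokens=3000, max_parents=10):
    """Map matched fine-grained chunks to their parent sections,
    deduplicate, and return parents that fit the size limit.
    `matched_chunk_ids` is assumed ordered by relevance."""
    seen, results = set(), []
    for chunk_id in matched_chunk_ids:
        parent_id = chunk_to_parent[chunk_id]
        if parent_id in seen:
            continue                      # two chunks, same parent: return it once
        seen.add(parent_id)
        parent = parents[parent_id]
        if parent["tokens"] <= max_parent_tokens:
            results.append(parent)
        if len(results) >= max_parents:
            break
    return results
```

Note the deduplication: if paragraphs 23 and 24 both match, the "Root Cause Analysis" section is returned once, not twice.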
Recursive Retrieval for Code and APIs
For codebases, a specialized form of hierarchical retrieval follows import and dependency graphs. When a user asks about a function, the system retrieves the function definition chunk, then recursively retrieves chunks for any functions or classes it imports, then retrieves configuration or data schemas those depend on. This can quickly explode: a single function might transitively import 20 other files. The solution is depth limiting and relevance filtering. Set a maximum depth (for example, 2 levels of imports) and at each level only follow the top 3 to 5 most relevant dependencies based on embedding similarity to the original query. This keeps total chunks bounded while still providing critical context that naive flat retrieval would miss.
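Depth limiting and relevance filtering together amount to a bounded breadth-first walk of the dependency graph. In this sketch the graph and similarity scores are plain dicts for illustration; in a real system they would come from static analysis of imports and from embedding similarity against the query.

```python
def recursive_retrieve(root_id, deps, similarity, max_depth=2, fanout=3):
    """Follow the dependency graph from the matched function chunk,
    keeping only the `fanout` most query-similar deps at each level.

    deps:       chunk id -> list of chunk ids it imports
    similarity: chunk id -> embedding similarity to the original query
    """
    collected, frontier, seen = [root_id], [root_id], {root_id}
    for _ in range(max_depth):
        next_frontier = []
        for node in frontier:
            children = [d for d in deps.get(node, []) if d not in seen]
            children.sort(key=lambda d: similarity[d], reverse=True)
            for child in children[:fanout]:   # relevance filtering
                seen.add(child)
                collected.append(child)
                next_frontier.append(child)
        frontier = next_frontier                # depth limiting
    return collected
```

With `max_depth=2` and `fanout=3`, a function that transitively imports 20 files yields at most 1 + 3 + 9 = 13 chunks, keeping the context bounded.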