Context Window Management at Scale
The Allocation Problem:
Even after chunking documents into retrievable units, you face a second challenge: deciding how to spend your limited context window tokens across competing needs. A 128k token context window seems large until you realize it must hold system instructions, tool definitions, retrieved document chunks, conversation history, and the user query itself.
Token Budgeting in Practice:
Production systems explicitly allocate token budgets before making LLM calls. A typical breakdown for a 128k context model serving an internal assistant might look like this: 4k tokens for system prompts and tool schemas, 2k for the current user question and clarifications, 80k for retrieved document chunks, and 42k for recent conversation history.
The math drives architecture decisions. If each chunk averages 500 tokens and you have 80k tokens available, you can theoretically fit 160 chunks. However, most systems cap retrieval at 10 to 40 chunks to limit redundancy and reduce the reading burden on the model. Retrieving more chunks does not always mean better answers: beyond a threshold, the model struggles to synthesize information and may anchor on spurious details buried in the mass of context.
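As a minimal sketch of this kind of explicit budgeting, the allocation above can be written down in code before any LLM call. The numbers mirror the example breakdown; count_tokens, pack_chunks, and the chunk cap are illustrative assumptions, not a specific library's API.

```python
# Sketch of an explicit token budget for a 128k-context model.
# The numbers mirror the breakdown above; count_tokens is assumed to wrap
# whatever tokenizer the model uses (e.g. tiktoken for OpenAI models).
from dataclasses import dataclass

@dataclass
class ContextBudget:
    system: int = 4_000        # system prompt + tool schemas
    user: int = 2_000          # current question and clarifications
    documents: int = 80_000    # retrieved document chunks
    history: int = 42_000      # recent conversation turns

    @property
    def total(self) -> int:
        return self.system + self.user + self.documents + self.history

def pack_chunks(chunks, max_tokens, count_tokens, max_chunks=40):
    """Add relevance-ranked chunks until the token budget or the
    practical chunk cap (10 to 40 in most systems) is exhausted."""
    selected, used = [], 0
    for chunk in chunks:                      # assumed sorted by relevance
        tokens = count_tokens(chunk)
        if used + tokens > max_tokens or len(selected) >= max_chunks:
            break
        selected.append(chunk)
        used += tokens
    return selected

assert ContextBudget().total == 128_000       # matches the model's context limit
```

Ranking chunks by relevance before packing means that hitting the cap drops the least useful material first.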
Truncation and Summarization:
For long-running conversations, naively appending every message quickly exhausts the history budget. After 50 to 100 turns, you might have 60k tokens of conversation alone. Systems use several strategies to manage this growth.
Rolling truncation drops the oldest messages once you exceed the budget, keeping only the most recent 30 to 50 turns. This works for short sessions but loses important context in multi-hour interactions. Selective preservation keeps the first few turns (which often contain critical setup) and the last few turns (the immediate context) while removing the middle. Summarization periodically condenses old messages: every 10 to 20 turns, use the LLM itself to generate a 200 to 500 token summary of the conversation so far, then replace those turns with the summary.
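The three strategies might look like the following sketch, assuming each message is a (role, text) pair and that summarize_fn wraps an LLM call producing the 200 to 500 token summary; both helpers are assumptions, not a particular framework's API.

```python
# Sketch of the three history-management strategies described above.
# Each message is a (role, text) tuple; count_tokens and summarize_fn
# are assumed helpers -- summarize_fn would wrap an LLM call in practice.

def rolling_truncation(messages, history_budget, count_tokens):
    """Drop the oldest messages until the remaining history fits the budget."""
    kept = list(messages)
    while kept and sum(count_tokens(text) for _, text in kept) > history_budget:
        kept.pop(0)
    return kept

def selective_preservation(messages, keep_first=4, keep_last=10):
    """Keep the setup turns and the most recent turns, drop the middle."""
    if len(messages) <= keep_first + keep_last:
        return list(messages)
    return messages[:keep_first] + messages[-keep_last:]

def summarize_old_turns(messages, summarize_fn, min_turns=20, keep_recent=10):
    """Once enough turns accumulate, collapse everything but the recent
    tail into a 200-500 token summary generated by the LLM itself."""
    if len(messages) < min_turns:
        return list(messages)
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize_fn(old)   # e.g. "Summarize this conversation in <=500 tokens"
    return [("system", f"Summary of earlier conversation: {summary}")] + recent
```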
Two-Tier Memory Architecture:
For agents or chatbots that span thousands of turns, some companies implement separate short-term and long-term memory. Short-term memory uses the raw context window with sliding history management. Long-term memory stores past interactions as chunks in a vector index, treated identically to document chunks.
When a user asks a question, the system retrieves both relevant documents and relevant past conversation segments from the long-term index. This keeps per-request context bounded (still at most 128k tokens) while creating the illusion of unlimited memory. The trade-off is added complexity: you now maintain two retrieval systems and must tune how many tokens to allocate to each memory tier.
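A per-request assembly step for two-tier memory could look like the sketch below, reusing pack_chunks from the budgeting sketch above. doc_index and memory_index are assumed vector-search clients exposing a search(query, top_k) method, and the 75/25 split of the document budget is purely illustrative.

```python
# Sketch of per-request context assembly with two-tier memory, reusing
# pack_chunks from the budgeting sketch above. doc_index and memory_index
# are assumed vector-search clients exposing search(query, top_k); the
# 75/25 split of the document budget is illustrative, not prescriptive.

def build_context(query, doc_index, memory_index, budget, count_tokens):
    doc_budget = int(budget.documents * 0.75)        # knowledge-base chunks
    memory_budget = budget.documents - doc_budget    # long-term conversation chunks

    doc_chunks = pack_chunks(doc_index.search(query, top_k=100),
                             doc_budget, count_tokens)
    memory_chunks = pack_chunks(memory_index.search(query, top_k=50),
                                memory_budget, count_tokens)

    # Recent raw turns (short-term memory) are handled separately by the
    # sliding-window strategies above and bounded by budget.history.
    return doc_chunks, memory_chunks
```

Tuning that split between document and memory retrieval is exactly the extra knob the added complexity buys you.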
128k Context Budget Example: 4k system prompts, 80k retrieved docs, 42k conversation history (the remaining 2k covers the user query).
✓ In Practice: Large-scale deployments measure latency budgets in milliseconds. If your total target is 800 ms p95 and retrieval takes 50 ms, embedding takes 20 ms, and the LLM call takes 600 ms, you have only 130 ms of headroom. Summarizing old messages adds 200 to 400 ms per summarization call, so you batch summarizations or run them asynchronously between user turns.
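One way to keep summarization off the user-facing critical path is to schedule it between turns as a background task. The sketch below assumes an asyncio-based service; state, llm_call, and summarize_history are placeholder names, not a specific framework's API.

```python
import asyncio

# Sketch: keep the 200-400 ms summarization call off the user-facing path
# by scheduling it as a background task between turns. state, llm_call,
# and summarize_history are placeholder names for assumed components.

async def handle_turn(user_msg, state, llm_call, summarize_history):
    reply = await llm_call(state.context_for(user_msg))   # fits the ~600 ms budget
    state.append(user_msg, reply)
    if state.turns_since_summary >= 20:
        # Fire-and-forget: the summary replaces old turns before the next
        # request arrives, adding no latency to this one.
        asyncio.create_task(summarize_history(state))
    return reply
```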
💡 Key Takeaways
✓ Explicit token budgets prevent context overflow: allocate tokens to system prompts, documents, and history before making LLM calls rather than appending until failure
✓ Practical systems cap retrieved chunks at 10 to 40 even when the budget allows 100+, because more context does not always improve answers and can increase model confusion
✓ Conversation summarization trades 200 to 400 ms of latency per summary for unbounded conversation length, typically triggered every 10 to 20 turns
✓ Two-tier memory splits recent turns (kept in raw context) from historical turns (retrieved from a vector index), keeping per-request tokens bounded while simulating unlimited memory
📌 Examples
1. ChatGPT uses rolling summarization to handle conversations spanning dozens of turns without exceeding context limits, periodically condensing old messages
2. Internal assistants serving 50k engineers might allocate 80k of a 128k context to retrieved documents, comfortably covering the practical cap of 20 to 40 chunks at roughly 500 tokens each