Context Window Management at Scale
Token Budgeting in Practice
Production systems explicitly allocate token budgets before making LLM calls. A typical breakdown for a 128k-context model serving an internal assistant might look like this: 4k tokens for system prompts and tool schemas, 2k for the current user question and clarifications, 80k for retrieved document chunks, and 42k for recent conversation history. The math drives architecture decisions. If each chunk averages 500 tokens and you have 80k tokens available, you can theoretically fit 160 chunks. However, most systems cap at 10 to 40 chunks to limit redundancy and reduce the reading burden on the model. More chunks do not always mean better answers: beyond a threshold, the model struggles to synthesize information and may anchor on spurious details buried in the mass of context.
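The arithmetic above can be sketched directly. The budget names, average chunk size, and cap below are illustrative values from the example, not a fixed API:

```python
# Explicit token budget for a 128k-context model, using the allocation above.
CONTEXT_LIMIT = 128_000

BUDGET = {
    "system_and_tools": 4_000,   # system prompts + tool schemas
    "user_question": 2_000,      # current question and clarifications
    "retrieved_chunks": 80_000,  # retrieved document chunks
    "history": 42_000,           # recent conversation history
}
assert sum(BUDGET.values()) <= CONTEXT_LIMIT

AVG_CHUNK_TOKENS = 500
MAX_CHUNKS = 40  # practical cap, well below the theoretical 160


def chunks_that_fit(budget: int, avg_chunk_tokens: int, cap: int) -> int:
    """Number of chunks to request: the theoretical fit, bounded by the cap."""
    theoretical = budget // avg_chunk_tokens  # 80_000 // 500 = 160
    return min(theoretical, cap)


print(chunks_that_fit(BUDGET["retrieved_chunks"], AVG_CHUNK_TOKENS, MAX_CHUNKS))  # 40
```

Keeping the budget as explicit named constants makes it easy to audit that the pieces sum to the model's limit before any call is made.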
Truncation and Summarization
For long-running conversations, naively appending every message quickly exhausts the history budget. After 50 to 100 turns, you might have 60k tokens of conversation alone. Systems use several strategies to manage this growth. Rolling truncation drops the oldest messages once you exceed the budget, keeping only the most recent 30 to 50 turns. This works for short sessions but loses important context in multi-hour interactions. Selective preservation keeps the first few turns (which often contain critical setup) and the last few turns (the immediate context) while removing the middle. Summarization periodically condenses old messages: every 10 to 20 turns, use the LLM itself to generate a 200 to 500 token summary of the conversation so far, then replace those turns with the summary.
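A minimal sketch of the three strategies, assuming messages arrive with precomputed token counts and that `summarize_fn` is a hypothetical callable wrapping an LLM summarization call:

```python
from dataclasses import dataclass


@dataclass
class Msg:
    role: str
    text: str
    tokens: int  # assumed precomputed by the tokenizer


def rolling_truncate(history: list[Msg], budget: int) -> list[Msg]:
    """Keep only the most recent messages that fit within the token budget."""
    kept, total = [], 0
    for msg in reversed(history):
        if total + msg.tokens > budget:
            break
        kept.append(msg)
        total += msg.tokens
    return list(reversed(kept))


def selective_preserve(history: list[Msg], head: int = 4, tail: int = 10) -> list[Msg]:
    """Keep the first `head` turns (setup) and last `tail` turns, drop the middle."""
    if len(history) <= head + tail:
        return history
    return history[:head] + history[-tail:]


def summarize_old_turns(history: list[Msg], keep_recent: int, summarize_fn) -> list[Msg]:
    """Replace all but the last `keep_recent` turns with an LLM-generated summary.

    `summarize_fn` (hypothetical) takes the old messages and returns a
    200-500 token summary string.
    """
    if len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    summary = Msg("system", f"Summary of earlier conversation: {summarize_fn(old)}", tokens=400)
    return [summary] + recent
```

In practice these compose: a system might summarize every 10 to 20 turns and still apply rolling truncation as a hard backstop when the budget is exceeded.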
Two-Tier Memory Architecture
For agents or chatbots that span thousands of turns, some companies implement separate short-term and long-term memory. Short-term memory uses the raw context window with sliding history management. Long-term memory stores past interactions as chunks in a vector index, treated identically to document chunks. When a user asks a question, the system retrieves both relevant documents and relevant past conversation segments from the long-term index. This keeps per-request context bounded (still within the 128k window) while creating the illusion of unlimited memory. The trade-off is added complexity: you now maintain two retrieval systems and must tune how many tokens to allocate to each memory tier.
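One way to sketch the two-tier assembly, assuming hypothetical index objects whose `search` method yields `(text, token_count)` pairs in relevance order, with illustrative per-tier budgets:

```python
def assemble_context(query: str,
                     doc_index, convo_index,
                     doc_budget: int = 60_000,
                     memory_budget: int = 20_000) -> list[str]:
    """Fill each memory tier's token budget independently, then concatenate.

    Both indexes are assumed to expose search(query) -> iterable of
    (text, token_count), most relevant first.
    """
    context = []
    for index, budget in ((doc_index, doc_budget), (convo_index, memory_budget)):
        used = 0
        for text, tokens in index.search(query):
            if used + tokens > budget:
                break  # this tier's allocation is exhausted
            context.append(text)
            used += tokens
    return context
```

Because each tier gets its own budget, tuning the document/memory split becomes a single pair of parameters rather than an implicit property of the retrieval pipeline.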