
What is Chunking in LLM Systems?

Definition
Chunking is the process of splitting large documents into smaller, retrievable units that can fit within an LLM's limited context window while preserving enough local context for the model to reason effectively.
The Core Problem
Modern LLMs can only process a finite amount of text in a single request. Even the largest production models accept between 4,000 and 200,000 tokens per call. Your enterprise knowledge base, with millions of documents and billions of tokens, cannot possibly fit. A concrete example: suppose you have a 10,000-page employee handbook totaling 5 million tokens. When an employee asks "What's the parental leave policy?", the system needs to show the LLM only the relevant 2 to 5 pages out of those 10,000. Chunking prepares the handbook so the system can quickly find and retrieve just those relevant sections.

How It Works
The chunking pipeline runs offline during document ingestion. A 2,000-token design document might be split into 4 to 6 chunks of 300 to 600 tokens each. Each chunk becomes a searchable unit: it gets converted to a vector embedding and stored in a database alongside metadata like document ID, section title, and timestamps. At query time, the system embeds the user's question, searches the chunk database for the most relevant 10 to 40 chunks, and assembles them into the context window along with instructions and conversation history. The LLM then generates an answer based only on those retrieved chunks, not the entire corpus.

Why This Matters
Without chunking, you face an impossible choice: either send entire documents (wasting tokens and money on irrelevant content) or send nothing (the model has no information to work with). Chunking lets you balance precision, cost, and answer quality by retrieving just enough context for the model to succeed.
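To make the pipeline concrete, here is a minimal, self-contained sketch in Python. All names in it (embed, Chunk, chunk_document, retrieve) are illustrative, not a real library API: the embedding function is a toy stand-in for a real embedding model, and a production system would use a vector database rather than an in-memory list.

```python
import math
from dataclasses import dataclass, field

# Toy embedding: a bag-of-characters vector, normalized to unit length.
# A real system would call an embedding model here instead.
def embed(text: str) -> list[float]:
    vec = [0.0] * 64
    for ch in text.lower():
        vec[ord(ch) % 64] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

@dataclass
class Chunk:
    doc_id: str
    section: str
    text: str
    embedding: list[float] = field(default_factory=list)

def chunk_document(doc_id: str, section: str, text: str,
                   max_tokens: int = 400) -> list[Chunk]:
    """Split a document into roughly max_tokens-sized chunks
    (whitespace words stand in for real tokenizer tokens)."""
    words = text.split()
    chunks = []
    for start in range(0, len(words), max_tokens):
        piece = " ".join(words[start:start + max_tokens])
        chunks.append(Chunk(doc_id, section, piece, embed(piece)))
    return chunks

# --- Offline ingestion: chunk the corpus and build the index ---------------
index: list[Chunk] = []
index += chunk_document("handbook", "Leave policies",
                        "Employees are entitled to parental leave ... " * 50)

# --- Online retrieval: embed the query, rank chunks, assemble context ------
def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def retrieve(question: str, k: int = 5) -> list[Chunk]:
    q = embed(question)
    return sorted(index, key=lambda c: cosine(q, c.embedding), reverse=True)[:k]

top_chunks = retrieve("What's the parental leave policy?")
context = "\n\n".join(c.text for c in top_chunks)
# `context`, the question, and instructions would then be sent to the LLM.
```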
💡 Key Takeaways
Context windows are limited: even 128k token models cannot hold an entire knowledge base, requiring selective retrieval of relevant sections
Chunks become the unit of retrieval: each chunk is embedded as a vector and stored in a searchable index for fast lookup at query time
Typical chunk sizes range from 150 to 1,000 tokens, depending on the context window size and how many perspectives you want to fit (see the budget sketch after this list)
Chunking happens offline during ingestion, while retrieval happens online within strict latency budgets of 50 to 100 ms p95
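As a rough illustration of that sizing trade-off, the sketch below computes how many retrieved chunks of a given size fit once room is reserved for instructions, conversation history, and the model's answer. The specific reserve figures are assumptions chosen for the example, not numbers from this article.

```python
def chunks_that_fit(context_window: int, chunk_tokens: int,
                    instructions: int = 1_500, history: int = 4_000,
                    output_reserve: int = 2_000) -> int:
    """How many retrieved chunks of a given size fit in the context window
    after reserving space for instructions, history, and the answer."""
    available = context_window - instructions - history - output_reserve
    return max(available // chunk_tokens, 0)

# An 8k-token model with 500-token chunks leaves room for only 1 chunk,
# while a 128k-token model fits ~241, far more than the 10-40 typically
# retrieved in practice.
print(chunks_that_fit(8_000, 500))    # -> 1
print(chunks_that_fit(128_000, 500))  # -> 241
```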
📌 Examples
1. A 100-million-page internal documentation system chunks each page into 4 to 8 segments, creating 400 to 800 million searchable chunks
2. ChatGPT-style systems chunk conversation history to keep recent messages within the context window while summarizing or dropping older turns, as sketched below
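For the second example, here is a minimal sketch of conversation-history trimming. It assumes whitespace-separated words as a stand-in for a real tokenizer, and trim_history and its budget are hypothetical names; a production system might summarize the dropped turns rather than discarding them.

```python
def trim_history(messages: list[dict], max_tokens: int,
                 count_tokens=lambda m: len(m["content"].split())) -> list[dict]:
    """Keep the most recent messages whose combined token count fits the
    budget; older turns are dropped (or, in a real system, summarized)."""
    kept, used = [], 0
    for msg in reversed(messages):      # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))         # restore chronological order

history = [
    {"role": "user", "content": "Tell me about our leave policies."},
    {"role": "assistant", "content": "We offer parental, sick, and PTO leave."},
    {"role": "user", "content": "How long is parental leave?"},
]
print(trim_history(history, max_tokens=12))  # keeps only the two newest turns
```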