What is Chunking in LLM Systems?
The Core Problem
Modern LLMs can process only a finite amount of text in a single request. Even the largest production models accept context windows on the order of 4,000 to 200,000 tokens per call, so an enterprise knowledge base with millions of documents and billions of tokens cannot possibly fit. A concrete example: suppose you have a 10,000-page employee handbook totaling 5 million tokens. When an employee asks "What's the parental leave policy?", the system needs to show the LLM only the relevant 2 to 5 pages out of those 10,000. Chunking prepares the handbook so the system can quickly find and retrieve just those sections.
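The mismatch is easy to see with back-of-the-envelope arithmetic. The figures below are the handbook numbers from the example above, with an assumed average of 500 tokens per page:

```python
# Rough arithmetic behind the mismatch (handbook figures from the example).
handbook_tokens = 5_000_000   # ~10,000 pages at ~500 tokens/page
context_window = 200_000      # generous upper bound for current models

# Even the largest window holds only a small fraction of the handbook.
fraction = context_window / handbook_tokens
print(f"{fraction:.0%} of the handbook fits per request")  # → 4%

# The actually relevant slice (2-5 pages) is tiny by comparison.
relevant_tokens = 5 * 500     # ~5 pages at ~500 tokens/page
print(f"relevant slice: {relevant_tokens} tokens "
      f"({relevant_tokens / handbook_tokens:.2%} of the corpus)")
```

At most 4% of the handbook fits in a single call, while the answer lives in roughly 0.05% of it; retrieval's job is to find that 0.05%.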
How It Works
The chunking pipeline runs offline during document ingestion. A 2,000-token design document might be split into 4 to 6 chunks of 300 to 600 tokens each. Each chunk becomes a searchable unit: it is converted to a vector embedding and stored in a database alongside metadata such as document ID, section title, and timestamps. At query time, the system embeds the user's question, searches the chunk database for the 10 to 40 most relevant chunks, and assembles them into the context window along with instructions and conversation history. The LLM then generates an answer from only those retrieved chunks, not the entire corpus.
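The two phases can be sketched end to end. This is a minimal toy, not a production pipeline: word counts stand in for a real tokenizer, a bag-of-words vector with cosine similarity stands in for a learned embedding model, and the sample document and chunk sizes are invented for illustration.

```python
import math
import re
from collections import Counter

def chunk(text, size=400, overlap=50):
    """Split text into word-based chunks of roughly `size` 'tokens',
    overlapping by `overlap` so content cut at a boundary survives
    in at least one chunk. (Word count stands in for a tokenizer.)"""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would call an
    embedding model here and store the vector in a vector database."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Query time: embed the question, return the k most similar chunks."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

# Ingestion (offline): split a made-up document and index each chunk.
doc = ("Employees accrue vacation monthly. " * 30
       + "Parental leave lasts sixteen weeks at full pay. " * 30
       + "Expense reports are due within thirty days. " * 30)
chunks = chunk(doc, size=80, overlap=10)

# Query time: fetch the best chunks for the user's question.
top = retrieve("What is the parental leave policy?", chunks, k=2)
print("Parental leave" in top[0])  # → True
```

The retrieved `top` chunks are what would be pasted into the context window, alongside the system instructions and conversation history, before calling the LLM.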
Why This Matters
Without chunking, you face an impossible choice: either send entire documents (wasting tokens and money on irrelevant content) or send nothing (the model has no information to work with). Chunking lets you balance precision, cost, and answer quality by retrieving just enough context for the model to succeed.