Chunking Trade-offs: When to Choose What
When Small Chunks Win
Use 150 to 300 token chunks when your queries are precise and documents are dense with distinct topics. For example, a medical knowledge base with thousands of drug monographs benefits from small chunks because each query targets a specific drug. Small chunks improve retrieval precision: you get exactly the relevant paragraph without dragging along unrelated sections. The math matters here: with a 32k context window and 200 token chunks, you can fit 100 to 150 chunks after accounting for instructions and history, and that diversity helps when the answer requires synthesizing information from many sources. However, small chunks fail catastrophically with cross-references. If a legal document says "see section 4.2 for exceptions" and section 4.2 lands in a different chunk, the model will miss the exceptions and generate incorrect answers.
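The budget arithmetic above is easy to sketch. A minimal back-of-the-envelope calculation, where the 6,000-token overhead reserved for instructions and history is an illustrative assumption, not a fixed figure:

```python
def chunks_that_fit(context_window: int, chunk_tokens: int,
                    overhead_tokens: int) -> int:
    # Tokens left for retrieved context after prompt instructions
    # and conversation history are reserved.
    available = context_window - overhead_tokens
    return max(0, available // chunk_tokens)

# 32k window, 200-token chunks, ~6k reserved for instructions and history
print(chunks_that_fit(32_000, 200, 6_000))  # 130 chunks
```

Varying the overhead between roughly 2k and 12k tokens yields the 100-to-150-chunk range cited above.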
When Large Chunks Win
Use 800 to 1,200 token chunks when documents have strong internal dependencies or your queries are exploratory. For example, code documentation that references imports, configuration files, and API contracts in a single explanation needs large chunks to keep everything together. Large chunks also help with narrative documents like design docs or incident reports, where understanding requires reading several paragraphs in sequence. The trade-off is reduced diversity: with 128k tokens and 1,000 token chunks, you fit only 80 to 100 chunks after other allocations. You are betting that depth on fewer sources beats breadth across many sources.
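One common way to build such chunks is to merge consecutive paragraphs greedily up to a token budget, so dependent passages stay together. A minimal sketch, using whitespace-separated word count as a rough stand-in for a real tokenizer:

```python
def merge_paragraphs(paragraphs: list[str], max_tokens: int = 1_000) -> list[str]:
    # Greedily pack consecutive paragraphs into one chunk until the
    # budget is hit, then start a new chunk. Word count approximates
    # token count here; swap in a real tokenizer for production use.
    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for para in paragraphs:
        size = len(para.split())
        if current and used + size > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because paragraphs are merged in document order, a narrative that spans several paragraphs lands in one chunk whenever it fits the budget.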
Overlap and Its Cost
Overlap is insurance against boundary problems, but it comes with real infrastructure cost. A 20 percent overlap on 500 million chunks means 100 million extra vectors to store, embed, and search. At 1,536 dimensions per vector and 4 bytes per float, that is roughly 600 GB of additional index data. The decision rule: use overlap when boundary loss would cause serious errors (legal, medical, financial documents) and you can absorb the cost. Skip overlap for high-volume, low-stakes corpora like customer support tickets or internal chat logs where occasional boundary loss is acceptable.
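The storage math is easy to verify. A minimal sketch using the figures from the text (1,536 dimensions, 4-byte floats, decimal gigabytes); the exact result, 614.4 GB, is the ~600 GB figure above before rounding:

```python
def overlap_overhead_gb(base_chunks: int, overlap_fraction: float,
                        dims: int = 1536, bytes_per_dim: int = 4) -> float:
    # Extra vectors created by overlap, times bytes per embedding,
    # in decimal GB. Counts raw vector data only, not index structures.
    extra_vectors = int(base_chunks * overlap_fraction)
    return extra_vectors * dims * bytes_per_dim / 1e9

print(overlap_overhead_gb(500_000_000, 0.20))  # 614.4 GB
```

Note this counts only the embeddings themselves; graph-based indexes such as HNSW add further per-vector overhead on top.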
Fixed vs Semantic: The Real Trade-off
Fixed-length chunking is the default for systems prioritizing operational simplicity and scale. It handles 100 million documents per day without parsing complexity, produces predictable token counts for budgeting, and never fails on malformed input. Use fixed chunking when you have massive throughput requirements or highly variable document quality. Semantic chunking is worth the complexity when answer quality directly impacts business metrics and you can invest in robust parsing infrastructure. The 5 to 15 percent quality improvement matters when you are measuring user satisfaction, support ticket deflection, or compliance accuracy. However, you need to cap maximum chunk size to prevent variable chunk sizes from breaking token budgets.
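Capping can be as simple as hard-splitting any semantically derived chunk that exceeds the budget. A minimal sketch under the same word-count-as-token-count assumption as before; a production system would split on a real tokenizer's boundaries:

```python
def cap_chunk_sizes(chunks: list[str], max_tokens: int) -> list[str]:
    # Hard-split any oversized chunk so downstream token accounting
    # stays predictable, even when the semantic chunker emits a
    # section far larger than the budget.
    capped: list[str] = []
    for chunk in chunks:
        words = chunk.split()
        while len(words) > max_tokens:
            capped.append(" ".join(words[:max_tokens]))
            words = words[max_tokens:]
        if words:
            capped.append(" ".join(words))
    return capped
```

The hard split sacrifices some semantic coherence on the oversized chunks, which is exactly the trade the text describes: predictable budgets over perfect boundaries.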