Vocabulary Size Trade-offs and Sequence Length Impact
Vocabulary size is a fundamental design choice that affects sequence length, model size, training dynamics, and inference cost. Larger vocabularies compress text into fewer tokens, reducing attention compute, but they inflate the embedding and output projection matrices and leave rare tokens with fewer training examples.
The math is straightforward. With embedding dimension 768 and vocabulary size 50,000, the embedding table contains 38.4 million parameters. In float16, that's 76.8MB. A 32,000 vocabulary drops this to 49.2MB, saving about 27.6MB per model copy. The output projection, which maps hidden states back to vocabulary logits, has the same shape, doubling the savings. For models with billions of parameters, embedding and output layers can represent 5 to 10 percent of total size.
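A minimal Python sketch of this arithmetic, assuming float16 weights and an untied output projection (the helper function is illustrative, not from any library):

```python
# Back-of-the-envelope memory for the embedding table and output projection.
# Assumes 2 bytes per parameter (float16) and an untied output head with the
# same shape as the embedding table.

def embedding_memory_mb(vocab_size: int, hidden_dim: int, bytes_per_param: int = 2) -> float:
    """Memory in MB for one vocab_size x hidden_dim matrix."""
    return vocab_size * hidden_dim * bytes_per_param / 1e6

for vocab in (32_000, 50_000, 100_000):
    emb = embedding_memory_mb(vocab, 768)
    print(f"vocab={vocab:>7,}  embedding={emb:6.1f} MB  with output head={2 * emb:6.1f} MB")

# vocab= 32,000  embedding=  49.2 MB  with output head=  98.3 MB
# vocab= 50,000  embedding=  76.8 MB  with output head= 153.6 MB
# vocab=100,000  embedding= 153.6 MB  with output head= 307.2 MB
```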
Sequence length impact is more dramatic. Transformer attention scales roughly quadratically with sequence length. A 0.5KB English document (about 512 characters) becomes 512 character-level tokens but only 120 to 150 subword tokens with BPE. That's a 3.4x to 4.2x reduction in sequence length, which translates to roughly a 12x to 18x reduction in attention compute. For a model with 32 attention layers, that saving applies at every layer and dominates inference cost.
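A rough Python sketch of the quadratic effect, assuming about 4 characters per BPE token for English (a typical figure, not a measured one) and counting only the O(L²) attention-score work:

```python
# Compare attention-score compute for character-level vs subword tokenization.
# Ignores the linear projections and feed-forward layers, which scale with L,
# not L^2.

def attention_score_flops(seq_len: int, hidden_dim: int, num_layers: int) -> float:
    # QK^T and the attention-weighted sum over V: ~2 * seq_len^2 * hidden_dim
    # multiply-adds each, per layer.
    return num_layers * 4 * seq_len ** 2 * hidden_dim

doc_chars = 512                  # the 0.5KB document from the text
char_tokens = doc_chars          # one token per character
bpe_tokens = doc_chars // 4      # ~4 characters per English BPE token (assumed)

char_cost = attention_score_flops(char_tokens, hidden_dim=768, num_layers=32)
bpe_cost = attention_score_flops(bpe_tokens, hidden_dim=768, num_layers=32)
print(f"character tokens: {char_tokens}, BPE tokens: {bpe_tokens}")
print(f"attention-score compute ratio: {char_cost / bpe_cost:.0f}x")  # ~16x
```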
The trade-off shifts with context length. For models with a 4,096-token maximum length, a 30,000 to 50,000 vocabulary keeps most documents under the limit. For very long context models with 128,000-token windows, teams sometimes push vocabularies to 100,000 to compress sequences further and keep more content within the window. However, larger vocabularies suffer from sparser statistics during training: rare tokens may appear only hundreds of times in the training corpus, leading to poor embeddings and higher perplexity on tail vocabulary.
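One way to gauge this before committing to a vocabulary is to tokenize a corpus sample and measure the tail; a sketch where `corpus_token_ids` and `vocab_size` are hypothetical inputs supplied by the caller:

```python
# Estimate how thinly a training-corpus sample covers the vocabulary tail.
from collections import Counter

def tail_fraction(corpus_token_ids, vocab_size, min_count=100):
    """Fraction of the vocabulary seen fewer than `min_count` times."""
    counts = Counter(corpus_token_ids)
    rare = sum(1 for token_id in range(vocab_size) if counts[token_id] < min_count)
    return rare / vocab_size

# A 100K vocabulary will typically show a much larger rare-token fraction
# than a 32K vocabulary on the same sample, e.g.:
# print(tail_fraction(sample_ids, vocab_size=100_000))
```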
💡 Key Takeaways
•Vocabulary size directly impacts model size. A 50K vocabulary with 768 dimensions uses about 77MB for embeddings plus another 77MB for the output projection in float16. A 32K vocabulary saves roughly 55MB total per model copy.
•Sequence length dominates transformer cost. Reducing a 512-character (0.5KB) text from 512 character tokens to 120 to 150 subword tokens cuts attention compute by roughly 12x to 18x across all layers.
•Larger vocabularies compress sequences better but create sparse statistics. Rare tokens in a 100K vocabulary may appear only hundreds of times in training, leading to poor embeddings and higher perplexity.
•Context window trade-offs change the calculus. For 128K token context models, larger vocabularies (up to 100K) help fit more content within the window, justifying the sparsity and size costs.
•Multilingual models benefit from larger vocabularies. Meta LLaMA uses a 32K-token vocabulary to balance coverage across many languages. Pure English models can use smaller vocabularies (20K to 30K) without losing compression.
📌 Examples
Google BERT uses a 30K WordPiece vocabulary for English, balancing embedding size (23M parameters at a hidden dimension of 768) against sequence length for 512-token max inputs. This keeps embeddings under 50MB in float16.
OpenAI GPT-3 uses a 50K byte-level BPE vocabulary. For a 512-character (0.5KB) document, this yields roughly 120 to 130 tokens versus 512 character-level tokens. With 96 attention layers, the quadratic savings (4x length reduction = 16x compute reduction per layer) justify the larger embedding table.
Meta LLaMA 2 uses a 32K multilingual vocabulary. This increases embedding size versus an English-only 20K vocabulary, but reduces out-of-vocabulary issues and sequence length for non-whitespace languages like Chinese, improving overall efficiency across 20+ languages.
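A quick way to see these differences in practice is to run the same text through publicly available tokenizers; a sketch assuming the Hugging Face `transformers` package and the public `bert-base-uncased` and `gpt2` checkpoints (GPT-2's tokenizer is the same 50K byte-level BPE family used by GPT-3):

```python
# Compare vocabulary size and sequence length for the same text under
# a 30K WordPiece tokenizer (BERT) and a 50K byte-level BPE tokenizer (GPT-2).
from transformers import AutoTokenizer

text = "Vocabulary size is a fundamental design choice that affects sequence length."

for name in ("bert-base-uncased", "gpt2"):
    tok = AutoTokenizer.from_pretrained(name)
    tokens = tok.tokenize(text)
    print(f"{name:20s} vocab={len(tok):>6,}  tokens={len(tokens):3d}  chars={len(text)}")
```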