Vocabulary Size Trade-offs and Sequence Length Impact
Vocabulary Size Trade-offs
Smaller vocabulary (16K-32K tokens): Each token appears more frequently in the training data, so the model learns better representations from limited data. But rare words split into many subword tokens, inflating sequence length: "unhappiness" might become ["un", "ha", "pp", "i", "ness"], five tokens for a single word.
Larger vocabulary (64K-100K tokens): Rare words stay intact as single tokens. Sequence lengths stay shorter, fitting more content in the context window. But each token appears less often in training data, requiring more data for good representations. Also increases memory for the embedding table.
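The splitting behavior described above can be sketched with a greedy longest-match segmenter. This is a simplification of real BPE/WordPiece merging, and the two vocabularies below are hypothetical stand-ins for a small and a large learned vocabulary, but it shows how vocabulary size changes token counts:

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match subword segmentation (illustrative only;
    real BPE/WordPiece apply learned merge rules instead)."""
    tokens, i = [], 0
    while i < len(word):
        # Find the longest vocabulary entry matching at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to single characters
            i += 1
    return tokens

# Hypothetical vocabularies: the small one knows only short fragments,
# the large one has learned longer merges for this word.
small_vocab = {"un", "ha", "pp", "i", "ness"}
large_vocab = {"un", "happi", "ness", "unhappiness"}

print(greedy_tokenize("unhappiness", small_vocab))  # ['un', 'ha', 'pp', 'i', 'ness']
print(greedy_tokenize("unhappiness", large_vocab))  # ['unhappiness']
```

The same text costs five tokens under the small vocabulary and one under the large: exactly the sequence-length trade-off described above.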
Sequence Length Impact
Language models have fixed context windows; common sizes are 2,048, 4,096, or 8,192 tokens. If your tokenizer produces more tokens per word, you fit fewer words in the context. This directly impacts model capability: a document that fits in the context window can be summarized; one that does not cannot.
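A back-of-the-envelope way to see this: divide the context window by the tokenizer's tokens-per-word ratio. The ratios below are illustrative assumptions, not measurements:

```python
def words_in_context(context_tokens, tokens_per_word):
    """How many words fit in a fixed context window, roughly."""
    return int(context_tokens / tokens_per_word)

# Assumed ratios: ~1.3 tokens/word for an efficient tokenizer,
# ~2.5 tokens/word for one that fragments words heavily.
print(words_in_context(4096, 1.3))  # 3150 words
print(words_in_context(4096, 2.5))  # 1638 words
```

With the same 4,096-token window, the fragmenting tokenizer fits roughly half the text.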
Memory Considerations
The embedding table stores one vector per vocabulary entry. At 768 dimensions with 32-bit floats, each entry is 3KB (768 × 4 bytes). A 50K vocabulary needs roughly 150MB just for embeddings. Doubling the vocabulary to 100K doubles this to roughly 300MB.
For edge deployment on mobile or IoT devices, this matters. A 16K vocabulary with 256 dimensions fits in 16MB, enabling on-device inference. On production servers with ample memory, the difference is rarely a concern.
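The figures above follow from a one-line calculation, sketched here for both the server and edge configurations (fp32 assumed, decimal megabytes):

```python
def embedding_table_mb(vocab_size, dim, bytes_per_param=4):
    """Memory for the embedding table alone: entries x dims x bytes, in MB."""
    return vocab_size * dim * bytes_per_param / 1e6

print(f"{embedding_table_mb(50_000, 768):.1f} MB")   # 153.6 MB (~150MB above)
print(f"{embedding_table_mb(100_000, 768):.1f} MB")  # 307.2 MB (~300MB above)
print(f"{embedding_table_mb(16_000, 256):.1f} MB")   # 16.4 MB edge configuration
```

Note this counts only the input embedding table; tied output embeddings share it, but an untied output projection of the same shape doubles the cost again.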
Multilingual Vocabulary
Covering 100+ languages requires larger vocabularies, since each language needs representation. A 32K vocabulary trained on skewed data splits unevenly: English might get 15K tokens while Hindi gets 500. Hindi text then ends up with 3-4× more tokens per word than English, severely limiting its effective context length.
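The effective context penalty can be quantified with the same tokens-per-word arithmetic as before. The English ratio here is an assumed baseline, and the Hindi ratio is derived from the 3-4× range stated above:

```python
context = 4096            # tokens in the context window
english_tpw = 1.3         # assumed English tokens-per-word
hindi_tpw = english_tpw * 3.5  # midpoint of the 3-4x penalty above

print(int(context / english_tpw))  # 3150 English words fit
print(int(context / hindi_tpw))    # 900 Hindi words fit
```

The same window holds roughly 3.5× less Hindi text than English text, which is what "effective context length" means in practice.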
Multilingual models often use 100K-250K tokens to give adequate coverage. This increases model size but ensures no language is severely penalized on sequence length.