Vocabulary Size Trade-offs and Sequence Length Impact
Vocabulary size is a fundamental design choice that affects sequence length, model size, training dynamics, and inference cost. Larger vocabularies compress text into fewer tokens, reducing attention compute, but they inflate the embedding and output projection matrices and leave rare tokens with fewer training examples.
The math is straightforward. With embedding dimension 768 and vocabulary size 50,000, the embedding table contains 38.4 million parameters. In float16, that's 76.8MB. A 32,000 vocabulary drops this to 49.2MB, saving about 27.6MB per model copy. The output projection, which maps hidden states back to vocabulary logits, has the same shape, doubling the savings. For models with billions of parameters, embedding and output layers can represent 5 to 10 percent of total size.
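A minimal Python sketch of this arithmetic, assuming float16 weights and an untied output projection (the helper function is illustrative, not from any library):

```python
# Back-of-the-envelope memory for the embedding table and output projection.
# Assumes 2 bytes per parameter (float16) and an untied output head with the
# same shape as the embedding table.

def embedding_memory_mb(vocab_size: int, hidden_dim: int, bytes_per_param: int = 2) -> float:
    """Memory in MB for one vocab_size x hidden_dim matrix."""
    return vocab_size * hidden_dim * bytes_per_param / 1e6

for vocab in (32_000, 50_000, 100_000):
    emb = embedding_memory_mb(vocab, 768)
    print(f"vocab={vocab:>7,}  embedding={emb:6.1f} MB  with output head={2 * emb:6.1f} MB")

# vocab= 32,000  embedding=  49.2 MB  with output head=  98.3 MB
# vocab= 50,000  embedding=  76.8 MB  with output head= 153.6 MB
# vocab=100,000  embedding= 153.6 MB  with output head= 307.2 MB
```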
Sequence length impact is more dramatic. Transformer attention scales roughly quadratically with sequence length. A 0.5KB English document (about 512 characters) becomes 512 character-level tokens but only 120 to 150 subword tokens with BPE. That's a 3.4x to 4.2x reduction in sequence length, which translates to roughly a 12x to 18x reduction in attention compute. For a model with 32 attention layers, that saving applies at every layer and dominates inference cost.
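A rough Python sketch of the quadratic effect, assuming about 4 characters per BPE token for English (a typical figure, not a measured one) and counting only the O(L²) attention-score work:

```python
# Compare attention-score compute for character-level vs subword tokenization.
# Ignores the linear projections and feed-forward layers, which scale with L,
# not L^2.

def attention_score_flops(seq_len: int, hidden_dim: int, num_layers: int) -> float:
    # QK^T and the attention-weighted sum over V: ~2 * seq_len^2 * hidden_dim
    # multiply-adds each, per layer.
    return num_layers * 4 * seq_len ** 2 * hidden_dim

doc_chars = 512                  # the 0.5KB document from the text
char_tokens = doc_chars          # one token per character
bpe_tokens = doc_chars // 4      # ~4 characters per English BPE token (assumed)

char_cost = attention_score_flops(char_tokens, hidden_dim=768, num_layers=32)
bpe_cost = attention_score_flops(bpe_tokens, hidden_dim=768, num_layers=32)
print(f"character tokens: {char_tokens}, BPE tokens: {bpe_tokens}")
print(f"attention-score compute ratio: {char_cost / bpe_cost:.0f}x")  # ~16x
```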
The trade-off shifts with context length. For models with a 4,096-token maximum length, a 30,000 to 50,000 vocabulary keeps most documents under the limit. For very long context models with 128,000-token windows, teams sometimes push vocabularies to 100,000 to compress sequences further and keep more content within the window. However, larger vocabularies suffer from sparser statistics during training: rare tokens may appear only hundreds of times in the training corpus, leading to poor embeddings and higher perplexity on tail vocabulary.
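One way to gauge this before committing to a vocabulary is to tokenize a corpus sample and measure the tail; a sketch where `corpus_token_ids` and `vocab_size` are hypothetical inputs supplied by the caller:

```python
# Estimate how thinly a training-corpus sample covers the vocabulary tail.
from collections import Counter

def tail_fraction(corpus_token_ids, vocab_size, min_count=100):
    """Fraction of the vocabulary seen fewer than `min_count` times."""
    counts = Counter(corpus_token_ids)
    rare = sum(1 for token_id in range(vocab_size) if counts[token_id] < min_count)
    return rare / vocab_size

# A 100K vocabulary will typically show a much larger rare-token fraction
# than a 32K vocabulary on the same sample, e.g.:
# print(tail_fraction(sample_ids, vocab_size=100_000))
```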
💡 Key Takeaways
•Vocabulary size directly impacts model size. A 50K vocabulary with 768 dimensions uses about 77MB for embeddings plus another 77MB for the output projection in float16. A 32K vocabulary saves roughly 55MB total per model copy.
•Sequence length dominates transformer cost. Reducing a 512-character (0.5KB) text from 512 character tokens to 120 to 150 subword tokens cuts attention compute by roughly 12x to 18x across all layers.
•Larger vocabularies compress sequences better but create sparse statistics. Rare tokens in a 100K vocabulary may appear only hundreds of times in training, leading to poor embeddings and higher perplexity.
•Context window trade-offs change the calculus. For 128K token context models, larger vocabularies (up to 100K) help fit more content within the window, justifying the sparsity and size costs.
•Multilingual models benefit from larger vocabularies. Meta LLaMA uses a 32K-token vocabulary to balance coverage across many languages. Pure English models can use smaller vocabularies (20K to 30K) without losing compression.
📌 Examples
Google BERT uses a 30K WordPiece vocabulary for English, balancing embedding size (23M parameters at a hidden dimension of 768) against sequence length for 512-token max inputs. This keeps embeddings under 50MB in float16.
OpenAI GPT-3 uses a 50K byte-level BPE vocabulary. For a 512-character (0.5KB) document, this yields roughly 120 to 130 tokens versus 512 character-level tokens. With 96 attention layers, the quadratic savings (4x length reduction = 16x compute reduction per layer) justify the larger embedding table.
Meta LLaMA 2 uses a 32K multilingual vocabulary. This increases embedding size versus an English-only 20K vocabulary, but reduces out-of-vocabulary issues and sequence length for non-whitespace languages like Chinese, improving overall efficiency across 20+ languages.
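A quick way to see these differences in practice is to run the same text through publicly available tokenizers; a sketch assuming the Hugging Face `transformers` package and the public `bert-base-uncased` and `gpt2` checkpoints (GPT-2's tokenizer is the same 50K byte-level BPE family used by GPT-3):

```python
# Compare vocabulary size and sequence length for the same text under
# a 30K WordPiece tokenizer (BERT) and a 50K byte-level BPE tokenizer (GPT-2).
from transformers import AutoTokenizer

text = "Vocabulary size is a fundamental design choice that affects sequence length."

for name in ("bert-base-uncased", "gpt2"):
    tok = AutoTokenizer.from_pretrained(name)
    tokens = tok.tokenize(text)
    print(f"{name:20s} vocab={len(tok):>6,}  tokens={len(tokens):3d}  chars={len(text)}")
```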