What is Tokenization and Why Does It Matter?
Tokenization converts raw text into discrete units that machine learning models can process. Rather than feeding models raw characters or entire words, we break text into tokens, which can be words, subwords, characters, or bytes depending on the scheme. This seemingly simple step is critical because it determines what patterns the model can learn and how efficiently it processes language.
Modern systems overwhelmingly use subword tokenization methods like Byte Pair Encoding (BPE), WordPiece, and Unigram. These methods strike a balance between vocabulary size and coverage. Google BERT uses WordPiece with roughly 30,000 tokens. OpenAI GPT models use byte-level BPE with about 50,000 tokens. Meta LLaMA uses a 32,000-token vocabulary optimized for multilingual content. The vocabulary is learned from training data and then frozen, becoming a versioned artifact that ships with the model.
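To make the differences concrete, here is a minimal sketch, assuming the Hugging Face transformers library is available (the article does not prescribe any particular toolkit), that loads the BERT WordPiece and GPT-2 byte-level BPE tokenizers and compares their output and vocabulary sizes:

```python
# Minimal sketch; assumes the `transformers` package is installed and the
# pretrained tokenizer files can be downloaded.
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece, ~30K vocab
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")               # byte-level BPE, ~50K vocab

text = "Tokenization converts raw text into discrete units."
print(bert_tok.tokenize(text))       # WordPiece pieces; continuations are marked with '##'
print(gpt2_tok.tokenize(text))       # BPE pieces; leading spaces are encoded as 'Ġ'
print(len(bert_tok), len(gpt2_tok))  # vocabulary sizes, roughly 30K and 50K
```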
Why not just use words? Word tokenization leads to massive vocabularies (hundreds of thousands of entries) and out-of-vocabulary (OOV) failures when encountering new terms. Why not characters? Character tokenization eliminates OOV but creates extremely long sequences. Since transformer attention cost scales roughly quadratically with sequence length, a 100-word sentence becomes 500+ character tokens instead of roughly 120 subword tokens: about 4x the sequence length, which with quadratic attention means on the order of 16x the compute for the same content.
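A back-of-the-envelope calculation makes the gap explicit. The numbers below are illustrative assumptions (average word length, tokens per word), not measurements:

```python
# Illustrative arithmetic only; real token counts depend on the text and tokenizer.
words = 100
char_tokens = words * 5              # ~5 characters per English word -> ~500 tokens
subword_tokens = round(words * 1.2)  # ~1.2 subword tokens per word  -> ~120 tokens

length_ratio = char_tokens / subword_tokens  # ~4.2x longer sequences
attention_ratio = length_ratio ** 2          # quadratic attention -> ~17x more compute
print(f"{length_ratio:.1f}x longer, ~{attention_ratio:.0f}x more attention compute")
```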
The tokenizer is part of the model contract. A string must map to the same token sequence in training, indexing, and inference. Any mismatch causes quality degradation or breaks tasks like named entity recognition that depend on span alignment. This is why production systems version tokenizers like code and refuse to serve when versions mismatch.
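One way to enforce that contract is to fingerprint the vocabulary and merge rules at training time and refuse to serve when the fingerprint at load time differs. The sketch below illustrates the idea; the function names and recorded value are hypothetical, not taken from any particular production system:

```python
import hashlib
import json

def tokenizer_fingerprint(vocab: dict[str, int], merges: list[str]) -> str:
    """Stable hash of the vocabulary and merge rules; any drift changes it."""
    payload = json.dumps({"vocab": vocab, "merges": merges}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Hypothetical value recorded alongside the model checkpoint at training time.
EXPECTED_FINGERPRINT = "recorded-at-training-time"

def check_tokenizer_contract(vocab: dict[str, int], merges: list[str]) -> None:
    actual = tokenizer_fingerprint(vocab, merges)
    if actual != EXPECTED_FINGERPRINT:
        # Failing fast beats silently serving with train/serve skew.
        raise RuntimeError(f"Tokenizer mismatch: {actual} != {EXPECTED_FINGERPRINT}")
```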
💡 Key Takeaways
• Subword tokenization balances vocabulary size and sequence length. BERT uses 30K tokens and GPT uses 50K, keeping vocabularies tractable while avoiding out-of-vocabulary failures.
• Tokenizers are versioned artifacts frozen at training time. The same vocabulary and merge rules must be used everywhere to prevent train-serve skew and quality degradation.
• Sequence length impacts compute quadratically in transformers. A 100-word passage is roughly 120 to 150 subword tokens versus 500+ character tokens, so subword tokenization cuts sequence length by about 4x and attention compute by an order of magnitude.
• The tokenizer is part of the model contract. Any drift between indexing, training, and inference breaks correctness for tasks that require span alignment, such as named entity recognition.
📌 Examples
Google BERT uses WordPiece with 30,000 tokens for English, splitting "unhappiness" into ["un", "##hap", "##pi", "##ness"] to handle morphology without exploding vocabulary size.
OpenAI GPT models use byte-level BPE with 50,000 tokens, preserving exact Unicode including emoji and code. The word "tokenization" might become ["token", "ization"] or ["tok", "en", "ization"] depending on training data frequency.
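That "depending on training data frequency" is the core of BPE: merges are learned greedily from pair counts over a corpus. The toy sketch below shows the training loop on a made-up word-level corpus with an end-of-word marker; it is a simplification, not the byte-level variant GPT models actually use:

```python
from collections import Counter

def learn_bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Toy BPE training: repeatedly merge the most frequent adjacent symbol pair."""
    # Start with each word split into characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with a single merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Frequent pairs merge first, so common stems and suffixes become single tokens
# while rare words stay split into smaller pieces.
print(learn_bpe_merges(["low", "low", "lower", "newest", "newest", "widest"], 5))
```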
A system processing 100K documents per minute at 2KB each generates 250K tokens per second. With 100K tokens per second per core throughput, you need 3 cores just for tokenization to keep up with ingestion.
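Working through those figures (the per-core throughput is the article's assumption, not a benchmark):

```python
import math

# Capacity estimate from the figures above; real tokenizer throughput varies
# widely with implementation, vocabulary size, and hardware.
docs_per_minute = 100_000
tokens_per_doc = 150        # implied by 250K tokens/s over 100K 2KB docs/min
core_throughput = 100_000   # tokens per second per core (assumed)

tokens_per_second = docs_per_minute / 60 * tokens_per_doc      # = 250,000
cores_needed = math.ceil(tokens_per_second / core_throughput)  # = 3
print(f"{tokens_per_second:,.0f} tokens/s -> {cores_needed} cores for tokenization")
```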