What is Tokenization and Why Does It Matter?
Why Not Just Split on Spaces
The naive approach: split "I love cats" into ["I", "love", "cats"] and assign each word an ID. This fails for three reasons. First, vocabulary explodes. English has 170,000+ words, plus misspellings, slang, and technical terms. A word-level vocabulary easily exceeds 1 million entries.
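The naive approach can be sketched in a few lines. The vocabulary here is hypothetical; a real one would need an entry for every surface form.

```python
# Minimal sketch of word-level tokenization with a toy, hypothetical vocabulary.
vocab = {"I": 0, "love": 1, "cats": 2}

def word_tokenize(text: str) -> list[int]:
    # Split on whitespace and map each word to its integer ID.
    return [vocab[word] for word in text.split()]

print(word_tokenize("I love cats"))  # [0, 1, 2]
```

Every distinct surface form ("love", "loves", "loved", "Love") needs its own entry, which is why the vocabulary grows without bound.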
Second, out-of-vocabulary (OOV) words break the system. If "cryptocurrency" is not in your vocabulary, the model cannot process it. You could add an [UNK] token for unknowns, but then the model loses all information about that word.
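The [UNK] fallback looks like this in a sketch (same toy vocabulary as above, with an [UNK] entry added). Note how every unseen word collapses to the same ID:

```python
# Sketch of the [UNK] fallback: any unknown word maps to one shared ID,
# so the model cannot distinguish "cryptocurrency" from any other unseen word.
vocab = {"I": 0, "love": 1, "cats": 2, "[UNK]": 3}

def word_tokenize(text: str) -> list[int]:
    # dict.get with a default replaces KeyError with the [UNK] ID.
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]

print(word_tokenize("I love cryptocurrency"))  # [0, 1, 3]
print(word_tokenize("I love blockchain"))      # [0, 1, 3]
```

Both sentences produce identical IDs, which is exactly the information loss described above.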
Third, morphologically rich and compounding languages like Turkish or German produce word forms that word-level tokenization cannot cover. "Donaudampfschifffahrtsgesellschaft" (Danube steamship company) would require a vocabulary entry for every possible compound.
Subword Tokenization
Modern tokenizers split text into subword units. "Unhappiness" becomes ["un", "happiness"] or ["un", "hap", "pi", "ness"] depending on the algorithm. Common words stay whole; rare words decompose into smaller pieces.
This keeps vocabulary manageable (32,000-100,000 tokens) while handling any input. Even completely novel words decompose into known subwords. "Cryptocurrency" becomes ["crypto", "currency"] and the model can infer meaning from the parts.
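The decomposition idea can be illustrated with a greedy longest-match tokenizer over a toy subword vocabulary. This is a simplification for illustration, not a real BPE or WordPiece implementation, and the vocabulary is invented:

```python
# Illustrative greedy longest-match subword tokenizer (toy vocabulary;
# real algorithms like BPE or WordPiece learn their vocabularies from data).
def subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    tokens, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, shrinking until a match.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No subword matched: fall back to the single character.
            tokens.append(word[start])
            start += 1
    return tokens

vocab = {"un", "happiness", "crypto", "currency"}
print(subword_tokenize("unhappiness", vocab))     # ['un', 'happiness']
print(subword_tokenize("cryptocurrency", vocab))  # ['crypto', 'currency']
```

The single-character fallback is what guarantees any input can be tokenized: in real tokenizers the base vocabulary includes every byte or character, so decomposition never fails outright.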
Common Algorithms