Natural Language Processing Systems | Tokenization & Preprocessing | Easy | ⏱️ ~2 min

What is Tokenization and Why Does It Matter?

Definition
Tokenization is the process of splitting text into smaller units (tokens) that a model can process. A model cannot read raw text. It needs text converted into numerical IDs from a fixed vocabulary. Tokenization bridges the gap between human-readable text and model-processable numbers.
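The text-to-IDs pipeline can be sketched with a toy vocabulary (hypothetical, far smaller than any real tokenizer's):

```python
# Minimal sketch of tokenization: text -> tokens -> numerical IDs.
# The vocabulary here is a toy example, not a real tokenizer's.
vocab = {"i": 0, "love": 1, "cats": 2}
id_to_token = {i: t for t, i in vocab.items()}

text = "i love cats"
tokens = text.split()             # split text into tokens
ids = [vocab[t] for t in tokens]  # map each token to its fixed vocabulary ID
print(ids)                        # [0, 1, 2]

# IDs round-trip back to the original tokens via the inverse mapping.
print(" ".join(id_to_token[i] for i in ids))  # i love cats
```

Real tokenizers follow the same encode/decode contract, just with subword splitting instead of `str.split` and vocabularies of tens of thousands of entries.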

Why Not Just Split on Spaces

The naive approach: split "I love cats" into ["I", "love", "cats"] and assign each word an ID. This fails for three reasons. First, vocabulary explodes. English has 170,000+ words, plus misspellings, slang, and technical terms. A word-level vocabulary easily exceeds 1 million entries.

Second, out-of-vocabulary (OOV) words break the system. If "cryptocurrency" is not in your vocabulary, the model cannot process it. You could add an [UNK] token for unknowns, but then the model loses all information about that word.

Third, morphologically rich languages like German or Turkish create compound words that word-level tokenization cannot handle. "Donaudampfschifffahrtsgesellschaft" (Danube steamship company) would require a vocabulary entry for every possible compound.
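The OOV failure from the second point can be sketched with a toy word-level tokenizer (hypothetical vocabulary):

```python
# Sketch of the OOV problem in word-level tokenization (toy vocabulary).
vocab = {"[UNK]": 0, "the": 1, "price": 2, "of": 3, "rose": 4}

def encode(text):
    # Any word missing from the vocabulary collapses to the [UNK] ID,
    # discarding everything the model could have known about that word.
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

print(encode("the price of cryptocurrency rose"))  # [1, 2, 3, 0, 4]
```

"cryptocurrency" maps to ID 0, indistinguishable from any other unknown word.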

Subword Tokenization

Modern tokenizers split text into subword units. "Unhappiness" becomes ["un", "happiness"] or ["un", "hap", "pi", "ness"] depending on the algorithm. Common words stay whole; rare words decompose into smaller pieces.

This keeps vocabulary manageable (32,000-100,000 tokens) while handling any input. Even completely novel words decompose into known subwords. "Cryptocurrency" becomes ["crypto", "currency"] and the model can infer meaning from the parts.
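This decomposition can be sketched as greedy longest-match segmentation, in the spirit of WordPiece inference (simplified; real tokenizers add continuation markers like '##', and the subword set here is a hypothetical toy):

```python
# Greedy longest-match subword segmentation over a toy subword vocabulary.
subwords = {"un", "happiness", "crypto", "currency", "cats"}

def segment(word):
    pieces, start = [], 0
    while start < len(word):
        # Take the longest vocabulary subword matching at this position.
        for end in range(len(word), start, -1):
            if word[start:end] in subwords:
                pieces.append(word[start:end])
                start = end
                break
        else:
            return ["[UNK]"]  # no subword matches at all: fall back to unknown
    return pieces

print(segment("unhappiness"))     # ['un', 'happiness']
print(segment("cryptocurrency"))  # ['crypto', 'currency']
```

Because a real vocabulary includes every single character as a subword, the `[UNK]` fallback is rarely reached: any input can be segmented, at worst character by character.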

Common Algorithms

💡 Three Main Approaches: BPE (Byte Pair Encoding) merges frequent character pairs iteratively. WordPiece chooses merges that maximize training-data likelihood. SentencePiece operates on raw text without pre-tokenization, handling any language out of the box.
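The BPE merge loop can be sketched as follows (a simplified training sketch on a toy corpus; real implementations weight words by frequency and persist the merge table):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    seqs = [list(w) for w in words]  # start from character-level symbols
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace each occurrence of the best pair with one merged symbol.
        merged = best[0] + best[1]
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return merges, seqs

merges, seqs = bpe_merges(["lower", "lowest", "low"], num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

On this toy corpus the shared stem "low" emerges after two merges, which is exactly why frequent words end up as whole tokens while rare suffixes stay as smaller pieces.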
💡 Key Takeaways
- Tokenization converts human text into numerical IDs from a fixed vocabulary that models can process
- Word-level tokenization fails: the vocabulary explodes (1M+ entries), OOV words lose all information, and compounds cannot be handled
- Subword tokenization splits rare words into pieces while keeping common words whole
- Typical vocabulary sizes are 32,000-100,000 tokens, covering any possible input
- Main algorithms: BPE (merge frequent pairs), WordPiece (likelihood maximization), SentencePiece (raw text)
📌 Interview Tips
1. Explain why word-level tokenization fails: vocabulary explosion, OOV words, compound words
2. Show a subword example: 'cryptocurrency' becomes ['crypto', 'currency'], so the model can infer meaning
3. Mention vocabulary sizes: 32K for efficient models, 100K for multilingual coverage