What is Tokenization and Why Does It Matter?
Tokenization converts raw text into discrete units that machine learning models can process. Rather than feeding models raw characters or entire words, we break text into tokens, which can be words, subwords, characters, or bytes depending on the scheme. This seemingly simple step is critical because it determines what patterns the model can learn and how efficiently it processes language.
Modern systems overwhelmingly use subword tokenization methods like Byte Pair Encoding (BPE), WordPiece, and Unigram. These methods strike a balance between vocabulary size and coverage. Google BERT uses WordPiece with roughly 30,000 tokens. OpenAI GPT models use byte-level BPE with about 50,000 tokens. Meta LLaMA uses a 32,000-token vocabulary optimized for multilingual content. The vocabulary is learned from training data and then frozen, becoming a versioned artifact that ships with the model.
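To make the differences concrete, here is a minimal sketch, assuming the Hugging Face transformers library is available (the article does not prescribe any particular toolkit), that loads the BERT WordPiece and GPT-2 byte-level BPE tokenizers and compares their output and vocabulary sizes:

```python
# Minimal sketch; assumes the `transformers` package is installed and the
# pretrained tokenizer files can be downloaded.
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece, ~30K vocab
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")               # byte-level BPE, ~50K vocab

text = "Tokenization converts raw text into discrete units."
print(bert_tok.tokenize(text))       # WordPiece pieces; continuations are marked with '##'
print(gpt2_tok.tokenize(text))       # BPE pieces; leading spaces are encoded as 'Ġ'
print(len(bert_tok), len(gpt2_tok))  # vocabulary sizes, roughly 30K and 50K
```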
Why not just use words? Word tokenization leads to massive vocabularies (hundreds of thousands of entries) and out-of-vocabulary (OOV) failures when encountering new terms. Why not characters? Character tokenization eliminates OOV but creates extremely long sequences. Since transformer attention cost scales roughly quadratically with sequence length, a 100-word sentence becomes 500+ character tokens instead of roughly 120 subword tokens: about 4x the sequence length, which with quadratic attention means on the order of 16x the compute for the same content.
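A back-of-the-envelope calculation makes the gap explicit. The numbers below are illustrative assumptions (average word length, tokens per word), not measurements:

```python
# Illustrative arithmetic only; real token counts depend on the text and tokenizer.
words = 100
char_tokens = words * 5              # ~5 characters per English word -> ~500 tokens
subword_tokens = round(words * 1.2)  # ~1.2 subword tokens per word  -> ~120 tokens

length_ratio = char_tokens / subword_tokens  # ~4.2x longer sequences
attention_ratio = length_ratio ** 2          # quadratic attention -> ~17x more compute
print(f"{length_ratio:.1f}x longer, ~{attention_ratio:.0f}x more attention compute")
```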
The tokenizer is part of the model contract. A string must map to the same token sequence in training, indexing, and inference. Any mismatch causes quality degradation or breaks tasks like named entity recognition that depend on span alignment. This is why production systems version tokenizers like code and refuse to serve when versions mismatch.
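One way to enforce that contract is to fingerprint the vocabulary and merge rules at training time and refuse to serve when the fingerprint at load time differs. The sketch below illustrates the idea; the function names and recorded value are hypothetical, not taken from any particular production system:

```python
import hashlib
import json

def tokenizer_fingerprint(vocab: dict[str, int], merges: list[str]) -> str:
    """Stable hash of the vocabulary and merge rules; any drift changes it."""
    payload = json.dumps({"vocab": vocab, "merges": merges}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Hypothetical value recorded alongside the model checkpoint at training time.
EXPECTED_FINGERPRINT = "recorded-at-training-time"

def check_tokenizer_contract(vocab: dict[str, int], merges: list[str]) -> None:
    actual = tokenizer_fingerprint(vocab, merges)
    if actual != EXPECTED_FINGERPRINT:
        # Failing fast beats silently serving with train/serve skew.
        raise RuntimeError(f"Tokenizer mismatch: {actual} != {EXPECTED_FINGERPRINT}")
```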
💡 Key Takeaways
• Subword tokenization balances vocabulary size and sequence length. BERT uses 30K tokens and GPT uses 50K, keeping vocabularies tractable while avoiding out-of-vocabulary failures.
• Tokenizers are versioned artifacts frozen at training time. The same vocabulary and merge rules must be used everywhere to prevent train-serve skew and quality degradation.
• Sequence length impacts compute quadratically in transformers. A 100-word passage is roughly 120 to 150 subword tokens versus 500+ character tokens, so subword tokenization cuts sequence length by about 4x and attention compute by an order of magnitude.
• The tokenizer is part of the model contract. Any drift between indexing, training, and inference breaks correctness for tasks that require span alignment, such as named entity recognition.
📌 Examples
Google BERT uses WordPiece with 30,000 tokens for English, splitting "unhappiness" into ["un", "##hap", "##pi", "##ness"] to handle morphology without exploding vocabulary size.
OpenAI GPT models use byte-level BPE with 50,000 tokens, preserving exact Unicode including emoji and code. The word "tokenization" might become ["token", "ization"] or ["tok", "en", "ization"] depending on training data frequency.
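That "depending on training data frequency" is the core of BPE: merges are learned greedily from pair counts over a corpus. The toy sketch below shows the training loop on a made-up word-level corpus with an end-of-word marker; it is a simplification, not the byte-level variant GPT models actually use:

```python
from collections import Counter

def learn_bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Toy BPE training: repeatedly merge the most frequent adjacent symbol pair."""
    # Start with each word split into characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with a single merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Frequent pairs merge first, so common stems and suffixes become single tokens
# while rare words stay split into smaller pieces.
print(learn_bpe_merges(["low", "low", "lower", "newest", "newest", "widest"], 5))
```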
A system processing 100K documents per minute at 2KB each generates 250K tokens per second. With 100K tokens per second per core throughput, you need 3 cores just for tokenization to keep up with ingestion.
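Working through those figures (the per-core throughput is the article's assumption, not a benchmark):

```python
import math

# Capacity estimate from the figures above; real tokenizer throughput varies
# widely with implementation, vocabulary size, and hardware.
docs_per_minute = 100_000
tokens_per_doc = 150        # implied by 250K tokens/s over 100K 2KB docs/min
core_throughput = 100_000   # tokens per second per core (assumed)

tokens_per_second = docs_per_minute / 60 * tokens_per_doc      # = 250,000
cores_needed = math.ceil(tokens_per_second / core_throughput)  # = 3
print(f"{tokens_per_second:,.0f} tokens/s -> {cores_needed} cores for tokenization")
```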