Preprocessing Pipeline: Normalization and Text Cleaning
Preprocessing prepares raw text before tokenization: Unicode normalization, case handling, whitespace normalization, and the treatment of punctuation, numbers, and special symbols. The golden rule is to keep preprocessing minimal and deterministic. Every normalization choice must exactly match how the tokenizer was trained, because even tiny differences change token boundaries and produce different token IDs.
Unicode normalization is critical because the same visible character can have multiple underlying representations. The letter "é" can be a single precomposed code point (U+00E9) or a base "e" plus a combining acute accent (U+0065 U+0301). Normalization Form C (NFC) composes characters, while NFKC additionally applies compatibility mappings that fold ligatures, width variants, and similar presentation forms. Switching from NFC to NFKC on a trained model causes token mismatches, and family emoji built from zero-width joiners can split unpredictably if normalization is inconsistent.
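As a concrete illustration, here is a minimal sketch using Python's standard unicodedata module; the example strings are illustrative and not drawn from any particular tokenizer's training data.

```python
# A minimal sketch using Python's standard unicodedata module. Visually
# identical text can differ at the code point level, and NFC and NFKC
# disagree on compatibility characters.
import unicodedata

precomposed = "caf\u00e9"      # "café" with é as a single code point (U+00E9)
decomposed = "cafe\u0301"      # "café" as base "e" plus combining acute accent (U+0301)

print(precomposed == decomposed)                                # False: different code points
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True: NFC composes the accent

# NFKC additionally folds compatibility characters, e.g. the "fi" ligature (U+FB01).
ligature = "\ufb01le"          # "ﬁle" written with a ligature
print(unicodedata.normalize("NFC", ligature))   # 'ﬁle'  -> ligature preserved
print(unicodedata.normalize("NFKC", ligature))  # 'file' -> ligature expanded to "f" + "i"

# A tokenizer trained on NFC text sees 'ﬁle' and 'file' as different strings,
# so changing the normalization form at serving time changes the token IDs.
```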
Modern deep learning models avoid the aggressive preprocessing that classical systems used. Stemming ("running" to "run"), lemmatization, and stopword removal destroy signal and break alignment with pretraining. BERT- and GPT-style models learn from exact strings and latent morphology, so preprocessing is deliberately minimal: a consistent Unicode form, whitespace collapsed to single spaces, and punctuation and case preserved. This maximizes fidelity but requires larger training data to cover surface variations.
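A minimal sketch of such a pipeline is shown below; the function name, the chosen Unicode form (NFC), and the whitespace rule are assumptions for illustration, not any specific model's published preprocessing.

```python
# A minimal, deterministic preprocessing sketch: fix the Unicode form,
# collapse whitespace, and leave case and punctuation alone. Names here
# are illustrative, not from any library.
import re
import unicodedata

UNICODE_FORM = "NFC"                # assumed: must match the form used at tokenizer training time
_WHITESPACE_RE = re.compile(r"\s+")

def preprocess(text: str) -> str:
    """Apply the frozen, minimal normalization expected by the tokenizer."""
    text = unicodedata.normalize(UNICODE_FORM, text)
    text = _WHITESPACE_RE.sub(" ", text).strip()   # collapse runs of whitespace to one space
    return text                                     # case, punctuation, and digits untouched

print(preprocess("  Apple\u00a0 announced\n a  new   iPhone. "))
# -> "Apple announced a new iPhone."
```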
The trade-off is between recall and precision. Lowercasing "Apple" and "apple" improves recall for bag-of-words search but harms named entity recognition, where case distinguishes the company from the fruit. Removing punctuation helps count-based models but breaks code understanding, where syntax matters. Production systems freeze preprocessing alongside the tokenizer and enforce exact version matching through content hashing of the normalization rules.
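A sketch of that versioning idea follows; the configuration keys, hash scheme, and override flag are assumptions for illustration, not any particular serving system's schema.

```python
# Hash the normalization configuration and refuse to serve if it differs
# from the hash stored with the tokenizer artifacts. All names here are
# hypothetical.
import hashlib
import json

NORMALIZATION_CONFIG = {
    "unicode_form": "NFC",
    "collapse_whitespace": True,
    "lowercase": False,
    "strip_accents": False,
}

def config_hash(config: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so the hash is stable across runs.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def check_compatibility(expected_hash: str, allow_override: bool = False) -> None:
    actual = config_hash(NORMALIZATION_CONFIG)
    if actual != expected_hash and not allow_override:
        raise RuntimeError(
            f"Preprocessing mismatch: expected {expected_hash[:12]}..., got {actual[:12]}...; "
            "refusing to serve (pass allow_override=True only for planned migrations)."
        )

# At serving time, expected_hash would be loaded from the tokenizer artifacts.
check_compatibility(expected_hash=config_hash(NORMALIZATION_CONFIG))
```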
💡 Key Takeaways
• Unicode normalization must be frozen and consistent. Switching from NFC to NFKC, or handling zero-width joiners differently, changes token boundaries, causing train-serve skew and quality drops.
• Modern deep models prefer minimal preprocessing. Aggressive stemming, lemmatization, and stopword removal helped classical linear models but degrade transformers, which learn from exact strings and morphology.
• Case handling is a critical trade-off. Lowercasing improves recall for search ("Apple" matches "apple") but breaks named entity recognition and code modeling, where case carries meaning.
• Preprocessing versioning prevents drift. Store a content hash of the normalization rules with the tokenizer artifacts and refuse to serve when versions mismatch, unless explicitly overridden for a migration.
📌 Examples
Google Search BERT re-rankers use consistent WordPiece preprocessing between offline index building and online query scoring. A mismatch where documents are lowercased but queries are not causes query-document mismatch and a recall loss of 5 to 10 percent.
OpenAI GPT models use minimal normalization with byte-level BPE to preserve fidelity across multilingual content and code. This avoids destroying signal but requires training on 300B+ tokens to cover variations like "it's", "its", and "it s".
A production system that normalized "café" to NFC during training but received decomposed (NFD-style) inputs at serving saw the acute accent arrive as a separate combining character, changing "café" from 2 tokens ["ca", "fé"] to 3 tokens ["ca", "f", "é"] (the last carrying the base letter plus the combining accent), breaking downstream span alignment for highlighting.
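The offset shift behind that breakage is easy to reproduce; the minimal sketch below uses only the standard library, and the example string and span are invented for illustration.

```python
# Why span alignment breaks: the same visible text has different lengths
# and character offsets in composed (NFC) vs decomposed (NFD) form, so
# spans computed on one form do not line up with the other.
import unicodedata

nfc = unicodedata.normalize("NFC", "café menu")   # é is one code point
nfd = unicodedata.normalize("NFD", "café menu")   # é is "e" plus a combining accent

print(len(nfc), len(nfd))        # 9 10 -> offsets after "café" shift by one
span = (5, 9)                    # character span of "menu" in the NFC string
print(nfc[span[0]:span[1]])      # 'menu'
print(nfd[span[0]:span[1]])      # ' men' -> the same span highlights the wrong text
```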