
Preprocessing Pipeline: Normalization and Text Cleaning

Why Preprocessing Matters

Identical text can have different byte representations. "Café" could be encoded as a single character (é) or as two characters (e followed by combining accent). Without normalization, these appear as different tokens to the model, fragmenting the vocabulary and reducing model quality.

Preprocessing standardizes text before tokenization. The goal: ensure semantically identical inputs produce identical token sequences. This affects both training data quality and inference consistency.
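The "Café" example can be reproduced directly with Python's standard `unicodedata` module; the two spellings below are the composed and decomposed forms mentioned above:

```python
import unicodedata

nfc = "Caf\u00e9"    # "Café" with precomposed é (single codepoint U+00E9)
nfd = "Cafe\u0301"   # "Café" as e + combining acute accent (U+0301)

print(nfc == nfd)        # False: same glyphs, different codepoints
print(len(nfc), len(nfd))  # 4 5
print(unicodedata.normalize("NFC", nfd) == nfc)  # True after normalization
```

Both strings render identically on screen, yet without normalization a tokenizer would assign them different token sequences.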

Unicode Normalization

NFC (Composed): Combines base character with accent into single codepoint. é becomes one character. Most web content uses NFC. This is the standard choice for tokenization.

NFD (Decomposed): Separates base character and accent into distinct codepoints. é becomes e + combining acute accent. Useful when you need to strip accents for search applications.

NFKC/NFKD: Compatibility normalization. Converts fullwidth characters to ASCII equivalents, ligatures to separate letters, circled numbers to regular digits. Essential for CJK text processing.
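A minimal sketch of the three behaviors described above, using the standard `unicodedata` module (the `strip_accents` helper name is illustrative, not a library function):

```python
import unicodedata

def strip_accents(text: str) -> str:
    """NFD-decompose, then drop combining marks (Unicode category Mn)."""
    return "".join(c for c in unicodedata.normalize("NFD", text)
                   if unicodedata.category(c) != "Mn")

# NFKC folds compatibility characters: fullwidth letters, ligatures, circled digits
print(unicodedata.normalize("NFKC", "Ｈｅｌｌｏ"))  # "Hello"
print(unicodedata.normalize("NFKC", "ﬁle ①"))      # "file 1"

# NFD enables accent stripping for search applications
print(strip_accents("Café naïve"))                  # "Cafe naive"
```

Note that NFKC is lossy by design (the fullwidth/ligature distinction is discarded), which is exactly why it helps for CJK and mixed-script text but should not be applied where those distinctions carry meaning.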

Text Cleaning Steps

Lowercasing: Reduces vocabulary size by 30-40% but loses information. "Apple" (company) and "apple" (fruit) become identical. Use for search and classification; avoid for generation tasks.

Whitespace normalization: Multiple spaces, tabs, and newlines collapse to single spaces. Invisible Unicode spaces (non-breaking, zero-width) convert to regular spaces.
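A minimal whitespace normalizer along these lines (the explicit mapping table and its space-for-zero-width policy are one reasonable choice, not a fixed standard):

```python
import re

# Invisible Unicode spaces mapped to a plain space, per the policy above.
# NBSP (U+00A0) and ZWSP (U+200B); real pipelines handle more codepoints.
INVISIBLE_SPACES = {"\u00a0": " ", "\u200b": " "}

def normalize_whitespace(text: str) -> str:
    for ch, repl in INVISIBLE_SPACES.items():
        text = text.replace(ch, repl)
    # Collapse any run of spaces, tabs, and newlines to a single space
    return re.sub(r"\s+", " ", text).strip()
```

Python's `\s` already matches NBSP in Unicode mode, but zero-width characters are format characters, not whitespace, so they must be handled explicitly.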

Special character handling: HTML entities are decoded (&amp;amp; → &). URLs and emails are optionally replaced with [URL] and [EMAIL] placeholder tokens to prevent vocabulary pollution.
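Both steps are short in practice; `html.unescape` is the standard-library entity decoder, while the URL and email regexes below are deliberately simple illustrations (production systems use stricter patterns):

```python
import html
import re

URL_RE = re.compile(r"https?://\S+")            # simplistic; misses bare domains
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def clean_special(text: str) -> str:
    text = html.unescape(text)        # "&amp;" -> "&", "&lt;" -> "<", "&#64;" -> "@"
    text = URL_RE.sub("[URL]", text)
    text = EMAIL_RE.sub("[EMAIL]", text)
    return text

print(clean_special("Q&amp;A at https://example.com or bob@example.com"))
# Q&A at [URL] or [EMAIL]
```

Decoding entities before the URL/email pass matters: an address written as `bob&#64;example.com` only matches the email pattern after unescaping.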

Order Matters

⚠️ Pipeline Order: Unicode normalization → HTML decoding → whitespace normalization → lowercasing (optional) → tokenization. Wrong order causes subtle bugs: lowercasing before normalization may miss certain Unicode uppercase variants.
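The ordered pipeline above can be sketched as a single function; this is a minimal composition of the stated steps, with tokenization assumed to happen downstream:

```python
import html
import re
import unicodedata

def preprocess(text: str, lowercase: bool = False) -> str:
    text = unicodedata.normalize("NFC", text)   # 1. Unicode normalization
    text = html.unescape(text)                  # 2. HTML entity decoding
    text = re.sub(r"\s+", " ", text).strip()    # 3. whitespace normalization
    if lowercase:
        # 4. optional: casefold() handles Unicode variants that lower() misses,
        # e.g. German ß -> "ss"; this is why casing comes after normalization
        text = text.casefold()
    return text
```

Running lowercasing last, on already-normalized text, avoids the class of bugs noted above where an uppercase variant survives because it was folded before its codepoints were normalized.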
💡 Key Takeaways
- Identical text can have different byte representations; normalization ensures consistent tokens
- NFC (composed) is standard for tokenization; NFKC essential for CJK text
- Lowercasing reduces vocabulary 30-40% but loses semantic information
- Pipeline order matters: normalize → decode HTML → whitespace → lowercase → tokenize
- Replace URLs and emails with placeholder tokens to prevent vocabulary pollution
📌 Interview Tips
1. Explain NFC vs NFD: composed puts the accent in one character, decomposed separates it
2. Show the lowercasing trade-off: reduces vocabulary but "Apple" (company) = "apple" (fruit)
3. Mention pipeline order: the wrong order causes subtle bugs with Unicode uppercase variants