Preprocessing Pipeline: Normalization and Text Cleaning
Why Preprocessing Matters
Identical text can have different byte representations. In "Café", the é can be stored as a single precomposed codepoint or as two codepoints (e followed by a combining acute accent). Without normalization, these variants look like different tokens to the model, fragmenting the vocabulary and reducing model quality.
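A minimal sketch with Python's standard unicodedata module shows the two encodings of "Café" diverging at the byte level:

```python
import unicodedata

# Precomposed é (NFC) vs. decomposed e + combining accent (NFD)
nfc = unicodedata.normalize("NFC", "Café")
nfd = unicodedata.normalize("NFD", "Café")

print(nfc == nfd)           # False: different codepoint sequences
print(len(nfc), len(nfd))   # 4 vs. 5 codepoints
print(nfc.encode("utf-8"))  # b'Caf\xc3\xa9'
print(nfd.encode("utf-8"))  # b'Cafe\xcc\x81'
```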
Preprocessing standardizes text before tokenization. The goal: ensure semantically identical inputs produce identical token sequences. This affects both training data quality and inference consistency.
Unicode Normalization
NFC (Composed): Combines base character with accent into single codepoint. é becomes one character. Most web content uses NFC. This is the standard choice for tokenization.
NFD (Decomposed): Separates base character and accent into distinct codepoints. é becomes e + combining acute accent. Useful when you need to strip accents for search applications.
NFKC/NFKD: Compatibility normalization. Converts fullwidth characters to ASCII equivalents, ligatures to separate letters, circled numbers to regular digits. Essential for CJK text processing.
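A small sketch of NFKC's compatibility mappings, again using Python's unicodedata module (the sample strings are only illustrative):

```python
import unicodedata

samples = {
    "fullwidth": "Ｈｅｌｌｏ１２３",  # fullwidth letters and digits
    "ligature": "ﬁle",               # "fi" ligature (U+FB01)
    "circled": "①②③",               # circled numbers
}

for name, text in samples.items():
    print(name, unicodedata.normalize("NFKC", text))
# fullwidth Hello123
# ligature file
# circled 123
```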
Text Cleaning Steps
Lowercasing: Reduces vocabulary size by 30-40% but loses information. "Apple" (company) and "apple" (fruit) become identical. Use for search and classification; avoid for generation tasks.
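For illustration, Python's built-in lower() covers most cases; casefold() is a stricter variant intended for caseless matching:

```python
print("Apple".lower())      # "apple": the company/fruit distinction is lost
print("Straße".casefold())  # "strasse": casefold handles cases lower() misses
```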
Whitespace normalization: Multiple spaces, tabs, and newlines collapse to single spaces. Invisible Unicode spaces (non-breaking, zero-width) convert to regular spaces.
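A sketch of this step, assuming a small hand-picked set of invisible space characters (the list here is illustrative, not exhaustive):

```python
import re

# Map invisible Unicode spaces to regular spaces, then collapse runs of
# whitespace (spaces, tabs, newlines) into single spaces.
INVISIBLE_SPACES = ["\u00a0", "\u200b", "\u202f"]  # NBSP, zero-width space, narrow NBSP

def normalize_whitespace(text: str) -> str:
    for ch in INVISIBLE_SPACES:
        text = text.replace(ch, " ")
    return re.sub(r"\s+", " ", text).strip()

print(normalize_whitespace("Hello\u00a0\u200b  world\t\nagain"))  # Hello world again
```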
Special character handling: HTML entities are decoded (&amp; → &). URLs and email addresses are optionally replaced with [URL] and [EMAIL] tokens to prevent vocabulary pollution.
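A sketch of this step with Python's standard html module; the placeholder names and the URL/email regexes are simplified assumptions, not production patterns:

```python
import html
import re

URL_RE = re.compile(r"https?://\S+")                     # simplified URL pattern
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")   # simplified email pattern

def clean_special(text: str) -> str:
    text = html.unescape(text)        # "&amp;" -> "&", "&lt;" -> "<", ...
    text = URL_RE.sub("[URL]", text)
    text = EMAIL_RE.sub("[EMAIL]", text)
    return text

print(clean_special("Q&amp;A at https://example.com, mail bob@example.com"))
# Q&A at [URL] mail [EMAIL]
```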
Order Matters