
Failure Modes: Train-Serve Skew, Unicode Pitfalls, and Span Alignment

Train-Serve Skew

Training uses one preprocessing pipeline; serving uses another. Training normalizes with NFKC; serving uses NFC. Training lowercases; serving does not. The model learned its patterns on one form of processed text but receives a different form at inference. Accuracy drops 5-15% with no obvious error.
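
As a quick illustration of how the two normalization forms diverge (the example string here is arbitrary): compatibility characters such as the "fi" ligature survive NFC but are folded by NFKC, so the two pipelines feed the tokenizer different strings.

```python
import unicodedata

text = "ﬁle size: ½"  # contains the "fi" ligature (U+FB01) and "½" (U+00BD)

nfkc = unicodedata.normalize("NFKC", text)  # compatibility fold: ligature -> "fi", "½" -> "1⁄2"
nfc = unicodedata.normalize("NFC", text)    # canonical only: ligature and "½" are unchanged

print(nfkc == nfc)  # False -> training and serving would tokenize different strings
print(repr(nfkc), repr(nfc))
```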

Detection: Compare token distributions between training and production. If production generates tokens the model rarely saw during training, you have preprocessing skew. Log the top 100 most frequent tokens in both environments.
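A rough sketch of that check, assuming you can export token logs from both environments as plain lists of strings (the helper name and toy data are illustrative):

```python
from collections import Counter

def top_token_overlap(train_tokens, prod_tokens, k=100):
    """Compare the k most frequent tokens in training vs production logs.

    Low overlap, or tokens that appear only in production, suggests
    preprocessing skew.
    """
    train_top = {tok for tok, _ in Counter(train_tokens).most_common(k)}
    prod_top = {tok for tok, _ in Counter(prod_tokens).most_common(k)}
    overlap = len(train_top & prod_top) / k
    return overlap, sorted(prod_top - train_top)

# Toy example: serving forgot to lowercase.
overlap, only_in_prod = top_token_overlap(
    ["the", "cat", "sat", "the", "mat"],
    ["The", "cat", "sat", "The", "mat"],
    k=3,
)
print(overlap, only_in_prod)  # ~0.67 overlap; ['The'] appears only in production
```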

Fix: Package preprocessing and tokenization as a single library. Training and serving import the same code. Version the tokenizer with the model. Never update preprocessing without retraining.
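A minimal sketch of that packaging, assuming a module both sides import (preprocess_and_tokenize and PIPELINE_VERSION are illustrative names, not a standard API):

```python
# text_pipeline.py -- imported by BOTH the training job and the serving code.
import unicodedata

PIPELINE_VERSION = "1.3.0"  # bump only together with a retrain; log it next to the model version

def preprocess(text: str) -> str:
    """The one and only normalization path."""
    text = unicodedata.normalize("NFKC", text)
    return text.lower()

def preprocess_and_tokenize(text: str, tokenizer):
    """Single entry point so training and serving cannot drift apart."""
    return tokenizer(preprocess(text))
```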

Unicode Pitfalls

Zero-width characters: Zero-width space, zero-width joiner, and similar invisible characters exist in user input. They tokenize to unknown tokens or split words unexpectedly. "Hello" with a zero-width space in the middle becomes ["Hel", "lo"] instead of ["Hello"].
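One guard is to strip the common invisible code points before tokenizing. A minimal sketch follows; the character list is a common subset, not exhaustive, and stripping the zero-width joiner can break emoji sequences and some scripts, so adapt it to your data:

```python
# Common invisible characters found in user input (not exhaustive).
ZERO_WIDTH = {
    "\u200b",  # ZERO WIDTH SPACE
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER (meaningful in emoji and some scripts!)
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE / BOM
}

def strip_zero_width(text: str) -> str:
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)

broken = "Hel\u200blo"           # "Hello" with a zero-width space inside
print(strip_zero_width(broken))  # "Hello"
```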

Homoglyphs: Cyrillic "а" looks identical to Latin "a" but has a different codepoint. Attackers use this for phishing. Your model may treat them differently. Normalize or reject non-ASCII lookalikes in security-sensitive applications.
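A rough mixed-script check along those lines; it only catches the Latin/Cyrillic case, and full confusable detection needs the Unicode TR39 tables (e.g. via the confusable_homoglyphs package):

```python
import unicodedata

def mixed_latin_cyrillic(word: str) -> bool:
    """Flag words that mix Latin and Cyrillic letters -- a common homoglyph attack."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("LATIN"):
                scripts.add("Latin")
            elif name.startswith("CYRILLIC"):
                scripts.add("Cyrillic")
    return len(scripts) > 1

print(mixed_latin_cyrillic("pаypal"))  # True: the second character is Cyrillic U+0430
print(mixed_latin_cyrillic("paypal"))  # False: all Latin
```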

Span Alignment

For named entity recognition (NER), you label character spans: "New York" covers characters 0-8 (end-exclusive). Tokenization produces ["New", "York"] or ["New", "Yor", "k"] depending on the algorithm and vocabulary. Mapping character offsets to token offsets requires an alignment table.

⚠️ Alignment Bug: If tokenization does not preserve offset mapping, you cannot convert model predictions back to character positions. Most tokenizers provide offset_mapping output. Always verify this works for your edge cases.
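
A quick check with a Hugging Face fast tokenizer (the model name is just an example; slow, pure-Python tokenizers do not return offsets):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)

text = "New York is large"
enc = tokenizer(text, return_offsets_mapping=True)

for tok, (start, end) in zip(enc.tokens(), enc["offset_mapping"]):
    print(tok, (start, end), repr(text[start:end]))
# Special tokens like [CLS]/[SEP] map to (0, 0). Verify that your own edge cases
# (emoji, accents, zero-width characters) survive the round trip.

# Which tokens cover the entity "New York" (character span [0, 8))?
entity_tokens = [
    i for i, (s, e) in enumerate(enc["offset_mapping"])
    if (s, e) != (0, 0) and s < 8 and e > 0
]
print(entity_tokens)
```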

Subword boundary labels: If "New York" becomes ["New", "Yor", "k"], which tokens get the entity label? Common schemes: label only the first subword and mask the rest (the usual choice with BIO tagging), label all subwords (adds redundancy), or label the first and last (span markers).
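
A sketch of the first-subword scheme, assuming offset tuples like those above; -100 is the index PyTorch's cross-entropy ignores by default, and the function name is illustrative:

```python
def label_first_subword(offset_mapping, entity_spans):
    """Assign each entity's label to its first sub-token only; everything else gets -100.

    entity_spans: list of (char_start, char_end, label), e.g. [(0, 8, "B-LOC")].
    """
    labels, seen = [], set()
    for start, end in offset_mapping:
        if (start, end) == (0, 0):   # special token ([CLS], [SEP], padding)
            labels.append(-100)
            continue
        label = -100
        for i, (es, ee, tag) in enumerate(entity_spans):
            if es <= start < ee:
                label = tag if i not in seen else -100
                seen.add(i)
                break
        labels.append(label)
    return labels

# "New York" split as ["New", "Yor", "k"]: only "New" keeps the entity label.
print(label_first_subword([(0, 3), (4, 7), (7, 8)], [(0, 8, "B-LOC")]))
# ['B-LOC', -100, -100]
```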

💡 Key Takeaways
Train-serve skew: different preprocessing in training vs serving drops accuracy 5-15%
Package preprocessing and tokenizer as single library used by both training and serving
Zero-width characters split words unexpectedly; homoglyphs look identical but are different codepoints
Span alignment maps character offsets to token offsets for NER; verify offset_mapping works
Subword boundary labels: label only the first subword (usual with BIO) or the first and last (span markers)
📌 Interview Tips
1. Show train-serve skew detection: compare top 100 token frequencies between environments
2. Explain the zero-width character problem: an invisible character splits 'Hello' into ['Hel', 'lo']
3. For NER, mention offset_mapping: must preserve character-to-token alignment for predictions