Failure Modes: Train-Serve Skew, Unicode Pitfalls, and Span Alignment
Train-Serve Skew
Training uses one preprocessing pipeline; serving uses another. Training normalizes with NFKC; serving uses NFC. Training lowercases; serving does not. The model learned patterns on processed text but receives differently processed text at inference. Accuracy drops 5-15% with no obvious error.
Detection: Compare token distributions between training and production. If production generates tokens the model rarely saw during training, you have preprocessing skew. Log the top 100 most frequent tokens in both environments.
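A minimal sketch of that comparison, assuming you can log token streams from both environments (the function name and toy data are illustrative, not from any particular library):

```python
from collections import Counter

def top_token_skew(train_tokens, prod_tokens, k=100):
    """Compare the top-k token frequency lists from two environments.

    Returns the tokens frequent in production but absent from training's
    top-k list -- a rough signal of preprocessing skew.
    """
    train_top = {tok for tok, _ in Counter(train_tokens).most_common(k)}
    prod_top = {tok for tok, _ in Counter(prod_tokens).most_common(k)}
    return prod_top - train_top

# Toy example: serving lowercases, training did not.
train = ["Hello", "world", "Hello"]
prod = ["hello", "world", "hello"]
print(top_token_skew(train, prod, k=2))  # -> {'hello'}
```

In practice you would stream logged tokens rather than hold them in memory, but the set-difference check is the same.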
Fix: Package preprocessing and tokenization as a single library. Training and serving import the same code. Version the tokenizer with the model. Never update preprocessing without retraining.
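One way to sketch the "single library, versioned with the model" idea; the module layout, version string, and metadata key here are assumptions for illustration:

```python
# Hypothetical shared module imported by BOTH training and serving.
import unicodedata

PREPROC_VERSION = "1.2.0"  # bump whenever preprocess() changes, then retrain

def preprocess(text: str) -> str:
    """The single preprocessing path both pipelines must use."""
    return unicodedata.normalize("NFKC", text).lower()

def check_version(model_meta: dict) -> None:
    """Refuse to serve if the model was trained under a different
    preprocessing version (assumes training stamped it into metadata)."""
    if model_meta.get("preproc_version") != PREPROC_VERSION:
        raise RuntimeError("preprocessing/model version mismatch; retrain")
```

The version check turns silent skew into a loud startup failure, which is the behavior you want.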
Unicode Pitfalls
Zero-width characters: Zero-width space, zero-width joiner, and similar invisible characters exist in user input. They tokenize to unknown tokens or split words unexpectedly. "Hello" with a zero-width space in the middle becomes ["Hel", "lo"] instead of ["Hello"].
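A sketch of stripping the common invisible characters before tokenization. Note the caveat in the comments: the zero-width joiner is meaningful in emoji sequences and some scripts, so stripping it unconditionally is a judgment call, not a universal fix:

```python
# Common zero-width characters found in user input.
ZERO_WIDTH = {
    "\u200b",  # zero-width space
    "\u200c",  # zero-width non-joiner
    "\u200d",  # zero-width joiner (meaningful in emoji and Indic scripts!)
    "\ufeff",  # zero-width no-break space / BOM
}

def strip_zero_width(text: str) -> str:
    """Remove zero-width characters that silently split tokens."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)

print(strip_zero_width("Hel\u200blo"))  # -> "Hello"
```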
Homoglyphs: Cyrillic "а" looks identical to Latin "a" but has a different codepoint. Attackers use this for phishing. Your model may treat them differently. Normalize or reject non-ASCII lookalikes in security-sensitive applications.
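A crude mixed-script detector, using the first word of each character's Unicode name as a script proxy. This is a simplification of proper confusable detection (Unicode's security mechanisms define the real thing), but it catches the Cyrillic-in-Latin case:

```python
import unicodedata

def scripts_in(token: str) -> set:
    """Crude script detection: first word of each character's Unicode name
    (e.g. LATIN, CYRILLIC). A proxy, not a full script-property lookup."""
    return {unicodedata.name(ch).split()[0] for ch in token if ch.isalpha()}

def is_mixed_script(token: str) -> bool:
    """Flag tokens mixing scripts -- a likely homoglyph attack."""
    return len(scripts_in(token)) > 1

print(is_mixed_script("p\u0430ypal"))  # Cyrillic "\u0430" inside Latin -> True
print(is_mixed_script("paypal"))       # all Latin -> False
```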
Span Alignment
For named entity recognition (NER), you label character spans: "New York" is characters 0-8. Tokenization produces ["New", "York"] or ["New", "Yor", "k"] depending on the algorithm. Mapping character offsets to token offsets requires alignment tables.
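A minimal alignment sketch, assuming per-token character offsets are available (fast tokenizers typically expose such a table; the offsets below are constructed by hand for illustration):

```python
def char_span_to_token_span(offsets, start, end):
    """Map a character span [start, end) onto token indices.

    `offsets` is a list of (char_start, char_end) pairs, one per token.
    Returns (first_token, last_token_exclusive), or None if the span
    overlaps no token.
    """
    hits = [i for i, (s, e) in enumerate(offsets) if s < end and e > start]
    return (hits[0], hits[-1] + 1) if hits else None

# "New York" -> tokens ["New", "Yor", "k"] with these character offsets:
offsets = [(0, 3), (4, 7), (7, 8)]
print(char_span_to_token_span(offsets, 0, 8))  # -> (0, 3): all three tokens
```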
Subword boundary labels: If "New York" becomes ["New", "Yor", "k"], which tokens get the entity label? Common schemes: label only the first subword of each word and mask the rest from the loss (standard in BERT-style fine-tuning), label all subwords (redundant but simple), or mark only the first and last subwords (span markers). The BIO alphabet (B-, I-, O) then encodes where each entity begins and continues.
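The first-subword scheme can be sketched as follows, assuming a `word_ids` list that reports which word each subword token came from (the kind of mapping fast tokenizers provide; the names here are illustrative):

```python
def label_first_subword(word_ids, word_labels, ignore_index=-100):
    """Assign labels under the "first subword only" scheme.

    `word_ids[i]` is the word index a subword token came from (None for
    special tokens). Only the first subword of each word keeps its label;
    the rest get `ignore_index` so the loss skips them.
    """
    labels, prev = [], None
    for wid in word_ids:
        if wid is None:
            labels.append(ignore_index)   # special token: never scored
        elif wid != prev:
            labels.append(word_labels[wid])  # first subword of a word
        else:
            labels.append(ignore_index)   # continuation subword: masked
        prev = wid
    return labels

# "New York" -> subwords ["New", "Yor", "k"]; word labels: B-LOC, I-LOC
print(label_first_subword([0, 1, 1], ["B-LOC", "I-LOC"]))
# -> ['B-LOC', 'I-LOC', -100]
```

Labeling all subwords instead would replace the final `ignore_index` branch with `word_labels[wid]`; both variants appear in practice.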