Natural Language Processing Systems • Text Classification at Scale
Production Failure Modes and Mitigation Strategies
Long documents are a common failure mode for transformer encoders, which truncate by default at 128 to 512 tokens. Critical signals beyond the window are lost, causing unexplained false negatives on long contracts, legal documents, or application logs. The telltale symptom is accuracy dropping sharply as a function of document length: a model achieving 0.88 F1 on 200 token documents might drop to 0.65 F1 on 2,000 token documents.
Mitigations include smart chunking, hierarchical pooling, and long context models. Smart chunking splits by semantic boundaries like paragraphs or sections, computes chunk embeddings, then pools with attention or max pooling into a single document embedding. This preserves signal from the entire document. Long context models like Longformer or recent transformer variants handle 4,096 to 16,384 tokens but require 4 to 8 times more compute. Monitor performance as a function of document length and set alerts when accuracy degrades beyond acceptable thresholds.
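A minimal sketch of the chunk-then-pool approach, assuming the sentence-transformers library; the encoder name, chunk size, and downstream classifier are illustrative assumptions, not prescriptions from this text:

```python
# Smart chunking with max pooling: split on paragraph boundaries,
# embed each chunk, and element-wise max-pool into one document vector.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def embed_document(text: str, max_chunk_chars: int = 2000) -> np.ndarray:
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chunk_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    chunks = chunks or [text]
    chunk_embs = encoder.encode(chunks)   # shape: (n_chunks, dim)
    return chunk_embs.max(axis=0)         # element-wise max pool

# doc_emb = embed_document(long_contract_text)
# label = downstream_classifier.predict(doc_emb[None, :])  # hypothetical
```

Max pooling keeps the strongest signal from any chunk, which suits detection-style labels; attention pooling trades simplicity for learned weighting across chunks.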
Label drift and taxonomy churn invalidate training data when new product lines launch or policies change. Accuracy collapses on emerging labels that the model has never seen. A product classifier trained before a new category launch will misclassify all new items into adjacent categories. Zero shot routing can bridge the gap by adding the new label descriptions immediately without retraining. Follow with supervised retraining as labeled data accumulates. Maintain backward compatible label mappings and deprecate old labels gradually over weeks or months, not overnight.
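A hedged sketch of zero shot routing using Hugging Face's zero-shot-classification pipeline; the model choice and label set below are illustrative assumptions:

```python
# Add a new label description immediately, without retraining.
from transformers import pipeline

zero_shot = pipeline("zero-shot-classification",
                     model="facebook/bart-large-mnli")

labels = [
    "electronics",
    "home appliances",
    "smart home device",   # new category, no labeled data yet
]

result = zero_shot("Wi-Fi enabled thermostat with voice control",
                   candidate_labels=labels)
print(result["labels"][0], result["scores"][0])
```

Route items scoring highest on the new label to it immediately, then swap in a supervised model once labeled examples accumulate.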
Adversarial behavior and evasion degrade lexical models. Attackers obfuscate keywords with unicode trickery, deliberate misspellings, or zero width characters to bypass spam filters. Character and subword models are more robust than word level models. Normalize text aggressively, removing unusual unicode and canonicalizing characters. Apply adversarial training by generating synthetic evasion examples and including them in training data. Monitor false negatives from abuse reports and user flags as a signal of emerging evasion tactics.
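One way the aggressive normalization step might look in Python; the homoglyph map is a small illustrative sample, not an exhaustive defense, and mapping digits to letters would need gating in domains with legitimate numbers:

```python
# NFKC canonicalization, zero-width character removal, and a
# (deliberately tiny, assumed) homoglyph substitution table.
import unicodedata

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}
HOMOGLYPHS = str.maketrans({"@": "a", "0": "o", "1": "i", "$": "s"})

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)            # canonical forms
    text = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    return text.casefold().translate(HOMOGLYPHS)

print(normalize("V1@gra\u200b"))  # -> "viagra"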
Serving tail latency can spike under dynamic batching. Dynamic batching improves throughput but causes P99 latency spikes under low traffic because items wait for the batch window to fill. At 10 requests per second with a 10 millisecond batch window, most batches contain only 1 to 2 items and still pay the full 10 millisecond delay. Apply a timeout based flush that processes partial batches when the window expires, separate latency critical tenants onto dedicated instances with smaller batch windows or no batching, and implement backpressure to shed load gracefully under overload.
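A rough asyncio sketch of the timeout based flush; queue items are assumed to be (input, future) pairs produced by request handlers, and MAX_BATCH, WINDOW_MS, and model_infer are illustrative names:

```python
import asyncio

MAX_BATCH = 32   # assumed batch cap
WINDOW_MS = 10   # batch window from the example above

async def batch_worker(queue: asyncio.Queue, model_infer):
    """Queue items are (input, asyncio.Future) pairs."""
    while True:
        batch = [await queue.get()]              # block for the first item
        loop = asyncio.get_running_loop()
        deadline = loop.time() + WINDOW_MS / 1000
        while len(batch) < MAX_BATCH:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break                            # window expired
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break                            # flush the partial batch
        inputs, futures = zip(*batch)
        for fut, pred in zip(futures, model_infer(list(inputs))):
            fut.set_result(pred)                 # unblock each caller
```

The key design point is that a partial batch is flushed as soon as the window expires, so a lone request never waits longer than WINDOW_MS for company.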
Data leakage and near duplicates inflate evaluation metrics. If training and test sets contain duplicated or near duplicate items, the model memorizes rather than generalizes. Symptoms include validation accuracy of 0.95 but production accuracy of 0.75. De-duplicate with robust hashing like MinHash and near duplicate detection with cosine similarity thresholds, typically 0.95 or higher, before splitting train and test sets. Maintain a holdout set from a different time period to catch temporal overfitting.
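A possible de-duplication pass using the datasketch library's MinHash and MinHashLSH; note that the LSH threshold here is a Jaccard similarity over character shingles, standing in for the cosine threshold mentioned above, and the 5-character shingle size is an assumption:

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128

def minhash(text: str) -> MinHash:
    """Hash 5-character shingles of the document."""
    m = MinHash(num_perm=NUM_PERM)
    for shingle in {text[i:i + 5] for i in range(max(len(text) - 4, 1))}:
        m.update(shingle.encode("utf8"))
    return m

def dedup(corpus: list[str]) -> list[str]:
    """Keep only the first copy of each near-duplicate cluster."""
    lsh = MinHashLSH(threshold=0.95, num_perm=NUM_PERM)
    kept = []
    for i, doc in enumerate(corpus):
        m = minhash(doc)
        if not lsh.query(m):             # no near-duplicate kept so far
            lsh.insert(f"doc-{i}", m)
            kept.append(doc)
    return kept

# Run dedup() over the full corpus first, then split into train and test.
```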
Generative model hazards include prompt injection, inconsistent label formatting, and verbosity. Users can manipulate prompts to bypass safety filters or produce incorrect labels. Constrain outputs with strict post-processing, use classification oriented prompting that specifies output format explicitly, and verify outputs against a label whitelist. Costs and rate limits can trigger cascading backlogs under traffic spikes. Implement caching for repeated queries, aggressive timeouts, and fast fallback to discriminative models when generative models are overloaded or slow.
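A small sketch of whitelist verification with a discriminative fallback; the label set, generative client call, and fallback model are hypothetical names:

```python
# Post-process the generative model's raw text and verify it against
# a label whitelist; anything unexpected falls back to the cheaper model.
ALLOWED_LABELS = {"billing", "shipping", "returns", "other"}  # assumed taxonomy

def parse_label(raw_output: str) -> str | None:
    """Strip whitespace/punctuation and check the whitelist."""
    candidate = raw_output.strip().strip('."\'').lower()
    return candidate if candidate in ALLOWED_LABELS else None

# raw = call_generative_model(prompt)            # hypothetical client call
# label = parse_label(raw)
# if label is None:                              # injection or malformed output
#     label = fallback_classifier.predict(text)  # discriminative fallback
```

The same fallback path doubles as the overload escape hatch: on timeout or rate limiting, skip the generative call entirely and serve the discriminative prediction.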
💡 Key Takeaways
• Transformer truncation at 512 tokens causes F1 to drop from 0.88 on short documents to 0.65 on 2,000 token documents; mitigate with smart chunking and hierarchical pooling
• Label drift from new categories collapses accuracy to near zero; use zero shot routing for immediate coverage, then supervised retraining as labeled data accumulates over weeks
• Adversarial evasion with unicode trickery and misspellings bypasses lexical models; apply character level models, aggressive normalization, and adversarial training with synthetic evasion examples
• Dynamic batching causes P99 latency spikes under low traffic; apply timeout based flush at 5 to 10 milliseconds and separate latency critical tenants onto dedicated no batching instances
• Data leakage from near duplicates inflates validation accuracy to 0.95 while production drops to 0.75; de-duplicate with MinHash and cosine similarity above 0.95 before the train test split
📌 Examples
Legal contract classifier: F1 of 0.89 on 300 token documents drops to 0.62 on 3,000 token contracts. Implement chunking with 512 token chunks and max pooling across chunks; F1 recovers to 0.83
Spam filter evasion: Attacker uses "V1@gra" and zero width spaces to bypass keyword filters. Switch to a character level BERT and add adversarial examples to training; evasion success rate drops from 35% to 8%
Product taxonomy drift: Launch of a smart home category causes 2,000 misclassifications per day into electronics. Add zero shot labels immediately, collect 500 labeled examples per week, and retrain after 4 weeks to reach 0.86 F1