Production Failure Modes: Drift, Truncation, and Domain Mismatch
EMBEDDING DRIFT
As content changes over time, embedding distributions shift. New topics, products, or user segments appear that were not represented in training data. The embedding model never learned to represent these new concepts, so their vectors land in arbitrary locations in the embedding space.
Symptoms: recall degrades gradually over weeks or months; new items cluster poorly with related old items; popular new categories rank below stale old ones. Detection: monitor recall@K on a fresh validation set weekly. If recall falls five points or more below baseline, the embeddings are drifting.
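The weekly check can be sketched in a few lines of numpy. This is a minimal illustration, not a specific library API: `recall_at_k`, `drift_alert`, and the five-point threshold are all hypothetical names and defaults chosen for the example.

```python
import numpy as np

def recall_at_k(query_vecs, doc_vecs, relevant_idx, k=10):
    """Fraction of queries whose known-relevant document appears in
    the top-k nearest neighbors by cosine similarity."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T                            # (n_queries, n_docs)
    topk = np.argsort(-sims, axis=1)[:, :k]   # highest similarity first
    hits = [rel in row for rel, row in zip(relevant_idx, topk)]
    return float(np.mean(hits))

def drift_alert(current_recall, baseline_recall, threshold=0.05):
    """Flag drift when recall falls 5+ points below the stored baseline."""
    return (baseline_recall - current_recall) >= threshold
```

Run this against the same frozen validation set each week; the baseline is the recall measured right after the last retrain.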
Fix: retrain embeddings periodically (monthly to quarterly, depending on content velocity). If full retraining is too expensive, use incremental training: fine-tune on recent data while mixing in a sample of older data so the model does not forget established patterns.
TRUNCATION ARTIFACTS
Most embedding models have token limits: 512 tokens for BERT, 8192 for some newer models. Longer documents get truncated, losing tail content. A 10-page document embedded as its first 512 tokens ignores roughly 90% of the content, potentially the most important 90%.
Fixes: chunk documents into sections, embed each chunk, and aggregate the chunk embeddings (mean, max-pool, or attention-weighted). Or use long-context models that handle 4K-8K tokens. Trade-off: longer context = slower inference = higher compute cost.
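Chunking plus mean or max-pool aggregation can be sketched as below. The 512-token window and 384-token stride (giving 128 tokens of overlap) are illustrative defaults, and attention-weighted pooling is omitted since it depends on the model; assume chunk embeddings come from your encoder.

```python
import numpy as np

def chunk_tokens(tokens, max_len=512, stride=384):
    """Split a token sequence into windows of max_len tokens with
    max_len - stride tokens of overlap, so boundary sentences appear
    in two chunks instead of being cut in half."""
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
    return chunks

def aggregate(chunk_embeddings, how="mean"):
    """Combine per-chunk vectors into one document vector, then
    re-normalize so the result stays usable for cosine search."""
    E = np.asarray(chunk_embeddings, dtype=float)
    v = E.mean(axis=0) if how == "mean" else E.max(axis=0)
    return v / np.linalg.norm(v)
```

The overlap is the design choice worth noting: with a stride equal to max_len, a sentence straddling a chunk boundary is split and neither half embeds well.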
DOMAIN MISMATCH
Pre-trained embeddings (trained on general web text) may not transfer to specialized domains: legal documents, medical records, source code, financial reports. The model never saw your domain vocabulary or writing style.
Detection: embed pairs of domain-specific items you already know are similar and check whether their cosine similarity is high (above roughly 0.8). If many known-similar pairs score below that, the pre-trained model is failing on your domain.
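The check is mechanical once you have the embeddings. A minimal numpy sketch (`known_pair_check` is a hypothetical helper; the 0.8 threshold follows the heuristic above):

```python
import numpy as np

def known_pair_check(vecs_a, vecs_b, threshold=0.8):
    """vecs_a[i] and vecs_b[i] embed items known to be similar.
    Returns the fraction of pairs the model scores at or above the
    cosine-similarity threshold, plus the raw similarities."""
    a = np.asarray(vecs_a, dtype=float)
    b = np.asarray(vecs_b, dtype=float)
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    sims = np.sum(a * b, axis=1)   # row-wise cosine similarity
    return float(np.mean(sims >= threshold)), sims
```

Inspecting the raw similarities, not just the pass rate, also tells you whether the model is marginally off (scores around 0.7) or completely lost (scores near zero).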
Fix: fine-tune on domain data. Collect 10K-100K pairs of similar items from your domain, fine-tune for 1-3 epochs with a pair-based contrastive objective, and validate that recall improves 10-20% on domain-specific benchmarks before deploying.
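The objective typically used with such pairs is an in-batch contrastive loss (often called multiple-negatives ranking loss or InfoNCE). A real fine-tune backpropagates this through the encoder in a training framework; the standalone numpy version below only illustrates what the loss measures, and the 0.05 temperature is an assumed default.

```python
import numpy as np

def in_batch_contrastive_loss(anchors, positives, temperature=0.05):
    """Each anchor's own positive is the correct class; every other
    positive in the batch serves as a negative. Low loss means the
    encoder already pulls true pairs together and pushes others apart."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature                       # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))             # softmax cross-entropy on the diagonal
```

Tracking this loss on a held-out slice of domain pairs before and after fine-tuning gives a quick sanity check alongside the recall benchmark.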