
Production Implementation and Failure Modes

TRAINING-SERVING SKEW

The most common PCA failure is training-serving skew. You fit PCA on historical data, then serve on new data with different characteristics. As the embedding distribution drifts, the principal components become stale—they no longer capture the directions of maximum variance.

Symptoms: recall drops over time without any change to embeddings or index. New content types cluster in unexpected ways. Performance degrades for specific query segments.

Mitigation: retrain PCA periodically (weekly to monthly depending on drift rate). Monitor explained variance on fresh data—if it drops below 85% of training-time variance, trigger retraining. Include new content types in training samples.
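A minimal numpy sketch of such a drift check, assuming you logged the explained-variance ratio at training time and hold the projection as a `(k, d)` components matrix `W` plus the training mean. All names here (`explained_variance_ratio`, the 0.85 threshold applied to a logged `train_time_ratio`) are illustrative:

```python
import numpy as np

def explained_variance_ratio(X, mean, W):
    """Fraction of X's total variance captured by projecting onto components W.

    X: (n, d) fresh embeddings; mean: (d,) training mean; W: (k, d) components.
    """
    Xc = X - mean
    total_var = np.sum(Xc ** 2)
    projected = Xc @ W.T               # (n, k) coordinates in PCA space
    captured_var = np.sum(projected ** 2)
    return captured_var / total_var

# Stand-in artifacts: a random orthonormal (k, d) projection and zero mean.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(128, 32)))
W = Q.T                                # (32, 128)
mean = np.zeros(128)

# Fresh embeddings drawn from a distribution the PCA was never fit on.
fresh = rng.normal(size=(1000, 128))

train_time_ratio = 0.60                # logged when PCA was fit (assumed)
fresh_ratio = explained_variance_ratio(fresh, mean, W)

# Trigger retraining when fresh-data variance drops below 85% of
# the training-time value.
needs_retrain = fresh_ratio < 0.85 * train_time_ratio
```

Here the stale components capture far less variance on the drifted data than they did at training time, so the check fires.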

DATA LEAKAGE

If you train PCA on the full dataset including test examples, your evaluation is contaminated. The projection is optimized for the specific vectors you will query, artificially inflating recall metrics.

Fix: strict train-test split before PCA fitting. Train PCA only on training vectors. Apply the learned projection to test vectors as if they were out-of-sample.
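The split-then-fit discipline can be sketched in a few lines of numpy (using SVD as a minimal stand-in for a PCA library; the shapes and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 128))              # stand-in embeddings

# Split BEFORE fitting: PCA must never see the test vectors.
idx = rng.permutation(len(X))
X_train, X_test = X[idx[:1600]], X[idx[1600:]]

# Fit PCA on training vectors only.
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
W = Vt[:32]                                   # (k, d) top-k components

# Apply the learned mean and projection to test vectors as
# out-of-sample data, exactly as serving-time queries would be handled.
X_test_reduced = (X_test - mean) @ W.T
```

The key property is that `mean` and `W` depend only on `X_train`; fitting on the full `X` instead is exactly the leakage described above.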

CENTERING AND NORMALIZATION

PCA assumes zero-mean data. If you forget to center (subtract the mean), the first principal component points toward the mean instead of capturing variance. Always: (1) compute mean on training data, (2) subtract mean before projection, (3) store mean for serving.

For cosine similarity downstream, re-normalize vectors after reduction. Projection changes vector norms—even embeddings that were unit length before reduction generally are not afterward—so normalize the reduced vectors to ensure cosine similarity (or an inner-product index) behaves correctly.

VERSION MANAGEMENT

The PCA projection matrix is a model artifact. Version it alongside your embedding model. If you update embeddings, the old PCA matrix may not align with new embedding dimensions or distributions.

Store: projection matrix W, training mean vector, explained variance per component, training data statistics. Deploy projection and mean together. Log which version served each query for debugging.
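One lightweight way to bundle these artifacts is a single `.npz` file keyed by version. This is an illustrative sketch (the helper names, the `pca-v3-embed-v7` version string, and the field layout are assumptions, not a standard format):

```python
import os
import tempfile
import numpy as np

def save_pca_artifact(path, W, mean, explained_variance, version):
    """Persist everything needed to reproduce the projection at serving time."""
    np.savez(
        path,
        W=W,                                     # (k, d) projection matrix
        mean=mean,                               # (d,) training mean
        explained_variance=explained_variance,   # per-component variance
        version=np.array(version),               # tie to the embedding model version
    )

def load_pca_artifact(path):
    a = np.load(path, allow_pickle=False)
    return a["W"], a["mean"], a["explained_variance"], str(a["version"])

# Round-trip the bundle through disk.
with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "pca_v3.npz")
    save_pca_artifact(
        p,
        W=np.eye(4)[:2],                 # toy (k=2, d=4) projection
        mean=np.zeros(4),
        explained_variance=np.array([0.6, 0.2]),
        version="pca-v3-embed-v7",       # hypothetical version tag
    )
    W, mean, ev, version = load_pca_artifact(p)
```

Because `W` and `mean` live in one artifact, it is impossible to deploy the projection without its matching mean, which directly guards against the serving-time bug called out below.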

❗ Critical: Always subtract the training mean before projection at serving time. Forgetting this is the most common PCA bug—queries project incorrectly and recall drops 10-20%.
💡 Key Takeaways
Training-serving skew: PCA on old data does not capture new variance directions
Data leakage: train PCA only on training vectors, apply to test as out-of-sample
Always center data—subtract training mean before projection, store mean for serving
Version PCA artifacts with embedding model; redeploy together on updates
📌 Interview Tips
1. Interview Tip: Describe the centering bug—forgetting to subtract the mean makes the first component point at the mean instead of a variance direction.
2. Interview Tip: Explain how to detect stale PCA—monitor explained variance on fresh data and retrain when it drops significantly.