Production Implementation and Failure Modes
TRAINING-SERVING SKEW
The most common PCA failure is training-serving skew. You fit PCA on historical data, then serve on new data with different characteristics. As the embedding distribution drifts, the fitted components go stale—they no longer align with the directions of maximum variance in the live data.
Symptoms: recall drops over time without any change to embeddings or index. New content types cluster in unexpected ways. Performance degrades for specific query segments.
Mitigation: retrain PCA periodically (weekly to monthly depending on drift rate). Monitor explained variance on fresh data—if it drops below 85% of training-time variance, trigger retraining. Include new content types in training samples.
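The variance check above can be sketched as follows. This is a minimal example, not a reference implementation: the function and variable names are made up for illustration, and it assumes the component matrix W has orthonormal columns (as produced by standard PCA) so that projected squared norms measure captured variance directly.

```python
import numpy as np

def explained_variance_fraction(X, mean, W):
    """Fraction of the variance in X captured by the PCA projection.

    X:    (n, d) batch of fresh embedding vectors
    mean: (d,)   mean vector stored at training time
    W:    (d, k) projection matrix with orthonormal columns
    """
    Xc = X - mean                      # center with the TRAINING mean
    total = np.sum(Xc ** 2)            # total variance (up to a 1/n factor)
    captured = np.sum((Xc @ W) ** 2)   # variance retained after projection
    return captured / total

# Hypothetical monitoring check: retrain when fresh-data variance falls
# below 85% of the fraction measured on the training set.
def needs_retrain(fresh_fraction, train_fraction, threshold=0.85):
    return fresh_fraction < threshold * train_fraction
```

Run this on a sample of recent production vectors; a single batch statistic per day is usually enough to see drift.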
DATA LEAKAGE
If you train PCA on the full dataset including test examples, your evaluation is contaminated. The projection is optimized for the specific vectors you will query, artificially inflating recall metrics.
Fix: strict train-test split before PCA fitting. Train PCA only on training vectors. Apply the learned projection to test vectors as if they were out-of-sample.
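A minimal sketch of that fix with scikit-learn, using a synthetic embedding matrix as a stand-in for real data (the shapes and component count here are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an embedding matrix: n vectors of dimension d.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))

# Split FIRST, then fit PCA on the training portion only.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

pca = PCA(n_components=16).fit(X_train)  # projection learned from train only
X_test_reduced = pca.transform(X_test)   # test vectors treated as out-of-sample
```

The key point is the ordering: `fit` never sees `X_test`, so recall measured on the reduced test vectors reflects genuine out-of-sample behavior.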
CENTERING AND NORMALIZATION
PCA assumes zero-mean data. If you forget to center (subtract the mean), the first principal component points toward the mean instead of capturing variance. Always: (1) compute mean on training data, (2) subtract mean before projection, (3) store mean for serving.
For cosine similarity downstream, normalize vectors after reduction. Projection does not preserve norms, so reduced vectors are generally not unit length even if the originals were. Renormalizing after reduction ensures cosine similarity behaves correctly.
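The serving-time pipeline described above—center with the stored training mean, project, then L2-normalize—can be sketched in a few lines (the function name is hypothetical):

```python
import numpy as np

def reduce_for_cosine(x, mean, W):
    """Center, project, and L2-normalize a vector for cosine similarity.

    x:    (d,)   query or document embedding
    mean: (d,)   mean vector computed on training data and stored
    W:    (d, k) learned PCA projection matrix
    """
    z = (x - mean) @ W              # subtract TRAINING mean, then project
    norm = np.linalg.norm(z)
    return z / norm if norm > 0 else z   # guard against the zero vector
```

After this step, cosine similarity between two reduced vectors is just their dot product.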
VERSION MANAGEMENT
The PCA projection matrix is a model artifact. Version it alongside your embedding model. If you update embeddings, the old PCA matrix may not align with new embedding dimensions or distributions.
Store: projection matrix W, training mean vector, explained variance per component, training data statistics. Deploy projection and mean together. Log which version served each query for debugging.
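One way to keep the projection matrix and its mean deployed together is to bundle them in a single versioned file; a minimal sketch using NumPy's `.npz` archives (the function names and the exact set of stored fields are assumptions, not a prescribed format):

```python
import numpy as np

def save_pca_artifact(path, W, mean, explained_variance, version):
    """Bundle projection matrix, training mean, and metadata in one file."""
    np.savez(path, W=W, mean=mean,
             explained_variance=explained_variance,
             version=np.array(version))

def load_pca_artifact(path):
    """Load the bundle; W and mean always come back together."""
    data = np.load(path, allow_pickle=False)
    return (data["W"], data["mean"],
            data["explained_variance"], str(data["version"]))
```

Because W and the mean live in one artifact, a deployment cannot pick up a projection matrix without its matching mean; log the `version` string with each served query.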