Principal Component Analysis (PCA) for Online Systems
HOW PCA WORKS
Principal Component Analysis finds directions in the data that capture the most variance. Imagine a cloud of points in 3D space that is flat like a pancake—most of the spread is in two directions. PCA finds those two directions (principal components) and projects points onto them, reducing from 3D to 2D with minimal information loss.
Mathematically, PCA centers the data by subtracting the per-dimension mean, computes the covariance matrix, then finds its eigenvectors. The eigenvector with the largest eigenvalue is the first principal component—the direction of maximum variance. The second component is perpendicular to the first and captures the next-most variance, and so on. You keep the top k components and discard the rest.
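The procedure above can be sketched in a few lines of numpy (the toy "pancake" data and variable names are illustrative, not from any particular system):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))   # toy 3D point cloud
X[:, 2] *= 0.05                  # squash one axis -> pancake-shaped cloud

# Center, form the covariance matrix, and eigendecompose it.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# Keep the top-k eigenvectors (largest eigenvalues) as the projection matrix.
k = 2
W = eigvecs[:, ::-1][:, :k]             # shape (3, 2)
X_reduced = Xc @ W                      # 3D -> 2D with minimal variance lost
```

Here the discarded third component carries almost no variance, which is exactly the "pancake" case described above.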
COMPUTATIONAL COST
Computing full PCA on N vectors of dimension D costs O(N × D²) for the covariance matrix and O(D³) for the eigendecomposition. For 100 million vectors at 768 dimensions, this is prohibitively expensive.
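Plugging in the numbers above gives a back-of-envelope FLOP count (constants ignored):

```python
# Rough cost of exact PCA at the scale quoted above.
N, D = 100_000_000, 768
cov_flops = N * D * D   # O(N * D^2): accumulating the covariance matrix
eig_flops = D ** 3      # O(D^3): eigendecomposition, tiny by comparison
print(f"covariance: {cov_flops:.1e} FLOPs, eigendecomposition: {eig_flops:.1e} FLOPs")
```

The covariance accumulation alone is on the order of 10^13 operations; the D³ eigendecomposition term is negligible next to it.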
Solution: randomized SVD. Instead of exact eigen decomposition, use randomized algorithms that approximate the top k components in O(N × D × k) time. This is orders of magnitude faster for large-scale data. Libraries like sklearn and FAISS implement randomized PCA.
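A sketch using scikit-learn's randomized solver (the random matrix here is just a stand-in for real embedding vectors):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 768)).astype(np.float32)  # stand-in for embedding vectors

# svd_solver="randomized" approximates the top-k components without an
# exact eigendecomposition: roughly O(N * D * k) instead of O(N * D^2 + D^3).
pca = PCA(n_components=128, svd_solver="randomized", random_state=0)
pca.fit(X)

W = pca.components_.T        # (768, 128) projection matrix
reduced = pca.transform(X)   # (5000, 128)
```

For datasets too large to fit in memory, the same library offers IncrementalPCA, which fits on mini-batches.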
PROJECTION IS A MATRIX MULTIPLY
After training, PCA produces a projection matrix W of shape (D, k) plus the training mean. Reducing a new vector is a subtraction and a single matrix multiplication: reduced = (original - mean) @ W. This is O(D × k) per vector—fast enough for online inference.
Example: a 768-dim to 128-dim projection is 768 × 128 = 98,304 multiply-adds per vector. At 10 billion operations per second, that is about 10 microseconds—negligible compared to embedding model inference (10-50 ms).
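At serving time the whole reduction is one centering and one matrix multiply. A minimal sketch (W and mean are placeholders for the trained projection and training mean):

```python
import numpy as np

D, k = 768, 128
rng = np.random.default_rng(0)
W = rng.normal(size=(D, k)).astype(np.float32)  # placeholder: trained projection matrix
mean = rng.normal(size=D).astype(np.float32)    # placeholder: training-set mean

def reduce(v: np.ndarray) -> np.ndarray:
    # Center with the training mean, then project: O(D * k) per vector.
    return (v - mean) @ W

v = rng.normal(size=D).astype(np.float32)
out = reduce(v)  # shape (128,)
```

Batching queries turns this into a single (B, D) @ (D, k) matmul, which BLAS handles efficiently.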
HOW MUCH TO REDUCE
Rule of thumb: keep enough components to explain 90-95% of variance. For typical text embeddings (768-dim), reducing to 128-256 dims often retains 90%+ variance. Verify by measuring recall@k before and after reduction on your actual retrieval task.
If recall drops significantly, you are losing signal. Try reducing less aggressively (256 dims instead of 128). If recall is unchanged, you can likely reduce further (64 dims). The right target depends on your data distribution.
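The recall check above can be done with a small brute-force harness. A self-contained sketch (synthetic embeddings with a decaying per-dimension variance stand in for real data; the helper names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D, k_dims, k_neighbors = 64, 16, 10
# Synthetic embeddings whose variance decays by dimension, like real embeddings.
scale = np.exp(-np.arange(D) / 10.0)
corpus = rng.normal(size=(500, D)) * scale
queries = rng.normal(size=(20, D)) * scale

def normalize(X):
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def topk(q, c, k):
    # Brute-force top-k neighbors by dot product on normalized vectors.
    return np.argsort(-(q @ c.T), axis=1)[:, :k]

# PCA via SVD of the centered corpus; keep k_dims of D dimensions.
mean = corpus.mean(axis=0)
_, _, Vt = np.linalg.svd(corpus - mean, full_matrices=False)
W = Vt[:k_dims].T

# Ground truth in the full space vs. neighbors in the reduced space.
truth = topk(normalize(queries), normalize(corpus), k_neighbors)
approx = topk(normalize((queries - mean) @ W),
              normalize((corpus - mean) @ W), k_neighbors)
recall = np.mean([len(set(t) & set(a)) for t, a in zip(truth, approx)]) / k_neighbors
```

Sweep k_dims (e.g. 64, 128, 256 for 768-dim embeddings) and pick the smallest value whose recall matches the full-dimensional baseline.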