What is Dimensionality Reduction and Why Do We Need It?
THE CURSE OF DIMENSIONALITY
High-dimensional vectors create three problems. First, storage: 1 billion vectors at 768 dimensions and 4 bytes per float is about 3 terabytes (10^9 × 768 × 4 ≈ 3.07 × 10^12 bytes). Second, computation: distance calculations scale linearly with dimensions (768 multiplications per pair). Third, and less obvious: in high dimensions, the relative gap between the distances to the nearest and farthest neighbors shrinks, making all points appear roughly equidistant.
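The storage and compute arithmetic above can be sketched directly; the vector count, dimensionality, and float width are the figures from the text:

```python
# Back-of-the-envelope costs for a billion-vector index:
# 1e9 vectors x 768 dimensions x 4 bytes (float32).
n_vectors = 1_000_000_000
dims = 768
bytes_per_float = 4

storage_bytes = n_vectors * dims * bytes_per_float
storage_tb = storage_bytes / 1e12
print(f"storage: {storage_tb:.2f} TB")  # ~3.07 TB

# A squared-L2 distance needs one multiply per dimension,
# so per-pair cost grows linearly with dims.
mults_per_pair = dims
print(f"multiplications per distance: {mults_per_pair}")
```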
This last effect is called the curse of dimensionality. When every point is roughly equidistant from every other, similarity search fails: you cannot distinguish truly similar items from random ones. Reducing dimensions can actually improve search quality by removing noise dimensions that add spurious distance.
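A small synthetic experiment makes the concentration effect visible. This is a toy illustration, not a benchmark: uniform random points, one query, and the relative contrast (farthest minus nearest distance, divided by nearest) measured as dimensionality grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n=1000):
    """(max - min) / min distance from a random query to n random points."""
    points = rng.uniform(size=(n, dim))
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for dim in (2, 32, 512):
    print(dim, round(relative_contrast(dim), 3))
# The contrast drops sharply as dim grows: nearest and farthest
# neighbors become nearly indistinguishable.
```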
WHAT STRUCTURE TO PRESERVE
Global structure: Keep overall distances and relationships roughly correct. If points A and B were far apart, they should remain far apart. PCA and random projection preserve global linear structure.
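A minimal sketch of how random projection preserves global structure, assuming a Gaussian projection matrix scaled by 1/sqrt(k) (the Johnson-Lindenstrauss construction); the 768-to-128 sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
d_in, d_out, n = 768, 128, 200

X = rng.normal(size=(n, d_in))                       # original vectors
R = rng.normal(size=(d_in, d_out)) / np.sqrt(d_out)  # scaled random projection
Y = X @ R                                            # projected vectors

# Pairwise distances survive approximately: the ratio of projected
# to original distance stays close to 1 for typical pairs.
before = np.linalg.norm(X[0] - X[1])
after = np.linalg.norm(Y[0] - Y[1])
print(f"distance ratio after/before: {after / before:.3f}")  # close to 1
```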
Local structure: Keep nearby neighbors as neighbors. Points that were close should stay close, even if distant points move around. UMAP and t-SNE focus on local structure.
The choice depends on your task. For ANN search, global structure matters—you need distances to remain meaningful. For visualization and clustering, local structure matters—you want clusters to be visible and separable.
LINEAR VS NONLINEAR
Linear methods (PCA): Project data onto a subspace via matrix multiplication. Fast, simple, out-of-sample extension is trivial. Works well when the data lies on or near a linear subspace.
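A minimal PCA sketch via SVD (NumPy only, no scikit-learn) showing why out-of-sample extension is trivial for a linear method: new points are projected with the same centering and matrix multiplication as the training data. The sizes and variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))   # training data: 500 points, 10 dims
k = 3                            # target dimensionality

# Fit: center, then take the top-k right singular vectors.
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
components = Vt[:k]              # top-k principal directions, shape (3, 10)

def transform(points):
    # Out-of-sample extension: center with the training mean, project.
    return (points - mean) @ components.T

X_low = transform(X)             # shape (500, 3)
new_point = rng.normal(size=(1, 10))
print(transform(new_point).shape)  # (1, 3)
```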
Nonlinear methods (UMAP, t-SNE): Learn curved manifolds that better capture complex structure. Better for visualization, but there is no simple out-of-sample transform for new points, and they are computationally expensive.