Dimensionality Reduction (PCA, UMAP)
What is Dimensionality Reduction and Why Do We Need It?
Dimensionality reduction maps high dimensional data to a lower dimensional space while preserving the structure that matters for downstream tasks. In production ML systems, vectors are everywhere: 512 to 1536 dimensional text or image embeddings, hundreds of engineered features for ranking, tens of thousands of sparse features after hashing. These high dimensional vectors are expensive to store, index, and compare.
Consider a real world vector search service for a product catalog serving 40,000 queries per second at p95 latency under 50 milliseconds. The catalog has 50 million items, each with a 768 dimensional float32 embedding. Raw storage for these vectors is 768 times 4 bytes, about 3 kilobytes per item, totaling roughly 150 gigabytes. Building an Approximate Nearest Neighbor (ANN) index over the full 768 dimensions is also slow and memory intensive, because every distance computation touches all 768 components.
Applying dimensionality reduction offline to project from 768 to 128 dimensions slashes storage to about 25 gigabytes for the same 50 million items. Index construction time drops by 2 to 4 times because each distance computation becomes cheaper. Measured p95 query latency can drop from 20 milliseconds to 7 to 10 milliseconds, with recall@10 loss kept under 1 percent when the target dimension k is chosen by cross validating downstream retrieval metrics rather than relying on explained variance alone.
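To make the 768 to 128 projection concrete, here is a minimal sketch using scikit-learn's PCA. The random corpus, its size, and the 128 component target are illustrative assumptions, not the service's actual pipeline; the key point is that the same fitted projection is applied to corpus vectors offline and to query vectors at serving time.

```python
# Minimal sketch, assuming scikit-learn and NumPy; shapes and data are illustrative.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((50_000, 768)).astype(np.float32)  # stand-in corpus

# Fit the projection offline on a representative sample of the corpus.
pca = PCA(n_components=128, random_state=0)
pca.fit(embeddings)

# Project the corpus now, and each incoming query later, with the same matrix.
reduced = pca.transform(embeddings).astype(np.float32)

print(reduced.shape)                        # (50000, 128)
print(embeddings.nbytes / reduced.nbytes)   # 6.0x memory reduction
```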
💡 Key Takeaways
• High dimensional vectors are expensive to store (768D float32 = 3KB per item), index, and compare at scale with millions or billions of items
• Dimensionality reduction cuts memory by 6x or more (150GB to 25GB for 50M items) and speeds up index construction by 2 to 4 times
• Query latency can drop significantly (20ms to 7 to 10ms p95) while keeping recall loss under 1 percent with proper k selection
• The technique improves signal to noise ratio by discarding low variance directions dominated by noise
• Common use cases include vector search for product catalogs, recommendation systems, and feature compression for ranking models
• Choose the target dimension k by cross validating end to end metrics like recall@10 or NDCG, not just explained variance thresholds (see the sketch after this list)
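The last point is worth spelling out. Below is a hedged sketch of picking k by a downstream retrieval metric; recall_at_10, pick_k, the candidate k values, and the 1 percent loss budget are hypothetical names and numbers chosen for illustration, and ground_truth stands in for whatever relevance labels your evaluation set provides.

```python
# Hedged sketch: choose k by recall@10 on held-out queries, not explained variance.
import numpy as np
from sklearn.decomposition import PCA

def recall_at_10(query_vecs, corpus_vecs, ground_truth):
    # Brute-force dot-product retrieval; ground_truth[i] is the set of relevant
    # corpus indices for query i. Returns the fraction of queries with a hit in the top 10.
    sims = query_vecs @ corpus_vecs.T
    top10 = np.argsort(-sims, axis=1)[:, :10]
    hits = [len(set(t) & g) > 0 for t, g in zip(top10, ground_truth)]
    return float(np.mean(hits))

def pick_k(queries, corpus, ground_truth, candidates=(64, 96, 128, 192, 256)):
    baseline = recall_at_10(queries, corpus, ground_truth)   # full-dimensional recall
    for k in candidates:                                      # smallest k within a 1% loss budget
        pca = PCA(n_components=k).fit(corpus)
        r = recall_at_10(pca.transform(queries), pca.transform(corpus), ground_truth)
        if baseline - r < 0.01 * baseline:
            return k, r
    return candidates[-1], r
```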
📌 Examples
Product catalog vector search: 50M items with 768D embeddings reduced to 128D, serving 40K QPS with p95 latency dropping from 20ms to 7-10ms
Network bandwidth optimization: Apply PCA before transmitting embeddings to reduce payload size from 3KB to 512 bytes per vector
ANN index compression: 100M vectors at 256D, quantized with 8 bit codes, fit in about 1.6GB of resident RAM, enabling sub 10ms queries (see the sketch below)
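One way to realize that last example is product quantization inside an IVF index. The sketch below uses FAISS's IndexIVFPQ with 16 subquantizers of 8 bits each, an assumed configuration that yields 16 bytes per vector (100M vectors times 16 bytes is about 1.6GB); the random data, nlist, and nprobe values are placeholders scaled down so the snippet runs quickly.

```python
# Hedged sketch of IVF-PQ compression, assuming the faiss-cpu package is installed.
import numpy as np
import faiss

d, nlist, m, nbits = 256, 1024, 16, 8            # 16 codes x 8 bits = 16 bytes per vector
xb = np.random.default_rng(0).standard_normal((100_000, d)).astype(np.float32)

quantizer = faiss.IndexFlatL2(d)                  # coarse quantizer for the IVF lists
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xb)                                   # learn coarse centroids and PQ codebooks
index.add(xb)                                     # vectors are stored as 16-byte PQ codes

index.nprobe = 16                                 # probe more lists to trade latency for recall
distances, ids = index.search(xb[:5], 10)         # approximate top-10 neighbors for 5 queries
print(ids.shape)                                  # (5, 10)
```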