UMAP for Offline Visualization and Clustering
WHAT UMAP DOES DIFFERENTLY
UMAP (Uniform Manifold Approximation and Projection) is a nonlinear dimensionality reduction method that prioritizes local neighborhood structure. It builds a graph in which each point connects to its k nearest neighbors, then searches for a low-dimensional layout that preserves those neighborhood relationships as closely as possible.
Unlike PCA, which projects onto a linear subspace, UMAP can unroll curved manifolds. If your data lies on a Swiss roll (a 2D surface curled up in 3D), PCA flattens the roll so that points on different layers land on top of each other, which destroys the structure. UMAP can unroll it back to 2D, so points that are neighbors along the surface stay neighbors in the plot.
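To see why a linear projection struggles with a Swiss roll, compare the straight-line distance between two points on adjacent layers with the distance measured along the surface. The sketch below is only an illustration: the parameterization (a spiral with radius equal to the angle t) and the function names are assumptions made for this example, not part of UMAP.

```python
import math

def swiss_roll_point(t, height=0.0):
    """Point on a Swiss roll: a spiral in the x-z plane, extruded along y."""
    return (t * math.cos(t), height, t * math.sin(t))

def along_roll_distance(t0, t1):
    """Arc length of the spiral r = t between angles t0 and t1 (closed form)."""
    def antiderivative(t):
        # Integral of sqrt(1 + t^2) dt
        return 0.5 * (t * math.sqrt(1.0 + t * t) + math.asinh(t))
    return abs(antiderivative(t1) - antiderivative(t0))

# Two points at the same angle, exactly one full turn apart.
t_inner, t_outer = 2 * math.pi, 4 * math.pi
p, q = swiss_roll_point(t_inner), swiss_roll_point(t_outer)

ambient = math.dist(p, q)                          # straight line through 3D space
geodesic = along_roll_distance(t_inner, t_outer)   # path that stays on the surface

print(f"straight-line distance in 3D: {ambient:.2f}")
print(f"distance along the surface:   {geodesic:.2f}")
```

The two points look close in 3D but are far apart along the surface. A linear projection only sees the first distance. A neighbor graph built from small local steps approximates the second.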
HOW UMAP WORKS
Step 1: Build a weighted k-nearest-neighbor graph in the original high-dimensional space. Edge weights decrease with distance: nearby neighbors have strong connections, distant neighbors have weak ones.
Step 2: Initialize points in the low-dimensional space (usually 2D or 3D for visualization), typically from a spectral embedding of the graph or from random positions.
Step 3: Iteratively adjust the low-dimensional positions. Connected points are pulled closer together and non-connected points are pushed apart. The optimization minimizes the cross-entropy between the high-dimensional and low-dimensional neighborhood probabilities, usually with stochastic gradient descent.
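The steps above can be sketched on a toy dataset. This is a simplified illustration, not the reference implementation. Assumptions made for brevity: the bandwidth sigma is set to the mean neighbor distance (the real algorithm searches for it per point), the low-dimensional similarity is fixed to 1 / (1 + d^2), and the cross-entropy is summed over all pairs rather than optimized with sampled gradient steps. All function names are hypothetical.

```python
import math

def knn_graph(points, k):
    """Step 1: symmetric weighted kNN graph as {(i, j): weight} with i < j."""
    directed = {}
    for i, p in enumerate(points):
        nearest = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        rho = nearest[0][0]                                   # closest neighbor distance
        sigma = max(sum(d for d, _ in nearest) / k, 1e-12)    # simplified bandwidth
        for d, j in nearest:
            directed[(i, j)] = math.exp(-max(0.0, d - rho) / sigma)
    graph = {}
    for (i, j), w in directed.items():
        back = directed.get((j, i), 0.0)
        # Probabilistic union: keep the edge if either direction supports it.
        graph[(min(i, j), max(i, j))] = w + back - w * back
    return graph

def cross_entropy(graph, layout, eps=1e-12):
    """Step 3 objective: lower means the layout matches the graph better."""
    total = 0.0
    for i in range(len(layout)):
        for j in range(i + 1, len(layout)):
            p = graph.get((i, j), 0.0)
            q = 1.0 / (1.0 + math.dist(layout[i], layout[j]) ** 2)
            q = min(max(q, eps), 1.0 - eps)
            total -= p * math.log(q)                 # attraction between connected points
            total -= (1.0 - p) * math.log(1.0 - q)   # repulsion between the rest
    return total

# Two well-separated groups of four points in 3D.
points = [(0, 0, 0), (0.1, 0, 0), (0, 0.1, 0), (0, 0, 0.1),
          (5, 5, 5), (5.1, 5, 5), (5, 5.1, 5), (5, 5, 5.1)]
graph = knn_graph(points, k=3)

# Step 2 stand-ins: one layout that keeps the groups together, one that mixes them.
faithful = [(0, 0), (0.2, 0), (0, 0.2), (0.2, 0.2),
            (5, 0), (5.2, 0), (5, 0.2), (5.2, 0.2)]
scrambled = [faithful[i] for i in (0, 4, 1, 5, 2, 6, 3, 7)]

print("faithful layout loss: ", round(cross_entropy(graph, faithful), 2))
print("scrambled layout loss:", round(cross_entropy(graph, scrambled), 2))
```

The layout that keeps graph neighbors together scores a lower cross-entropy. The iterative optimization in Step 3 moves points in the direction that reduces this loss.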
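The key parameters are n_neighbors (how many neighbors each point connects to, typically 15-50; smaller values emphasize fine local detail, larger values emphasize broader structure) and min_dist (how closely points may be packed in the embedding, 0.0-1.0; lower values produce tighter clumps).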
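One way to build intuition for min_dist is to look at the target similarity curve it defines. In the reference implementation (umap-learn), as far as I understand it, the low-dimensional similarity function is fitted to a curve that stays at 1 up to min_dist and decays exponentially beyond it. The sketch below reproduces only that target curve. The function name is hypothetical, and the spread value of 1.0 is an assumed default.

```python
import math

def target_similarity(distance, min_dist=0.1, spread=1.0):
    """Desired similarity between two embedded points at a given distance."""
    if distance <= min_dist:
        return 1.0          # closer than min_dist: no extra reward for moving closer
    return math.exp(-(distance - min_dist) / spread)

for min_dist in (0.0, 0.1, 0.5):
    row = [round(target_similarity(d, min_dist), 2) for d in (0.05, 0.25, 1.0)]
    print(f"min_dist={min_dist}: similarity at d=0.05, 0.25, 1.0 -> {row}")
```

With a larger min_dist, points already count as fully similar at a greater separation, so the optimizer has no reason to pack them tighter and clusters look more spread out.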
COMPUTATIONAL COST
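UMAP is expensive compared with linear methods such as PCA: roughly O(N × k × log N) for graph construction using approximate nearest neighbor search, plus roughly O(N × k × epochs) for optimization (that is, O(N × iterations) for a fixed k). For 1 million vectors, expect minutes to hours depending on parameters, input dimensionality, and hardware.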
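A back-of-the-envelope calculation makes these terms concrete. The sketch below counts operations only, ignoring constant factors and the per-distance cost, which grows with input dimensionality. It assumes a base-2 logarithm, k = 15, 200 optimization epochs, and that each epoch touches roughly every graph edge. All of these are illustrative assumptions, not measurements.

```python
import math

def umap_operation_estimate(n_points, n_neighbors=15, n_epochs=200):
    """Rough operation counts for graph construction and layout optimization."""
    graph_ops = n_points * n_neighbors * math.log2(n_points)
    layout_ops = n_points * n_neighbors * n_epochs
    return graph_ops, layout_ops

graph_ops, layout_ops = umap_operation_estimate(1_000_000)
print(f"graph construction:  ~{graph_ops:.1e} candidate comparisons")
print(f"layout optimization: ~{layout_ops:.1e} edge updates")
```

Under these assumptions the optimization term is about ten times larger than the graph term, which is why the number of epochs is a common knob for trading quality against run time.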
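UMAP does not handle out-of-sample points as naturally as a linear projection does: the embedding is optimized for the training set, and there is no closed-form mapping for new points. Your options are to retrain on the expanded dataset, to use an implementation's approximate transform (umap-learn provides one, which places new points relative to the fixed training embedding), or to use a learned parametric extension (a neural network trained to approximate the UMAP mapping).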
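A simple heuristic for placing a new point without retraining is to average the embedded positions of its nearest training neighbors. The sketch below shows that idea under stated assumptions: it is not UMAP's own procedure, the function name is hypothetical, and inverse-distance weighting is just one reasonable choice.

```python
import math

def place_new_point(new_point, train_points, train_embedding, k=3):
    """Embed new_point as the distance-weighted mean of the low-dimensional
    coordinates of its k nearest training points."""
    nearest = sorted(
        (math.dist(new_point, p), idx) for idx, p in enumerate(train_points)
    )[:k]
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]   # closer neighbors count more
    total = sum(weights)
    n_dims = len(train_embedding[0])
    return tuple(
        sum(w * train_embedding[idx][c] for w, (_, idx) in zip(weights, nearest)) / total
        for c in range(n_dims)
    )

# Training data (3D) and an already-computed 2D embedding for it (made-up values).
train_points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 10, 10)]
train_embedding = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (8.0, 8.0)]

placed = place_new_point((0.1, 0.1, 0.0), train_points, train_embedding)
print("new point lands at:", tuple(round(c, 3) for c in placed))
```

The result is only as good as the existing embedding, and a brute-force neighbor search like this one scans the whole training set for each query. For latency-sensitive use, a parametric model or an approximate nearest neighbor index is usually the better fit.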
WHEN TO USE UMAP
Good for: Visualization (projecting embeddings to 2D for human inspection), exploratory analysis, clustering analysis, understanding embedding space structure.
Bad for: Online inference (embedding new points is slow and only approximate), preserving exact distances (UMAP distorts global structure, so distances between clusters and apparent cluster sizes in the plot are not reliable), very high-dimensional targets (best for 2-3D).