Index Building: Batch Construction vs Incremental Updates
BUILDING HNSW INDEXES
HNSW indexes have no separate training phase; the graph is constructed incrementally. Each vector is inserted by searching the current graph for its nearest neighbors and adding edges to them. Insertion order matters: later vectors see a more complete graph, so their edges are generally better.
Key parameter: efConstruction controls how many candidates to consider during insertion. Higher values (200-500) produce better graphs but slower builds. For 100M vectors with efConstruction=200, expect 4-6 hours on 32 CPUs.
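The insertion procedure above can be sketched in miniature. This is a toy single-layer variant (real HNSW uses multiple layers and neighbor pruning), with all parameter values and the random data chosen for illustration: a beam search over the current graph keeps ef_construction candidates, and the new node links to its M nearest.

```python
import numpy as np

rng = np.random.default_rng(0)

def beam_search(graph, vecs, query, entry, ef):
    """Greedy beam search: expand the closest frontier node until no
    candidate can improve on the worst of the ef best found so far."""
    dist = lambda i: float(np.sum((vecs[i] - query) ** 2))
    visited = {entry}
    candidates = [(dist(entry), entry)]          # frontier to expand
    best = [(dist(entry), entry)]                # running top-ef results
    while candidates:
        candidates.sort()
        d, node = candidates.pop(0)
        if len(best) >= ef and d > max(b[0] for b in best):
            break                                # no improvement possible
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                dn = dist(nb)
                candidates.append((dn, nb))
                best.append((dn, nb))
        best.sort()
        best = best[:ef]
    return best

def insert(graph, vecs, new_id, ef_construction, M):
    if not graph:                                # first vector: no edges yet
        graph[new_id] = set()
        return
    entry = next(iter(graph))                    # any existing node works as entry
    nearest = beam_search(graph, vecs, vecs[new_id], entry, ef_construction)
    graph[new_id] = set()
    for _, nb in nearest[:M]:                    # bidirectional links to M nearest
        graph[new_id].add(nb)
        graph[nb].add(new_id)

vecs = rng.standard_normal((200, 16)).astype(np.float32)
graph = {}
for i in range(len(vecs)):                       # insertion order matters:
    insert(graph, vecs, i, ef_construction=32, M=8)  # later inserts see a fuller graph
```

Raising ef_construction widens the beam, which is exactly why build quality improves at the cost of more distance computations per insert.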
Build is CPU-bound (graph traversal, distance computation). Parallelize across cores. Memory requirement: full vectors + graph structure = ~4KB per 768-dim vector.
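The ~4KB-per-vector figure can be checked with back-of-envelope arithmetic. The link count and layer estimate below are assumptions (typical M=32-style settings), not measured values:

```python
# Back-of-envelope memory estimate for one HNSW element.
dim = 768
bytes_per_float = 4                      # float32
vector_bytes = dim * bytes_per_float     # 3072 bytes for the raw vector

M = 32                                   # assumed links per node per layer
avg_layers = 2                           # most nodes occupy ~1-2 layers
bytes_per_link = 4                       # int32 neighbor id
graph_bytes = M * avg_layers * bytes_per_link  # 256 bytes of adjacency

total = vector_bytes + graph_bytes       # 3328 bytes; allocator and metadata
                                         # overhead push this toward ~4KB
```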
BUILDING IVF-PQ INDEXES
IVF-PQ requires two training phases: k-means clustering for IVF partitions, and codebook training for PQ compression.
IVF training: Run k-means on a sample of vectors (1M-10M) to find centroids. Number of centroids = sqrt(N) to 4×sqrt(N). For 100M vectors, use 10K-40K centroids. Training time: 30-60 minutes on GPU.
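A minimal sketch of the IVF training step, assuming a plain Lloyd's k-means (production builds use optimized or GPU implementations) and a toy collection size and dimensionality chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(sample, k, iters=20):
    """Plain Lloyd's k-means on the training sample."""
    centroids = sample[rng.choice(len(sample), size=k, replace=False)]
    for _ in range(iters):
        # squared L2 distances, shape (n, k): ||x||^2 - 2 x.c + ||c||^2
        d2 = (
            (sample ** 2).sum(1, keepdims=True)
            - 2.0 * sample @ centroids.T
            + (centroids ** 2).sum(1)
        )
        assign = d2.argmin(1)
        for j in range(k):
            members = sample[assign == j]
            if len(members):
                centroids[j] = members.mean(0)
    return centroids

N = 100_000                               # toy stand-in for the full collection
nlist = int(np.sqrt(N))                   # sqrt(N) rule of thumb -> 316 lists
sample = rng.standard_normal((10_000, 64)).astype(np.float32)  # training sample
centroids = kmeans(sample, nlist)
```

Training on a sample rather than the full collection is what keeps this phase to minutes rather than hours.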
PQ training: Split each vector into M subvectors and learn one codebook per subvector subspace by running k-means independently in each. Train on the same sample as IVF. Training time: 10-30 minutes.
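The per-subspace structure can be made concrete. This sketch assumes 256 centroids per subquantizer (one byte per code) and toy dimensions; everything here is illustrative, not a tuned configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(x, k, iters=15):
    """Small Lloyd's k-means for one subspace."""
    c = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        d2 = ((x[:, None, :] - c[None, :, :]) ** 2).sum(-1)
        a = d2.argmin(1)
        for j in range(k):
            if (a == j).any():
                c[j] = x[a == j].mean(0)
    return c

d, M, ksub = 64, 8, 256          # M subquantizers, 256 centroids -> 1 byte/code
dsub = d // M                    # each subquantizer sees one 8-dim slice
train = rng.standard_normal((5_000, d)).astype(np.float32)

# One independent k-means per subspace: codebooks[m] has shape (ksub, dsub).
codebooks = np.stack([
    kmeans(train[:, m * dsub:(m + 1) * dsub], ksub) for m in range(M)
])
```

Because each subspace trains independently, the M k-means runs can themselves be parallelized.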
Index population: Assign each vector to nearest centroid, compute PQ codes. This is embarrassingly parallel—process in batches across many workers. 100M vectors: 1-2 hours.
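A sketch of the per-batch population work. The centroids and codebooks here are random placeholders (they would normally come from the two training phases above), and the batch size is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

d, M, ksub, nlist = 64, 8, 256, 128
dsub = d // M
# Placeholder pre-trained artifacts (random here for a self-contained example).
centroids = rng.standard_normal((nlist, d)).astype(np.float32)
codebooks = rng.standard_normal((M, ksub, dsub)).astype(np.float32)

def encode_batch(batch):
    """Assign each vector to its nearest IVF centroid and compute its PQ code.
    Pure per-batch work, so batches fan out to independent workers."""
    d2 = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    list_ids = d2.argmin(1)                       # (n,) IVF bucket per vector
    codes = np.empty((len(batch), M), dtype=np.uint8)
    for m in range(M):
        sub = batch[:, m * dsub:(m + 1) * dsub]
        sd2 = ((sub[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(-1)
        codes[:, m] = sd2.argmin(1)               # nearest sub-centroid id
    return list_ids, codes

vectors = rng.standard_normal((2_000, d)).astype(np.float32)
list_ids, codes = encode_batch(vectors)
```

Nothing in encode_batch touches shared mutable state, which is what makes the phase embarrassingly parallel.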
VALIDATION BEFORE DEPLOYMENT
Before serving, measure recall@K on a held-out query set with ground truth. Sample 10K queries, compute exact nearest neighbors, compare to ANN results. Target: recall@10 ≥ 0.95.
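The recall@K computation itself is a small amount of code. This sketch uses brute force for the ground truth on a toy dataset; in practice the approximate ids would come from the index under test:

```python
import numpy as np

rng = np.random.default_rng(0)

def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of true top-k neighbors recovered, averaged over queries."""
    hits = sum(
        len(set(a[:k]) & set(e[:k])) for a, e in zip(approx_ids, exact_ids)
    )
    return hits / (len(exact_ids) * k)

base = rng.standard_normal((5_000, 32)).astype(np.float32)
queries = rng.standard_normal((100, 32)).astype(np.float32)

# Exact ground truth by brute force (the expensive, offline step).
d2 = ((queries[:, None, :] - base[None, :, :]) ** 2).sum(-1)
exact = d2.argsort(1)[:, :10]

# A perfect "ANN" result scores 1.0; substitute real index output here.
assert recall_at_k(exact, exact) == 1.0
```

Ground truth is computed once and cached, so the same query set can gate every subsequent index rebuild.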
If recall is low, diagnose the cause: are the centroids well-distributed? Is PQ quantization error too high? Typical remedies: increase nprobe at query time, retrain centroids on a larger or more representative sample, increase the number of PQ subquantizers M, or use more bits per code.
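Quantization error can be measured directly by encoding and decoding vectors and comparing against the originals. The codebooks here are random placeholders (trained codebooks would give a much lower error), so only the measurement procedure is the point:

```python
import numpy as np

rng = np.random.default_rng(0)

d, M, ksub = 64, 8, 256
dsub = d // M
codebooks = rng.standard_normal((M, ksub, dsub)).astype(np.float32)
x = rng.standard_normal((1_000, d)).astype(np.float32)

# Encode then decode each subvector; the gap is the quantization error.
recon = np.empty_like(x)
for m in range(M):
    sub = x[:, m * dsub:(m + 1) * dsub]
    d2 = ((sub[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(-1)
    recon[:, m * dsub:(m + 1) * dsub] = codebooks[m][d2.argmin(1)]

mse = float(((x - recon) ** 2).mean())   # track this across M / bit-width choices
```

Tracking this MSE while varying M or bits per code shows directly whether quantization, rather than the IVF partitioning, is the recall bottleneck.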