Embeddings & Similarity SearchIndex Management (Building, Updating, Sharding)Medium⏱️ ~2 min

Index Building: Batch Construction vs Incremental Updates

BUILDING HNSW INDEXES

HNSW indexes require training-time construction of the graph structure. Each vector is inserted by finding its nearest neighbors in the current graph and adding edges. Order of insertion matters—later vectors see a more complete graph.

Key parameter: efConstruction controls how many candidates to consider during insertion. Higher values (200-500) produce better graphs but slower builds. For 100M vectors with efConstruction=200, expect 4-6 hours on 32 CPUs.

Build is CPU-bound (graph traversal, distance computation). Parallelize across cores. Memory requirement: full vectors + graph structure = ~4KB per 768-dim vector.

BUILDING IVF-PQ INDEXES

IVF-PQ requires two training phases: k-means clustering for IVF partitions, and codebook training for PQ compression.

IVF training: Run k-means on a sample of vectors (1M-10M) to find centroids. Number of centroids = sqrt(N) to 4×sqrt(N). For 100M vectors, use 10K-40K centroids. Training time: 30-60 minutes on GPU.

PQ training: Learn codebooks by running k-means on each subvector dimension. Train on same sample as IVF. Training time: 10-30 minutes.

Index population: Assign each vector to nearest centroid, compute PQ codes. This is embarrassingly parallel—process in batches across many workers. 100M vectors: 1-2 hours.

VALIDATION BEFORE DEPLOYMENT

Before serving, measure recall@K on a held-out query set with ground truth. Sample 10K queries, compute exact nearest neighbors, compare to ANN results. Target: recall@10 ≥ 0.95.

If recall is low, diagnose: Are centroids well-distributed? Is PQ quantization error too high? Increase nlist, reduce M, or use higher-precision codes.

✅ Best Practice: Always train on a representative sample of production data. Index trained on old data may have misaligned centroids for new content.
💡 Key Takeaways
HNSW: efConstruction controls graph quality; 200-500 typical; 4-6 hours for 100M vectors
IVF-PQ: k-means for centroids + codebook training; sqrt(N) to 4×sqrt(N) centroids
Validate with recall@K on held-out queries before deployment; target 0.95+
📌 Interview Tips
1Interview Tip: Explain the two-phase IVF-PQ build—k-means for partitions, then PQ codebook training.
2Interview Tip: Describe validation—10K query sample, compute exact neighbors, compare to ANN results.
← Back to Index Management (Building, Updating, Sharding) Overview