Learn→Embeddings & Similarity Search→Index Management (Building, Updating, Sharding)→2 of 6

Embeddings & Similarity Search • Index Management (Building, Updating, Sharding)Medium⏱️ ~2 min

Index Building: Batch Construction vs Incremental Updates

BUILDING HNSW INDEXES
HNSW indexes require training-time construction of the graph structure. Each vector is inserted by finding its nearest neighbors in the current graph and adding edges. Order of insertion matters—later vectors see a more complete graph.
Key parameter: efConstruction controls how many candidates to consider during insertion. Higher values (200-500) produce better graphs but slower builds. For 100M vectors with efConstruction=200, expect 4-6 hours on 32 CPUs.
Build is CPU-bound (graph traversal, distance computation). Parallelize across cores. Memory requirement: full vectors + graph structure = ~4KB per 768-dim vector.
BUILDING IVF-PQ INDEXES
IVF-PQ requires two training phases: k-means clustering for IVF partitions, and codebook training for PQ compression.
IVF training: Run k-means on a sample of vectors (1M-10M) to find centroids. Number of centroids = sqrt(N) to 4×sqrt(N). For 100M vectors, use 10K-40K centroids. Training time: 30-60 minutes on GPU.
PQ training: Learn codebooks by running k-means on each subvector dimension. Train on same sample as IVF. Training time: 10-30 minutes.
Index population: Assign each vector to nearest centroid, compute PQ codes. This is embarrassingly parallel—process in batches across many workers. 100M vectors: 1-2 hours.
VALIDATION BEFORE DEPLOYMENT
Before serving, measure recall@K on a held-out query set with ground truth. Sample 10K queries, compute exact nearest neighbors, compare to ANN results. Target: recall@10 ≥ 0.95.
If recall is low, diagnose: Are centroids well-distributed? Is PQ quantization error too high? Increase nlist, reduce M, or use higher-precision codes.
✅ Best Practice: Always train on a representative sample of production data. Index trained on old data may have misaligned centroids for new content.

💡 Key Takeaways

✓HNSW: efConstruction controls graph quality; 200-500 typical; 4-6 hours for 100M vectors

✓IVF-PQ: k-means for centroids + codebook training; sqrt(N) to 4×sqrt(N) centroids

✓Validate with recall@K on held-out queries before deployment; target 0.95+

📌 Interview Tips

1Interview Tip: Explain the two-phase IVF-PQ build—k-means for partitions, then PQ codebook training.

2Interview Tip: Describe validation—10K query sample, compute exact neighbors, compare to ANN results.

← Back to Index Management (Building, Updating, Sharding) Overview