IVF and Product Quantization: Compression for Billion-Scale Search
Inverted File with Product Quantization (IVF+PQ) is the workhorse of billion-scale ANN at companies like Meta. IVF partitions the vector space into many coarse cells (typically 1,000 to 100,000 clusters) by training centroids via k-means clustering. Each vector is assigned to its nearest centroid, creating inverted lists similar to text search indices. At query time, you probe only a handful of lists (the nprobe parameter, typically 10 to 200), dramatically reducing the search space.
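To make the moving parts concrete, here is a minimal sketch of IVF indexing and probing with the FAISS Python API; the dimensionality, corpus size, and parameter values are illustrative, not a production recommendation.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, nlist = 128, 1024                       # illustrative: 128-dim vectors, 1,024 coarse cells
xb = np.random.rand(100_000, d).astype("float32")  # stand-in corpus

quantizer = faiss.IndexFlatL2(d)           # exact index over the coarse centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                            # k-means learns the nlist centroids
index.add(xb)                              # each vector joins its nearest centroid's inverted list

index.nprobe = 32                          # probe 32 of the 1,024 lists per query
D, I = index.search(xb[:5], k=10)          # distances and ids of the top-10 candidates
```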
Product Quantization compresses vectors by splitting each vector into subvectors, quantizing each subvector independently with a small codebook (typically 256 entries, i.e., one byte per subquantizer), and representing the full vector with the concatenated codes. A 768-dimensional float32 vector (3 KB) can be compressed to 64 to 128 bytes using PQ, a 24 to 48 times reduction. This allows 100 million vectors to fit in approximately 6 to 12 GB instead of 300 GB. Distance computation uses precomputed lookup tables, enabling fast approximate distance calculations directly on the compressed codes.
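The 48x figure falls straight out of the arithmetic: 768 dims x 4 bytes = 3,072 bytes raw versus a 64-byte code. The sketch below, again with illustrative sizes, encodes 768-dimensional vectors into 64-byte PQ codes using FAISS's standalone-codec API and searches them via the precomputed lookup tables.

```python
import numpy as np
import faiss

d, M, nbits = 768, 64, 8                   # 64 subquantizers x 8 bits -> 64-byte codes
xb = np.random.rand(50_000, d).astype("float32")

pq = faiss.IndexPQ(d, M, nbits)            # each 12-dim slice (768/64) gets a 256-entry codebook
pq.train(xb)

codes = pq.sa_encode(xb)                   # standalone codec: raw vectors -> compact PQ codes
print(codes.shape)                         # (50000, 64): 64 bytes/vector vs 3,072 bytes raw, 48x

pq.add(xb)
D, I = pq.search(xb[:3], k=5)              # distances computed on codes via lookup tables
```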
On the Deep1B benchmark with a 64 GB RAM host, FAISS IVF+PQ outperformed all tested methods, completing preprocessing and indexing within the memory budget while the HNSW implementations failed. FAISS achieved approximately 978 queries per second (QPS) at 0.95 recall on 1 million vectors using approximately 18 GB of RAM. On the 54-million-vector subset, FAISS IVF completed 66 percent of the parameter sweep successfully, demonstrating robustness under memory constraints.
The trade-off is that PQ introduces quantization error, losing fine-grained distance information. Without re-ranking, expect several percentage points of recall loss compared to exact search. Production systems typically use a two-stage approach: IVF+PQ retrieves the top 100 to 1,000 candidates quickly, then re-ranks them using full-precision vectors stored separately (in memory for hot vectors, on SSD for the long tail). This recovers most of the accuracy while keeping latency low.
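A minimal sketch of that two-stage pattern, with hypothetical sizes and the full-precision vectors simply held in a NumPy array (a real deployment would tier them between RAM and SSD):

```python
import numpy as np
import faiss

d, nlist, M = 256, 1024, 32
xb = np.random.rand(200_000, d).astype("float32")   # corpus (full precision, kept for stage 2)
xq = np.random.rand(10, d).astype("float32")        # queries

quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, nlist, M, 8) # 32 subquantizers x 8 bits = 32-byte codes
ivfpq.train(xb)
ivfpq.add(xb)
ivfpq.nprobe = 64

# Stage 1: cheap approximate search over compressed codes
k_candidates, k_final = 500, 10
_, cand = ivfpq.search(xq, k_candidates)

# Stage 2: exact re-ranking of the candidates against full-precision vectors
for qi in range(len(xq)):
    ids = cand[qi][cand[qi] >= 0]                   # drop -1 padding from short lists
    exact = np.linalg.norm(xb[ids] - xq[qi], axis=1)
    top = ids[np.argsort(exact)[:k_final]]          # final top-10 after re-ranking
```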
💡 Key Takeaways
• Memory efficiency is the standout advantage: 100 million 768-dimensional vectors compress from 300 GB to approximately 6 to 12 GB using Product Quantization, enabling billion-scale search on commodity hardware
• FAISS IVF+PQ achieved approximately 978 QPS at 0.95 recall on 1 million vectors with 18 GB RAM, and successfully completed indexing on 54 million vectors under 64 GB where HNSW failed
• The nprobe parameter controls recall versus latency: probing 50 lists might give 0.90 recall in 5 milliseconds, while probing 200 lists gives 0.95 recall in 15 milliseconds, tunable per query (see the sweep sketched after this list)
• The two-stage retrieval pattern is standard in production: IVF+PQ generates the top 100 to 1,000 candidates in milliseconds, then re-ranking with full-precision vectors recovers the accuracy lost to quantization
• Distribution drift is a key failure mode: coarse centroids trained on old data lead to skewed lists and degraded recall, requiring periodic retraining and reindexing (typically weekly to monthly)
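The recall/latency numbers above are workload-dependent; a sweep like the following is how you would measure them for your own corpus. This is a self-contained sketch on random data, so the absolute numbers are meaningless; only the shape of the trade-off carries over.

```python
import time
import numpy as np
import faiss

d, nlist, k = 128, 1024, 10
xb = np.random.rand(100_000, d).astype("float32")
xq = np.random.rand(100, d).astype("float32")

flat = faiss.IndexFlatL2(d)                # exact baseline supplies the ground truth
flat.add(xb)
_, gt = flat.search(xq, k)

quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf.train(xb)
ivf.add(xb)

for nprobe in (1, 10, 50, 200):
    ivf.nprobe = nprobe
    t0 = time.perf_counter()
    _, I = ivf.search(xq, k)
    ms = (time.perf_counter() - t0) / len(xq) * 1e3
    recall = np.mean([len(set(I[i]) & set(gt[i])) / k for i in range(len(xq))])
    print(f"nprobe={nprobe:4d}  recall@{k}={recall:.3f}  {ms:.2f} ms/query")
```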
📌 Examples
Meta's FAISS deployment processes billions of queries daily across Instagram visual search, Facebook content understanding, and recommendation systems using IVF+PQ with GPU acceleration
Typical IVF+PQ configuration for 100 million vectors: nlist = 8,192 clusters, nprobe = 50 to 100, Product Quantization with m = 64 subquantizers at 8 bits each (64-byte codes), delivering 0.92 to 0.95 recall at 10 to 20 milliseconds per query
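That configuration maps directly onto a FAISS index-factory string; a sketch, with the training and corpus vectors assumed to come from elsewhere:

```python
import faiss

d = 768                                         # assumed embedding dimensionality
index = faiss.index_factory(d, "IVF8192,PQ64")  # 8,192 coarse cells, 64-byte PQ codes
index.nprobe = 100                              # upper end of the 50-100 range above
# index.train(train_vecs); index.add(corpus_vecs)  # then index.search(queries, k)
```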
A production system might store PQ codes in memory (6 GB for 100M vectors), keep full-precision vectors for the top 10 million popular items in memory (30 GB), and fetch long-tail full-precision vectors from SSD for re-ranking
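One way to sketch that tiering: pin the hot vectors in a RAM dict and memory-map the long tail from SSD, letting the OS page cold vectors in on demand. The file path, sizes, and hot-set selection here are all hypothetical, and the memory-mapped file is assumed to have been written ahead of time by the indexing pipeline.

```python
import numpy as np

d, n_total, n_hot = 768, 1_000_000, 100_000         # hypothetical corpus and hot-set sizes
# Full-precision vectors in a memory-mapped file on SSD (hypothetical path)
cold = np.memmap("/ssd/vectors_fp32.dat", dtype="float32",
                 mode="r", shape=(n_total, d))
# Copy popular vectors into RAM; the first n_hot ids stand in for the popular items
hot = {i: np.array(cold[i]) for i in range(n_hot)}

def full_precision(item_id: int) -> np.ndarray:
    """Fetch a vector for re-ranking: RAM first, SSD page-in second."""
    vec = hot.get(item_id)
    return vec if vec is not None else np.asarray(cold[item_id])
```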