
FAISS IVF-PQ: Inverted File with Product Quantization

FAISS (Facebook AI Similarity Search) Inverted File with Product Quantization (IVF-PQ) combines coarse partitioning with aggressive compression. The algorithm first clusters vectors into coarse centroids using k-means. At query time, it probes a small subset of the nearest centroids, then computes distances only within those partitions. Product Quantization further compresses vectors by splitting their dimensions into subspaces, quantizing each subspace independently with learned codebooks, and storing short codes instead of full floats.

For 1 billion vectors with 768 dimensions, you might use 262144 coarse centroids and 16 subquantizers with 8-bit codes. This compresses each vector from 3072 bytes (768 floats × 4 bytes) to just 16 bytes, a 192× reduction, so the total dataset drops from roughly 3 TB to roughly 16 GB. At serving time, you set nprobe to 16 or 32, meaning you search only 16 of the 262144 partitions. Distance computation uses lookup tables built from the query and the codebooks, making it extremely fast even on compressed representations.

The tradeoff is quantization error and training complexity. PQ introduces approximation into distance computation, and tuning the number of coarse centroids, subquantizers, and nprobe requires experimentation. If centroids are poorly balanced, some inverted lists become very long, and queries hitting those lists spike in latency. On GPU, FAISS can process 5000 to 10000 queries per second at 95 to 97 percent recall with batch sizes of 32 to 256. Pinterest reported moving to GPU FAISS to handle tens of billions of image embeddings while reducing infrastructure cost. The method scales to billions of vectors but requires periodic retraining as data drifts.
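As a concrete illustration, here is a minimal sketch of building and querying an IVF-PQ index with the FAISS Python API on synthetic data. The scaled-down parameters (nlist=1024 rather than 262144, 100000 database vectors rather than 1 billion) are placeholder choices for a runnable example; a production deployment would use the larger values discussed above and typically shard across GPUs.

```python
import faiss
import numpy as np

d = 768        # embedding dimensionality
nlist = 1024   # coarse centroids (262144 at billion scale)
m = 16         # PQ subquantizers -> 16 bytes per vector
nbits = 8      # bits per PQ code (256 codewords per subspace)

# Synthetic stand-ins for real embeddings (scaled down to run locally)
rng = np.random.default_rng(0)
xt = rng.standard_normal((50_000, d)).astype("float32")   # training sample
xb = rng.standard_normal((100_000, d)).astype("float32")  # database vectors
xq = rng.standard_normal((32, d)).astype("float32")       # one query batch

# The coarse quantizer assigns vectors to inverted lists;
# the IVF-PQ index stores only the 16-byte PQ codes per vector.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xt)   # k-means for coarse centroids + PQ codebook learning
index.add(xb)

# nprobe trades recall for speed: search only this many of the nlist partitions
index.nprobe = 32

distances, ids = index.search(xq, 10)
print(ids.shape)  # (32, 10): top-10 approximate neighbors per query
```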
💡 Key Takeaways
IVF-PQ compresses 768-dimensional float32 vectors from 3072 bytes to 16 bytes using 16 subquantizers with 8-bit codes, achieving 192× compression
At 1 billion vectors, total memory drops from 3 TB to 16 GB plus index overhead, enabling single-GPU deployment or small CPU clusters
Coarse quantization with 65536 to 262144 centroids partitions the space; the nprobe parameter controls the recall-versus-speed tradeoff, with typical values of 16 to 64
GPU FAISS achieves 5000 to 10000 queries per second at 95 to 97 percent recall with batch sizes of 32 to 256 and p99 latency under 15 milliseconds
Training cost: clustering 1 billion vectors into 262144 centroids and learning 16 PQ codebooks takes hours on multi-GPU setups, and periodic retraining is needed as data drifts
List imbalance causes latency spikes: if hot centroids accumulate 10× more vectors, queries probing those lists slow down; monitor average list length and p99 latency separately (a sketch of such a check follows this list)
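Because imbalance is easiest to catch offline before it shows up in tail latency, a quick audit of inverted-list sizes can help. This is a minimal sketch assuming an already-trained and populated FAISS IVF index; `list_length_stats` and the 10× red-flag threshold are illustrative choices, not FAISS conventions, though `index.invlists.list_size` is part of the FAISS API.

```python
import numpy as np

def list_length_stats(index):
    """Illustrative helper: summarize inverted-list sizes of a trained
    FAISS IVF index to spot hot lists before they hurt p99 latency."""
    sizes = np.array([index.invlists.list_size(i) for i in range(index.nlist)])
    return {
        "mean_len": float(sizes.mean()),
        "p99_len": float(np.percentile(sizes, 99)),
        "max_len": int(sizes.max()),
        # Ratio of the largest list to the mean; values around 10 match
        # the "10x more vectors" warning sign described above.
        "imbalance": float(sizes.max() / max(sizes.mean(), 1.0)),
    }
```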
📌 Examples
Pinterest visual search: 30 billion image embeddings, FAISS IVF-PQ on A100 GPUs, 65536 coarse centroids, nprobe=32, 10000 QPS at 96 percent recall, 5 milliseconds per batch
Meta content deduplication: FAISS IVF-PQ for near-duplicate detection on billions of posts; the compressed index fits in memory, probes 16 lists, and achieves 95 percent recall in under 10 milliseconds
Retraining cadence: weekly retrain on 500 million new vectors, using a sampled subset for centroid learning; takes 6 hours on 8 V100 GPUs, with an atomic swap to the new index and dual writes during cutover (sketched below)
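The retrain-and-cutover pattern from the last example might look like the sketch below. The function name, sampling rate, and on-disk swap are assumptions for illustration (`faiss.write_index` and `os.replace` are real APIs); the dual-write step happens in the serving layer and is only noted in comments.

```python
import os
import faiss
import numpy as np

def retrain_and_publish(vectors, path="index.faiss", d=768,
                        nlist=262144, m=16, nbits=8, sample_size=50_000_000):
    """Hypothetical weekly retraining job: learn fresh centroids and PQ
    codebooks on a sample, re-encode the full corpus, swap atomically."""
    # Centroid/codebook learning only needs a representative sample,
    # not the full billion-vector corpus.
    pick = np.random.choice(len(vectors), min(sample_size, len(vectors)),
                            replace=False)
    quantizer = faiss.IndexFlatL2(d)
    new_index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
    new_index.train(vectors[pick])  # the multi-hour, multi-GPU step at scale
    new_index.add(vectors)          # re-encode everything with new codebooks

    # Write to a temp file, then rename: os.replace is atomic on POSIX, so
    # readers load either the old index or the new one, never a partial file.
    # During cutover, the serving layer dual-writes fresh vectors to both
    # old and new indexes until the swap is validated.
    faiss.write_index(new_index, path + ".tmp")
    os.replace(path + ".tmp", path)
```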