IVF and Product Quantization: Compression for Billion-Scale Search
Inverted File with Product Quantization (IVF+PQ) is the workhorse of billion-scale ANN at companies like Meta. IVF partitions the vector space into many coarse cells (typically 1,000 to 100,000 clusters) by training centroids via k-means clustering. Each vector is assigned to its nearest centroid, creating inverted lists similar to text search indices. At query time, you probe only a handful of lists (the nprobe parameter, typically 10 to 200), dramatically reducing the search space.
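To make the moving parts concrete, here is a minimal sketch of IVF indexing and probing with the FAISS Python API; the dimensionality, corpus size, and parameter values are illustrative, not a production recommendation.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, nlist = 128, 1024                       # illustrative: 128-dim vectors, 1,024 coarse cells
xb = np.random.rand(100_000, d).astype("float32")  # stand-in corpus

quantizer = faiss.IndexFlatL2(d)           # exact index over the coarse centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                            # k-means learns the nlist centroids
index.add(xb)                              # each vector joins its nearest centroid's inverted list

index.nprobe = 32                          # probe 32 of the 1,024 lists per query
D, I = index.search(xb[:5], k=10)          # distances and ids of the top-10 candidates
```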
Product Quantization compresses vectors by splitting each vector into subvectors, quantizing each subvector independently with a small codebook (typically 256 entries, i.e., one byte per subquantizer), and representing the full vector with the concatenated codes. A 768-dimensional float32 vector (3 KB) can be compressed to 64 to 128 bytes using PQ, a 24 to 48 times reduction. This allows 100 million vectors to fit in approximately 6 to 12 GB instead of 300 GB. Distance computation uses precomputed lookup tables, enabling fast approximate distance calculations directly on the compressed codes.
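The 48x figure falls straight out of the arithmetic: 768 dims x 4 bytes = 3,072 bytes raw versus a 64-byte code. The sketch below, again with illustrative sizes, encodes 768-dimensional vectors into 64-byte PQ codes using FAISS's standalone-codec API and searches them via the precomputed lookup tables.

```python
import numpy as np
import faiss

d, M, nbits = 768, 64, 8                   # 64 subquantizers x 8 bits -> 64-byte codes
xb = np.random.rand(50_000, d).astype("float32")

pq = faiss.IndexPQ(d, M, nbits)            # each 12-dim slice (768/64) gets a 256-entry codebook
pq.train(xb)

codes = pq.sa_encode(xb)                   # standalone codec: raw vectors -> compact PQ codes
print(codes.shape)                         # (50000, 64): 64 bytes/vector vs 3,072 bytes raw, 48x

pq.add(xb)
D, I = pq.search(xb[:3], k=5)              # distances computed on codes via lookup tables
```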
On the Deep1B benchmark with a 64 GB RAM host, FAISS IVF+PQ outperformed all tested methods, completing preprocessing and indexing within the memory budget while the HNSW implementations failed. FAISS achieved approximately 978 queries per second (QPS) at 0.95 recall on 1 million vectors using approximately 18 GB of RAM. On the 54-million-vector subset, FAISS IVF completed 66 percent of the parameter sweep successfully, demonstrating robustness under memory constraints.
The trade-off is that PQ introduces quantization error, losing fine-grained distance information. Without re-ranking, expect several percentage points of recall loss compared to exact search. Production systems typically use a two-stage approach: IVF+PQ retrieves the top 100 to 1,000 candidates quickly, then re-ranks them using full-precision vectors stored separately (in memory for hot vectors, on SSD for the long tail). This recovers most of the accuracy while keeping latency low.
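A minimal sketch of that two-stage pattern, with hypothetical sizes and the full-precision vectors simply held in a NumPy array (a real deployment would tier them between RAM and SSD):

```python
import numpy as np
import faiss

d, nlist, M = 256, 1024, 32
xb = np.random.rand(200_000, d).astype("float32")   # corpus (full precision, kept for stage 2)
xq = np.random.rand(10, d).astype("float32")        # queries

quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, nlist, M, 8) # 32 subquantizers x 8 bits = 32-byte codes
ivfpq.train(xb)
ivfpq.add(xb)
ivfpq.nprobe = 64

# Stage 1: cheap approximate search over compressed codes
k_candidates, k_final = 500, 10
_, cand = ivfpq.search(xq, k_candidates)

# Stage 2: exact re-ranking of the candidates against full-precision vectors
for qi in range(len(xq)):
    ids = cand[qi][cand[qi] >= 0]                   # drop -1 padding from short lists
    exact = np.linalg.norm(xb[ids] - xq[qi], axis=1)
    top = ids[np.argsort(exact)[:k_final]]          # final top-10 after re-ranking
```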
💡 Key Takeaways
• Memory efficiency is the standout advantage: 100 million 768-dimensional vectors compress from 300 GB to approximately 6 to 12 GB using Product Quantization, enabling billion-scale search on commodity hardware
• FAISS IVF+PQ achieved approximately 978 QPS at 0.95 recall on 1 million vectors with 18 GB RAM, and successfully completed indexing on 54 million vectors under 64 GB where HNSW failed
• The nprobe parameter controls recall versus latency: probing 50 lists might give 0.90 recall in 5 milliseconds, while probing 200 lists gives 0.95 recall in 15 milliseconds, tunable per query (see the sweep sketched after this list)
• The two-stage retrieval pattern is standard in production: IVF+PQ generates the top 100 to 1,000 candidates in milliseconds, then re-ranking with full-precision vectors recovers the accuracy lost to quantization
• Distribution drift is a key failure mode: coarse centroids trained on old data lead to skewed lists and degraded recall, requiring periodic retraining and reindexing (typically weekly to monthly)
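The recall/latency numbers above are workload-dependent; a sweep like the following is how you would measure them for your own corpus. This is a self-contained sketch on random data, so the absolute numbers are meaningless; only the shape of the trade-off carries over.

```python
import time
import numpy as np
import faiss

d, nlist, k = 128, 1024, 10
xb = np.random.rand(100_000, d).astype("float32")
xq = np.random.rand(100, d).astype("float32")

flat = faiss.IndexFlatL2(d)                # exact baseline supplies the ground truth
flat.add(xb)
_, gt = flat.search(xq, k)

quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf.train(xb)
ivf.add(xb)

for nprobe in (1, 10, 50, 200):
    ivf.nprobe = nprobe
    t0 = time.perf_counter()
    _, I = ivf.search(xq, k)
    ms = (time.perf_counter() - t0) / len(xq) * 1e3
    recall = np.mean([len(set(I[i]) & set(gt[i])) / k for i in range(len(xq))])
    print(f"nprobe={nprobe:4d}  recall@{k}={recall:.3f}  {ms:.2f} ms/query")
```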
📌 Examples
Meta's FAISS deployment processes billions of queries daily across Instagram visual search, Facebook content understanding, and recommendation systems using IVF+PQ with GPU acceleration
Typical IVF+PQ configuration for 100 million vectors: nlist = 8,192 clusters, nprobe = 50 to 100, Product Quantization with m = 64 subquantizers at 8 bits each (64-byte codes), delivering 0.92 to 0.95 recall at 10 to 20 milliseconds per query
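That configuration maps directly onto a FAISS index-factory string; a sketch, with the training and corpus vectors assumed to come from elsewhere:

```python
import faiss

d = 768                                         # assumed embedding dimensionality
index = faiss.index_factory(d, "IVF8192,PQ64")  # 8,192 coarse cells, 64-byte PQ codes
index.nprobe = 100                              # upper end of the 50-100 range above
# index.train(train_vecs); index.add(corpus_vecs)  # then index.search(queries, k)
```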
A production system might store PQ codes in memory (6 GB for 100M vectors), keep full-precision vectors for the top 10 million popular items in memory (30 GB), and fetch long-tail full-precision vectors from SSD for re-ranking
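One way to sketch that tiering: pin the hot vectors in a RAM dict and memory-map the long tail from SSD, letting the OS page cold vectors in on demand. The file path, sizes, and hot-set selection here are all hypothetical, and the memory-mapped file is assumed to have been written ahead of time by the indexing pipeline.

```python
import numpy as np

d, n_total, n_hot = 768, 1_000_000, 100_000         # hypothetical corpus and hot-set sizes
# Full-precision vectors in a memory-mapped file on SSD (hypothetical path)
cold = np.memmap("/ssd/vectors_fp32.dat", dtype="float32",
                 mode="r", shape=(n_total, d))
# Copy popular vectors into RAM; the first n_hot ids stand in for the popular items
hot = {i: np.array(cold[i]) for i in range(n_hot)}

def full_precision(item_id: int) -> np.ndarray:
    """Fetch a vector for re-ranking: RAM first, SSD page-in second."""
    vec = hot.get(item_id)
    return vec if vec is not None else np.asarray(cold[item_id])
```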