Recommendation Systems • Scalability (ANN, HNSW, FAISS)
Choosing the Right Index: Decision Framework and Capacity Planning
Selecting between HNSW, IVF+PQ, or disk-optimized designs requires evaluating scale, latency requirements, memory budget, and update patterns together. Start with memory capacity planning as the primary constraint. For 768-dimensional float32 embeddings, raw storage is approximately 3 KB per vector. At 10 million vectors you need 30 GB; at 100 million you need 300 GB. HNSW adds 2 to 5 times overhead (so 60 to 150 GB and 600 GB to 1.5 TB respectively), while IVF+PQ compresses to 64 to 128 bytes per vector (640 MB to 1.28 GB and 6.4 GB to 12.8 GB).
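As a sanity check on these figures, here is a minimal back-of-envelope sketch in Python; the HNSW overhead multipliers and PQ code sizes are the rough ranges quoted above, not measured values for any particular library.

```python
# Back-of-envelope capacity planning for a 768-dim float32 corpus.
# Overhead factors are the rough ranges cited above, not guarantees.

def index_memory_gb(num_vectors: int, dim: int = 768, dtype_bytes: int = 4) -> dict:
    """Estimate memory for raw vectors, HNSW, and IVF+PQ variants."""
    raw_gb = num_vectors * dim * dtype_bytes / 1e9           # ~3 KB/vector at 768-dim float32
    return {
        "raw_float32_gb": raw_gb,
        "hnsw_low_gb": raw_gb * 2,                           # ~2x graph overhead (optimistic)
        "hnsw_high_gb": raw_gb * 5,                          # ~5x graph overhead (pessimistic)
        "ivfpq_64B_gb": num_vectors * 64 / 1e9,              # 64-byte PQ codes
        "ivfpq_128B_gb": num_vectors * 128 / 1e9,            # 128-byte PQ codes
    }

for n in (10_000_000, 100_000_000):
    print(n, {k: round(v, 1) for k, v in index_memory_gb(n).items()})
```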
If your dataset exceeds 1 to 2 times your available RAM, IVF+PQ is the pragmatic choice. Meta uses FAISS IVF+PQ to handle billions of vectors across recommendation and search systems because the compression ratio makes it feasible. Tune nlist (the number of clusters, typically between the square root of the dataset size and 10 times that) and nprobe (10 to 200 depending on recall target) through parameter sweeps under realistic query load. Use two-stage retrieval: store PQ codes in memory, keep full-precision vectors for popular items in memory and the long tail on SSD, and re-rank the top 100 to 1,000 candidates to recover accuracy.
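A hedged FAISS sketch of this setup, with a cheap PQ scan followed by an exact re-rank; the corpus is a random stand-in, and nlist, nprobe, and the candidate pool size are illustrative rather than tuned values.

```python
import numpy as np
import faiss

d, nlist, m, nbits = 768, 512, 64, 8                  # 64 subquantizers x 8 bits = 64-byte codes
xb = np.random.rand(200_000, d).astype("float32")     # stand-in corpus
xq = np.random.rand(10, d).astype("float32")          # stand-in queries

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xb)                                       # learns coarse clusters + PQ codebooks
index.add(xb)
index.nprobe = 32                                     # clusters scanned per query: recall vs. latency knob

# Stage 1: cheap scan over compressed PQ codes for a generous candidate pool.
_, cand_ids = index.search(xq, 1000)

# Stage 2: exact re-rank of candidates against full-precision vectors
# (held in RAM here; in production the long tail would live on SSD).
for q, ids in zip(xq, cand_ids):
    ids = ids[ids >= 0]
    dists = np.linalg.norm(xb[ids] - q, axis=1)
    top100 = ids[np.argsort(dists)[:100]]
```

Raising nprobe buys recall at the cost of latency; the re-rank step recovers most of the accuracy lost to quantization.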
If your dataset fits comfortably in memory (with 30 to 50 percent headroom for growth and OS buffers) and you need sub-10-millisecond p99 latency at 0.95+ recall, HNSW is compelling. It delivers stable, predictable latencies and supports dynamic updates natively. However, plan for periodic rebuilds (weekly to monthly) to defragment the graph and restore its quality. If you need hybrid keyword-plus-vector search, integrating HNSW into a search engine stack like OpenSearch is operationally simpler than maintaining separate systems.
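For the in-memory case, a minimal HNSW sketch using FAISS's IndexHNSWFlat; M, efConstruction, and efSearch below are common starting points, not tuned values.

```python
import numpy as np
import faiss

d = 768
xb = np.random.rand(100_000, d).astype("float32")   # stand-in corpus (must fit in RAM)
xq = np.random.rand(10, d).astype("float32")

index = faiss.IndexHNSWFlat(d, 32)        # M=32 links per node: higher = better recall, more memory
index.hnsw.efConstruction = 200           # build-time beam width
index.add(xb)                             # no train step; vectors are inserted incrementally

index.hnsw.efSearch = 64                  # query-time beam width: raise until the recall target is met
D, I = index.search(xq, 10)
```

efSearch is the main query-time knob; it trades latency for recall without rebuilding the graph.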
When your dataset exceeds available RAM but you cannot afford the memory upgrade, disk-optimized designs like DiskANN or hybrid IVF_HNSW_PQ become necessary. These trade higher latency (30 to 100 milliseconds) for cost efficiency and predictable performance when data spills to disk. This is particularly attractive for RAG applications where LLM generation takes 500 milliseconds or more; spending 50 milliseconds on retrieval is negligible. Invest in NVMe SSDs, implement aggressive prefetching, cache hot items in memory, and use block-structured layouts to minimize random I/O.
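One way to sketch the hot/cold split mentioned above: popular full-precision vectors pinned in RAM, the long tail memory-mapped from NVMe and touched only during re-ranking. The file layout, the popularity-based split, and the file name are assumptions for illustration, not any particular engine's design.

```python
import numpy as np

d, n_total, n_hot = 768, 100_000, 5_000

# Cold tier: the full-precision corpus laid out row-major in one file on NVMe.
# Generated in place here only so the sketch runs end to end; in production it
# would be written offline.
cold = np.memmap("vectors_fp32.dat", dtype="float32", mode="w+", shape=(n_total, d))
cold[:] = np.random.rand(n_total, d)

# Hot tier: the most requested ids pinned in RAM (popularity stats assumed known).
hot_ids = np.arange(n_hot)                                  # placeholder for "popular items"
hot_cache = {int(i): np.array(cold[i]) for i in hot_ids}

def fetch(vec_id: int) -> np.ndarray:
    """Return a full-precision vector, hitting RAM first and SSD on a miss."""
    v = hot_cache.get(vec_id)
    return v if v is not None else np.array(cold[vec_id])   # one random read per miss

def rerank(query: np.ndarray, candidate_ids: np.ndarray, k: int = 100) -> np.ndarray:
    """Exact re-rank of ANN candidates using the tiered full-precision store."""
    vecs = np.stack([fetch(int(i)) for i in candidate_ids])
    order = np.argsort(np.linalg.norm(vecs - query, axis=1))
    return candidate_ids[order[:k]]
```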
💡 Key Takeaways
• Memory planning is the first decision gate: 100 million 768-dimensional vectors need 300 GB raw, 600 GB to 1.5 TB with HNSW, or roughly 6 to 13 GB with IVF+PQ; if the dataset exceeds 1 to 2 times RAM, choose IVF+PQ or disk
• HNSW delivers sub-10-millisecond p99 latency at 0.95+ recall when fully in memory, but fails to build beyond tens of millions of vectors on constrained hardware; best for up to 50 million vectors on 128 GB RAM with dynamic updates
• IVF+PQ scales to billions with tunable nprobe (10 to 200) for recall versus latency; pair it with two-stage retrieval (PQ candidates plus full-precision re-rank) to recover accuracy; Meta's FAISS is the production reference implementation
• Disk-optimized designs trade 30 to 100 millisecond latency for cost efficiency and predictable out-of-memory behavior; ideal when downstream latency (e.g., LLM generation at 500 milliseconds) dominates the budget
• Parameter sweeps under realistic load are mandatory: test nlist, nprobe, and efSearch combinations, measure p95 and p99 latency plus recall, and choose operating points with 20 to 30 percent headroom below SLO thresholds (see the sweep sketch below)
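A hedged sketch of such a sweep, measuring recall@10 and tail latency for a grid of nprobe values against brute-force ground truth; the grid, the 100k-vector stand-in corpus, and the query count are illustrative only.

```python
import time
import numpy as np
import faiss

d, k = 768, 10
xb = np.random.rand(100_000, d).astype("float32")    # stand-in corpus
xq = np.random.rand(200, d).astype("float32")        # stand-in query workload

# Ground truth from an exact index.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, gt = flat.search(xq, k)

# Candidate index under test (nlist ~ sqrt(N)).
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, 316, 64, 8)
index.train(xb)
index.add(xb)

for nprobe in (10, 25, 50, 100, 200):
    index.nprobe = nprobe
    latencies, hits = [], 0
    for q, truth in zip(xq, gt):
        t0 = time.perf_counter()
        _, ids = index.search(q[None, :], k)
        latencies.append((time.perf_counter() - t0) * 1000)   # ms per single query
        hits += len(set(ids[0]) & set(truth))
    recall = hits / (len(xq) * k)
    p95, p99 = np.percentile(latencies, [95, 99])
    print(f"nprobe={nprobe:4d}  recall@10={recall:.3f}  p95={p95:.2f}ms  p99={p99:.2f}ms")
```

Pick the smallest setting whose p99 sits 20 to 30 percent below the SLO while meeting the recall target.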
📌 Examples
Decision example: 80 million vector dataset, 64 GB RAM available, 15 millisecond p99 SLO. Choose IVF+PQ with nlist = 4096 and nprobe = 80; expect approximately 10 GB index size, 12 millisecond p99, and 0.93 recall. Re-rank the top 500 with cached full-precision vectors to reach 0.95 recall.
Google ScaNN offers faster single-query latency than FAISS IVF+PQ at similar recall, but with higher memory use and fewer tuning parameters; choose it when latency is critical and memory is abundant.
Capacity planning: start with nlist approximately equal to the square root of the dataset size and efSearch or nprobe at 50, measure recall and latency, and iterate. For 100M vectors, try nlist = 10000 and nprobe = 100 as a baseline.
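A tiny helper capturing this starting heuristic; the returned values are first operating points to sweep from, not recommendations.

```python
from math import isqrt

def baseline_params(num_vectors: int) -> dict:
    """sqrt(N) clusters with a probe/beam width of 50 as a first operating point."""
    nlist = isqrt(num_vectors)            # sweep up to ~10x this during tuning
    return {"nlist": nlist, "nprobe": 50, "efSearch": 50}

print(baseline_params(100_000_000))       # {'nlist': 10000, 'nprobe': 50, 'efSearch': 50}
```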