Memory vs Disk Trade-offs: When Data Exceeds RAM
The performance cliff when a vector index exceeds available memory is one of the harshest realities of production ANN systems. In-memory algorithms like FAISS IVF and HNSW deliver their advertised sub-10 ms latencies only when the entire index and its vectors fit in RAM. Once data spills to disk, random I/O dominates performance, and latencies can degrade by 10-100× unless the system is explicitly designed for disk residency.
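To make the in-memory baseline concrete, here is a minimal sketch of a RAM-resident FAISS IVF index; the dimensions, list count, and nprobe value below are illustrative assumptions, not tuned settings:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, nb, nq = 128, 100_000, 10                 # illustrative sizes, not a benchmark
rng = np.random.default_rng(0)
xb = rng.standard_normal((nb, d)).astype("float32")  # database vectors
xq = rng.standard_normal((nq, d)).astype("float32")  # query vectors

nlist = 1024                                  # number of IVF partitions (assumption)
quantizer = faiss.IndexFlatL2(d)              # coarse quantizer holding centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                               # learn partition centroids
index.add(xb)                                 # index and vectors both live in RAM

index.nprobe = 8                              # lists scanned per query (recall/latency knob)
D, I = index.search(xq, 10)                   # fast only while everything is RAM-resident
```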
HNSW is particularly hostile to disk access because graph traversal is pointer chasing with sequential dependencies: each hop in the graph is a random read, and the next hop cannot be issued until the current node has been fetched. Without sophisticated prefetching or block-structured layouts, tail latencies balloon from ~10 ms to 200+ ms on SSD. IVF methods fare somewhat better because list scans are largely sequential, but fetching multiple lists still incurs random seeks.
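The sequential dependency is easiest to see in code. The following is a simplified, hedged sketch of best-first graph search (not HNSW's actual implementation): every hop blocks on fetching the current node's adjacency list, which in RAM is a pointer dereference but on disk becomes one blocking random read per hop.

```python
import heapq

def greedy_search(fetch_neighbors, dist_to_query, entry_id):
    """Simplified best-first graph search (illustrative, not real HNSW).

    fetch_neighbors(node_id) -> list of neighbor ids. In RAM this is a
    pointer dereference; on disk it is one blocking random read, and the
    next hop cannot start until it returns -- the serial dependency that
    makes naive disk-resident HNSW so slow.
    """
    best_dist, best_id = dist_to_query(entry_id), entry_id
    visited = {entry_id}
    frontier = [(best_dist, entry_id)]            # min-heap ordered by distance
    while frontier:
        dist, node = heapq.heappop(frontier)
        if dist > best_dist:
            break                                  # frontier is all farther: stop
        for nbr in fetch_neighbors(node):          # <-- blocking fetch per hop
            if nbr not in visited:
                visited.add(nbr)
                d = dist_to_query(nbr)
                if d < best_dist:
                    best_dist, best_id = d, nbr
                heapq.heappush(frontier, (d, nbr))
    return best_id, best_dist
```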
Disk-optimized designs like DiskANN or hybrid IVF_HNSW_PQ architectures address this by grouping neighbors into contiguous blocks, aggressively compressing vectors with quantization, and prefetching candidate blocks in parallel. Benchmarks show a disk-optimized HNSW hybrid achieving ~178 QPS at 0.95 recall on 1 million vectors in memory, and only a ~20% QPS drop (~142 QPS) when scaling to 3 million vectors out of memory. In contrast, naive in-memory algorithms either fail entirely or lose 80-90% of their throughput when paging to disk.
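The compression half of that recipe is what keeps the in-memory footprint small. A hedged sketch using FAISS product quantization, with an assumed 16-byte code size chosen purely for illustration:

```python
import numpy as np
import faiss

d, nb = 128, 1_000_000
xb = np.random.default_rng(1).standard_normal((nb, d)).astype("float32")

nlist, m, nbits = 4096, 16, 8          # 16 sub-quantizers x 8 bits = 16 B per vector
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xb[:200_000])              # a training subsample keeps this quick
index.add(xb)

raw_bytes = nb * d * 4                 # float32: 512 MiB of full-precision vectors
pq_bytes = nb * m                      # PQ codes: ~15 MiB, a ~32x reduction
print(f"raw {raw_bytes / 2**20:.0f} MiB -> PQ {pq_bytes / 2**20:.0f} MiB")
```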
The decision point is cost versus latency. Keeping 500 GB of vectors in memory can cost thousands of dollars per month in cloud RAM, while SSD storage costs a small fraction of that. If your application's latency budget is 50-100 ms (common when downstream LLM generation takes 500 ms or more), a disk-resident index answering in 30-50 ms is perfectly acceptable and far more cost-efficient. For real-time serving under 10 ms, however, in-memory is non-negotiable.
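A back-of-envelope version of that trade-off; the per-GB prices below are rough assumptions for illustration, not quotes from any provider:

```python
# Monthly cost to hold 500 GB of vectors (assumed prices; check your cloud's rates)
gb = 500
ram_dollars_per_gb_month = 5.00   # assumption: RAM priced via large-memory instances
ssd_dollars_per_gb_month = 0.10   # assumption: NVMe/SSD block storage

print(f"RAM-resident: ${gb * ram_dollars_per_gb_month:>8,.0f} / month")  # ~$2,500
print(f"SSD-resident: ${gb * ssd_dollars_per_gb_month:>8,.0f} / month")  # ~$50

# Budget check: with 500 ms of downstream LLM generation, a 30-50 ms
# disk-resident retrieval consumes under 10% of the end-to-end latency.
```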
💡 Key Takeaways
• In-memory FAISS achieved ~978 QPS at 0.95 recall, while the disk-optimized HNSW hybrid delivered ~178 QPS in memory and degraded only to ~142 QPS (a 20% drop) when vectors exceeded RAM, showing predictable out-of-memory behavior
• HNSW graph traversal is especially disk-hostile due to pointer chasing and sequential dependencies, requiring sophisticated prefetching or block layouts to avoid 10-100× latency degradation on SSD
• The cost trade-off is significant: storing 500 GB in cloud RAM can cost thousands per month versus tens of dollars for SSD storage, making disk-resident designs attractive when the latency budget allows 50-100 ms
• Hybrid strategies work well: keep hot items or PQ codes in memory (small footprint) and fetch full-precision vectors from SSD only for final re-ranking of the top candidates, amortizing I/O cost (see the sketch after this list)
• Disk access patterns require careful design: sequential list scans with prefetching perform far better than random vector fetches; block-structured layouts and caching of popular items are essential
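A hedged sketch of that hybrid pattern: compact PQ codes stay in RAM for candidate generation, while full-precision vectors are read from disk only to re-rank the short list. Here np.memmap stands in for the SSD fetch; the index types and sizes are illustrative assumptions.

```python
import numpy as np
import faiss

d, nb = 128, 100_000
xb = np.random.default_rng(2).standard_normal((nb, d)).astype("float32")

# Full-precision vectors live on SSD; a memmap simulates the on-disk file.
xb.tofile("vectors.f32")
on_disk = np.memmap("vectors.f32", dtype="float32", shape=(nb, d))

# Stage 1: a small in-memory PQ index over-fetches a candidate list.
pq = faiss.IndexPQ(d, 16, 8)          # 16 B of codes per vector held in RAM
pq.train(xb)
pq.add(xb)

xq = np.random.default_rng(3).standard_normal((1, d)).astype("float32")
_, cand = pq.search(xq, 100)          # approximate top-100 from compressed codes

# Stage 2: fetch only those 100 full-precision rows and re-rank exactly.
full = np.asarray(on_disk[cand[0]])   # ~100 random reads, amortized over the query
exact = np.linalg.norm(full - xq, axis=1)
top10 = cand[0][np.argsort(exact)[:10]]
```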
📌 Examples
A recommendation system serving 1,000 QPS with a 200 ms total latency budget (including model inference) can use disk-resident ANN at 30-50 ms retrieval, saving ~80% on infrastructure cost versus an all-in-memory deployment
The DiskANN design groups each node and its neighbor list into 4 KB SSD pages and prefetches candidate blocks in parallel, achieving ~10-20 ms p99 latency on NVMe for billion-scale indices (a page-sizing sketch follows below)
A typical failure case: a naive HNSW implementation paging to disk sees latencies jump from 8 ms to 200+ ms as the OS fetches scattered graph nodes from SSD in random order
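For intuition on the page-layout arithmetic in the DiskANN example above, a back-of-envelope sketch; the field sizes are illustrative assumptions, not DiskANN's exact on-disk format:

```python
# What fits in one 4 KiB SSD page? (assumed layout, for intuition only)
page_bytes = 4096
dim = 128
vec_bytes = dim * 4            # full-precision float32 vector stored with the node
degree = 32                    # assumed graph out-degree
nbr_bytes = degree * 4         # neighbor ids as 4-byte integers

node_bytes = vec_bytes + nbr_bytes                  # 512 + 128 = 640 B per node
print(page_bytes // node_bytes, "nodes per page")   # -> 6 nodes per 4 KiB page
# One page read returns a node's vector AND its adjacency list together,
# so each hop costs a single random read instead of several scattered ones.
```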