Recommendation Systems • Scalability (ANN, HNSW, FAISS)
Choosing the Right Index: Decision Framework and Capacity Planning
Selecting between HNSW, IVF+PQ, or disk-optimized designs requires evaluating scale, latency requirements, memory budget, and update patterns together. Start with memory capacity planning as the primary constraint. For 768-dimensional float32 embeddings, raw storage is approximately 3 KB per vector. At 10 million vectors you need 30 GB; at 100 million you need 300 GB. HNSW adds 2 to 5 times overhead (so 60 to 150 GB and 600 GB to 1.5 TB respectively), while IVF+PQ compresses to 64 to 128 bytes per vector (640 MB to 1.28 GB and 6.4 GB to 12.8 GB).
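As a sanity check on these figures, here is a minimal back-of-envelope sketch in Python; the HNSW overhead multipliers and PQ code sizes are the rough ranges quoted above, not measured values for any particular library.

```python
# Back-of-envelope capacity planning for a 768-dim float32 corpus.
# Overhead factors are the rough ranges cited above, not guarantees.

def index_memory_gb(num_vectors: int, dim: int = 768, dtype_bytes: int = 4) -> dict:
    """Estimate memory for raw vectors, HNSW, and IVF+PQ variants."""
    raw_gb = num_vectors * dim * dtype_bytes / 1e9           # ~3 KB/vector at 768-dim float32
    return {
        "raw_float32_gb": raw_gb,
        "hnsw_low_gb": raw_gb * 2,                           # ~2x graph overhead (optimistic)
        "hnsw_high_gb": raw_gb * 5,                          # ~5x graph overhead (pessimistic)
        "ivfpq_64B_gb": num_vectors * 64 / 1e9,              # 64-byte PQ codes
        "ivfpq_128B_gb": num_vectors * 128 / 1e9,            # 128-byte PQ codes
    }

for n in (10_000_000, 100_000_000):
    print(n, {k: round(v, 1) for k, v in index_memory_gb(n).items()})
```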
If your dataset exceeds 1 to 2 times your available RAM, IVF+PQ is the pragmatic choice. Meta uses FAISS IVF+PQ to handle billions of vectors across recommendation and search systems because the compression ratio makes it feasible. Tune nlist (the number of clusters, typically between the square root of the dataset size and 10 times that) and nprobe (10 to 200 depending on recall target) through parameter sweeps under realistic query load. Use two-stage retrieval: store PQ codes in memory, keep full-precision vectors for popular items in memory and the long tail on SSD, and re-rank the top 100 to 1,000 candidates to recover accuracy.
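A hedged FAISS sketch of this setup, with a cheap PQ scan followed by an exact re-rank; the corpus is a random stand-in, and nlist, nprobe, and the candidate pool size are illustrative rather than tuned values.

```python
import numpy as np
import faiss

d, nlist, m, nbits = 768, 512, 64, 8                  # 64 subquantizers x 8 bits = 64-byte codes
xb = np.random.rand(200_000, d).astype("float32")     # stand-in corpus
xq = np.random.rand(10, d).astype("float32")          # stand-in queries

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xb)                                       # learns coarse clusters + PQ codebooks
index.add(xb)
index.nprobe = 32                                     # clusters scanned per query: recall vs. latency knob

# Stage 1: cheap scan over compressed PQ codes for a generous candidate pool.
_, cand_ids = index.search(xq, 1000)

# Stage 2: exact re-rank of candidates against full-precision vectors
# (held in RAM here; in production the long tail would live on SSD).
for q, ids in zip(xq, cand_ids):
    ids = ids[ids >= 0]
    dists = np.linalg.norm(xb[ids] - q, axis=1)
    top100 = ids[np.argsort(dists)[:100]]
```

Raising nprobe buys recall at the cost of latency; the re-rank step recovers most of the accuracy lost to quantization.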
If your dataset fits comfortably in memory (with 30 to 50 percent headroom for growth and OS buffers) and you need sub-10-millisecond p99 latency at 0.95+ recall, HNSW is compelling. It delivers stable, predictable latencies and supports dynamic updates natively. However, plan for periodic rebuilds (weekly to monthly) to defragment the graph and restore its quality. If you need hybrid keyword-plus-vector search, integrating HNSW into a search engine stack like OpenSearch is operationally simpler than maintaining separate systems.
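For the in-memory case, a minimal HNSW sketch using FAISS's IndexHNSWFlat; M, efConstruction, and efSearch below are common starting points, not tuned values.

```python
import numpy as np
import faiss

d = 768
xb = np.random.rand(100_000, d).astype("float32")   # stand-in corpus (must fit in RAM)
xq = np.random.rand(10, d).astype("float32")

index = faiss.IndexHNSWFlat(d, 32)        # M=32 links per node: higher = better recall, more memory
index.hnsw.efConstruction = 200           # build-time beam width
index.add(xb)                             # no train step; vectors are inserted incrementally

index.hnsw.efSearch = 64                  # query-time beam width: raise until the recall target is met
D, I = index.search(xq, 10)
```

efSearch is the main query-time knob; it trades latency for recall without rebuilding the graph.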
When your dataset exceeds available RAM but you cannot afford the memory upgrade, disk-optimized designs like DiskANN or hybrid IVF_HNSW_PQ become necessary. These trade higher latency (30 to 100 milliseconds) for cost efficiency and predictable performance when data spills to disk. This is particularly attractive for RAG applications where LLM generation takes 500 milliseconds or more; spending 50 milliseconds on retrieval is negligible. Invest in NVMe SSDs, implement aggressive prefetching, cache hot items in memory, and use block-structured layouts to minimize random I/O.
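One way to sketch the hot/cold split mentioned above: popular full-precision vectors pinned in RAM, the long tail memory-mapped from NVMe and touched only during re-ranking. The file layout, the popularity-based split, and the file name are assumptions for illustration, not any particular engine's design.

```python
import numpy as np

d, n_total, n_hot = 768, 100_000, 5_000

# Cold tier: the full-precision corpus laid out row-major in one file on NVMe.
# Generated in place here only so the sketch runs end to end; in production it
# would be written offline.
cold = np.memmap("vectors_fp32.dat", dtype="float32", mode="w+", shape=(n_total, d))
cold[:] = np.random.rand(n_total, d)

# Hot tier: the most requested ids pinned in RAM (popularity stats assumed known).
hot_ids = np.arange(n_hot)                                  # placeholder for "popular items"
hot_cache = {int(i): np.array(cold[i]) for i in hot_ids}

def fetch(vec_id: int) -> np.ndarray:
    """Return a full-precision vector, hitting RAM first and SSD on a miss."""
    v = hot_cache.get(vec_id)
    return v if v is not None else np.array(cold[vec_id])   # one random read per miss

def rerank(query: np.ndarray, candidate_ids: np.ndarray, k: int = 100) -> np.ndarray:
    """Exact re-rank of ANN candidates using the tiered full-precision store."""
    vecs = np.stack([fetch(int(i)) for i in candidate_ids])
    order = np.argsort(np.linalg.norm(vecs - query, axis=1))
    return candidate_ids[order[:k]]
```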
💡 Key Takeaways
• Memory planning is the first decision gate: 100 million 768-dimensional vectors need 300 GB raw, 600 GB to 1.5 TB with HNSW, or roughly 6 to 13 GB with IVF+PQ; if the dataset exceeds 1 to 2 times RAM, choose IVF+PQ or disk
• HNSW delivers sub-10-millisecond p99 latency at 0.95+ recall when fully in memory, but fails to build beyond tens of millions of vectors on constrained hardware; best for up to 50 million vectors on 128 GB RAM with dynamic updates
• IVF+PQ scales to billions with tunable nprobe (10 to 200) for recall versus latency; pair it with two-stage retrieval (PQ candidates plus full-precision re-rank) to recover accuracy; Meta's FAISS is the production reference implementation
• Disk-optimized designs trade 30 to 100 millisecond latency for cost efficiency and predictable out-of-memory behavior; ideal when downstream latency (e.g., LLM generation at 500 milliseconds) dominates the budget
• Parameter sweeps under realistic load are mandatory: test nlist, nprobe, and efSearch combinations, measure p95 and p99 latency plus recall, and choose operating points with 20 to 30 percent headroom below SLO thresholds (see the sweep sketch below)
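A hedged sketch of such a sweep, measuring recall@10 and tail latency for a grid of nprobe values against brute-force ground truth; the grid, the 100k-vector stand-in corpus, and the query count are illustrative only.

```python
import time
import numpy as np
import faiss

d, k = 768, 10
xb = np.random.rand(100_000, d).astype("float32")    # stand-in corpus
xq = np.random.rand(200, d).astype("float32")        # stand-in query workload

# Ground truth from an exact index.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, gt = flat.search(xq, k)

# Candidate index under test (nlist ~ sqrt(N)).
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, 316, 64, 8)
index.train(xb)
index.add(xb)

for nprobe in (10, 25, 50, 100, 200):
    index.nprobe = nprobe
    latencies, hits = [], 0
    for q, truth in zip(xq, gt):
        t0 = time.perf_counter()
        _, ids = index.search(q[None, :], k)
        latencies.append((time.perf_counter() - t0) * 1000)   # ms per single query
        hits += len(set(ids[0]) & set(truth))
    recall = hits / (len(xq) * k)
    p95, p99 = np.percentile(latencies, [95, 99])
    print(f"nprobe={nprobe:4d}  recall@10={recall:.3f}  p95={p95:.2f}ms  p99={p99:.2f}ms")
```

Pick the smallest setting whose p99 sits 20 to 30 percent below the SLO while meeting the recall target.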
📌 Examples
Decision example: 80 million vector dataset, 64 GB RAM available, 15 millisecond p99 SLO. Choose IVF+PQ with nlist = 4096 and nprobe = 80; expect approximately 10 GB index size, 12 millisecond p99, and 0.93 recall. Re-rank the top 500 with cached full-precision vectors to reach 0.95 recall.
Google ScaNN offers faster single-query latency than FAISS IVF+PQ at similar recall, but with higher memory use and fewer tuning parameters; choose it when latency is critical and memory is abundant.
Capacity planning: start with nlist approximately equal to the square root of the dataset size and efSearch or nprobe at 50, measure recall and latency, and iterate. For 100M vectors, try nlist = 10000 and nprobe = 100 as a baseline.
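A tiny helper capturing this starting heuristic; the returned values are first operating points to sweep from, not recommendations.

```python
from math import isqrt

def baseline_params(num_vectors: int) -> dict:
    """sqrt(N) clusters with a probe/beam width of 50 as a first operating point."""
    nlist = isqrt(num_vectors)            # sweep up to ~10x this during tuning
    return {"nlist": nlist, "nprobe": 50, "efSearch": 50}

print(baseline_params(100_000_000))       # {'nlist': 10000, 'nprobe': 50, 'efSearch': 50}
```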