Choosing the Right Index: Decision Framework and Capacity Planning
CHOOSING AN INDEX TYPE
Under 1 million vectors: a flat index (exact search) may be fast enough. Benchmark before adding complexity.
1 to 100 million vectors: HNSW if latency is critical (under 10ms) and the vectors fit in RAM.
100 million to 1 billion: IVF-PQ for memory efficiency, or HNSW sharded across multiple machines.
Over 1 billion: distributed solutions with sharding, typically IVF-PQ with disk storage.
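The tiers above can be encoded as a small decision helper. This is a hypothetical sketch; the function name and the boolean inputs are illustrative, and the thresholds are the ones stated in the text, not hard limits.

```python
def recommend_index(n_vectors: int, latency_critical: bool, fits_in_ram: bool) -> str:
    """Hypothetical helper encoding the size tiers from the text."""
    if n_vectors < 1_000_000:
        return "flat"  # exact search; benchmark before adding complexity
    if n_vectors < 100_000_000:
        # HNSW only pays off when latency matters and RAM is sufficient
        return "hnsw" if latency_critical and fits_in_ram else "ivf-pq"
    if n_vectors < 1_000_000_000:
        return "ivf-pq or sharded hnsw"
    return "distributed ivf-pq with disk storage"

print(recommend_index(500_000, latency_critical=False, fits_in_ram=True))   # flat
print(recommend_index(50_000_000, latency_critical=True, fits_in_ram=True)) # hnsw
```

In practice the RAM check should come from the capacity math in the next section rather than a caller-supplied flag.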
CAPACITY PLANNING
Calculate memory needs before choosing infrastructure.
HNSW: vectors + graph = approximately 1.5x raw vector size. For 100M 128-dimensional vectors: 100M × 128 × 4 bytes × 1.5 ≈ 77 GB RAM per replica.
IVF-PQ: roughly 8 to 16 bytes per vector, so 100M vectors = 0.8 to 1.6 GB.
Add replicas for query throughput: if one replica handles 100 QPS and you need 500 QPS, deploy 5 replicas.
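The arithmetic above is easy to get wrong by a factor of bytes vs. gigabytes, so a minimal sketch of the calculations (function names are illustrative; the 1.5x graph overhead and 8-16 bytes per vector are the rules of thumb from the text):

```python
import math

def hnsw_memory_gb(n_vectors, dim, dtype_bytes=4, graph_overhead=1.5):
    # Raw float32 vectors plus graph links ≈ 1.5x raw size (rule of thumb).
    return n_vectors * dim * dtype_bytes * graph_overhead / 1e9

def ivfpq_memory_gb(n_vectors, bytes_per_vector):
    # PQ codes only; the text cites roughly 8 to 16 bytes per vector.
    return n_vectors * bytes_per_vector / 1e9

def replicas_needed(target_qps, qps_per_replica):
    # Round up: a fractional replica does not exist.
    return math.ceil(target_qps / qps_per_replica)

print(f"HNSW, 100M x 128-dim: {hnsw_memory_gb(100_000_000, 128):.1f} GB")  # 76.8 GB
print(f"IVF-PQ, 100M at 8 B:  {ivfpq_memory_gb(100_000_000, 8):.1f} GB")   # 0.8 GB
print(f"Replicas for 500 QPS: {replicas_needed(500, 100)}")                # 5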
WHEN TO SHARD
Shard when a single machine cannot hold the index or cannot serve the required throughput. Sharding strategies:
By vector ID range: simple, but hot spots are possible.
By cluster: IVF clusters map to shards.
Random: uniform load, but routing is more complex.
Sharding adds coordination overhead: each query fans out to all shards and the results are merged. Expect 2 to 5ms of overhead per shard added.
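The fan-out-and-merge step can be sketched as a scatter-gather over shard search functions. This is a toy illustration, not any library's API: `fanout_search` and the shard callables are hypothetical, and each shard is assumed to return (distance, id) pairs for its local top-k.

```python
import heapq

def fanout_search(shards, query, k):
    """Scatter-gather: query every shard, merge local results into a global top-k.
    `shards` is a list of callables returning (distance, id) pairs (assumed API)."""
    partials = []
    for shard in shards:
        partials.extend(shard(query, k))  # in production these calls run in parallel
    return heapq.nsmallest(k, partials)   # global top-k, smallest distance first

# Toy shards returning precomputed (distance, id) results for any query.
shard_a = lambda q, k: [(0.1, "a1"), (0.4, "a2")]
shard_b = lambda q, k: [(0.2, "b1"), (0.9, "b2")]
print(fanout_search([shard_a, shard_b], None, 3))  # [(0.1, 'a1'), (0.2, 'b1'), (0.4, 'a2')]
```

Because every shard must answer before the merge completes, tail latency is set by the slowest shard, which is one source of the per-shard overhead noted above.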
LIBRARIES AND TOOLS
FAISS provides IVF, PQ, and flat indexes with GPU acceleration. Hnswlib is a lightweight HNSW implementation. ScaNN optimizes for modern CPUs with SIMD instructions. Milvus and Pinecone provide managed services with sharding and replication built in. Start with FAISS for prototyping, move to managed services for production if operational complexity is a concern.
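For intuition about what the flat (exact) baseline actually does, here is a dependency-free sketch of brute-force L2 search. It is conceptually what FAISS's IndexFlatL2 computes, but without the SIMD, batching, or contiguous storage that make the real library fast; the class and method names are illustrative.

```python
class FlatIndex:
    """Minimal exact-search (flat) index: brute-force L2 distance over all vectors."""
    def __init__(self):
        self.vectors = []

    def add(self, vec):
        self.vectors.append(vec)

    def search(self, query, k):
        # Compute squared L2 distance to every stored vector, keep the k smallest.
        dists = [(sum((a - b) ** 2 for a, b in zip(vec, query)), i)
                 for i, vec in enumerate(self.vectors)]
        return sorted(dists)[:k]

idx = FlatIndex()
idx.add([0.0, 0.0])
idx.add([1.0, 1.0])
idx.add([0.1, 0.0])
print(idx.search([0.0, 0.0], 2))  # nearest two ids: 0, then 2
```

This O(n) scan per query is exactly why flat indexes stop being "fast enough" somewhere around the million-vector mark, and why the approximate structures above exist.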