Natural Language Processing Systems • Semantic Search (Dense Embeddings, ANN)
Critical Trade-Offs: Accuracy, Latency, Memory, and Freshness
Designing a semantic search system requires navigating fundamental trade-offs. Each decision trades one resource or quality dimension for another, and the right choice depends on business constraints such as query volume, corpus size, update rate, and budget.
Accuracy versus latency is the most visible trade-off. Higher recall requires more work. In IVF, you probe more partitions (the nprobe parameter): probing 1 percent of 16,384 partitions means checking 164 partitions, while probing 5 percent checks 819. This might raise recall from roughly 88 percent to 95 percent, but CPU time increases proportionally. In HNSW, the efSearch parameter controls how many candidates the graph search considers; increasing it from 100 to 500 might improve recall from 92 to 97 percent but can triple query latency from 3 to 9 milliseconds. Cross-encoder reranking improves precision at 10 by 5 to 15 percent but adds 20 to 100 milliseconds. Tie these numbers to business metrics: if downstream click-through rate (CTR) is your goal, test whether recall at 50 above 0.9 is sufficient, because a learned ranker will reorder the candidates anyway.
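The recall-versus-work curve is easy to see in a toy, numpy-only IVF sketch. Sizes here are scaled far down from the text's 16,384 partitions, and all names are illustrative; real systems would use a library such as FAISS. Probing more lists is a superset of probing fewer, so recall can only go up while the number of scanned vectors (a proxy for CPU time) grows roughly in proportion:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_db, n_q, n_lists, k = 32, 20_000, 50, 256, 10
db = rng.standard_normal((n_db, d)).astype(np.float32)
queries = rng.standard_normal((n_q, d)).astype(np.float32)

def assign_nearest(points, cents):
    # squared L2 via the expansion ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2
    d2 = (points**2).sum(1, keepdims=True) - 2 * points @ cents.T + (cents**2).sum(1)
    return np.argmin(d2, axis=1)

# Crude "training": a few Lloyd iterations of k-means for the coarse centroids.
centroids = db[rng.choice(n_db, n_lists, replace=False)]
for _ in range(5):
    assign = assign_nearest(db, centroids)
    for c in range(n_lists):
        members = db[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)
assign = assign_nearest(db, centroids)
inverted = [np.where(assign == c)[0] for c in range(n_lists)]

def exact_topk(q):
    return np.argsort(((db - q) ** 2).sum(1))[:k]

def ivf_search(q, nprobe):
    probe = np.argsort(((centroids - q) ** 2).sum(1))[:nprobe]
    cand = np.concatenate([inverted[c] for c in probe])
    order = np.argsort(((db[cand] - q) ** 2).sum(1))[:k]
    return cand[order], len(cand)  # result ids plus distance-computation count

results = {}
for nprobe in (3, 13):  # roughly 1 and 5 percent of the 256 lists
    hits = scanned = 0
    for q in queries:
        found, n_cand = ivf_search(q, nprobe)
        hits += len(set(exact_topk(q)) & set(found))
        scanned += n_cand
    results[nprobe] = (hits / (n_q * k), scanned / n_q)
    print(f"nprobe={nprobe:2d}  recall@{k}={results[nprobe][0]:.2f}  "
          f"avg vectors scanned={results[nprobe][1]:.0f}")
```

The same shape holds for efSearch in HNSW: a larger candidate beam means better recall and more distance evaluations per query.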
Memory versus quality is critical at scale. Product Quantization compresses vectors by 50 to 100 times, enabling hundreds of millions of items to fit in RAM per node. For example, compressing 768 float32 dimensions from 3,072 bytes to 32 bytes with 8-bit codes allows a 200 million vector index to use 10 to 20 GB instead of roughly 600 GB. The cost is quantization error, typically 1 to 5 percent recall loss at common settings. HNSW keeps full-precision vectors, which improves recall by 2 to 5 percent over quantized methods, but memory scales linearly with corpus size and limits per-node capacity to 10 to 50 million vectors depending on hardware. The choice comes down to whether you value memory efficiency or maximum quality: if you can afford the RAM, HNSW is simpler and more accurate; if you need to fit 100 million vectors per shard, quantization is necessary.
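The memory figures above follow from simple arithmetic, sketched here for the text's configuration (768-d float32 vectors, PQ with 32 subquantizers emitting one 8-bit code each, a 200 million vector corpus); the 10 to 20 GB figure additionally includes index and codebook overhead not modeled here:

```python
n_vectors = 200_000_000
dim, bytes_per_float = 768, 4

raw_bytes = dim * bytes_per_float   # 3,072 bytes per full-precision vector
pq_subvectors = 32                  # one uint8 code per subvector
pq_bytes = pq_subvectors * 1        # 32 bytes per PQ-compressed vector

GB = 1024 ** 3
print(f"raw vectors: {n_vectors * raw_bytes / GB:,.0f} GiB")  # ~572 GiB (~614 GB decimal)
print(f"PQ codes:    {n_vectors * pq_bytes / GB:,.1f} GiB")   # ~6 GiB before overhead
print(f"compression: {raw_bytes / pq_bytes:.0f}x")            # 96x
```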
Freshness versus index optimality presents a different challenge. IVF and Product Quantization require a training phase (k-means clustering and codebook learning) and work best with batch rebuilds, because inserting new vectors without retraining degrades cluster balance. HNSW supports online inserts and deletes well, making it suitable for frequently updated catalogs, but even HNSW benefits from periodic rebuilds because long-running graphs accumulate suboptimal links. The decision hinges on update rate. If your catalog has 1 percent churn per hour (for example, news articles or social posts), HNSW or a hybrid tiering approach (recent items in HNSW, stable corpus in IVF) works well. If you rebuild nightly and updates are batched, IVF with Product Quantization is more memory efficient.
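The hybrid tiering approach can be sketched as two indices behind one search interface. This is a minimal illustration, not a production design: both tiers are plain brute-force scans standing in for HNSW (hot, accepts online inserts) and IVF-PQ (cold, rebuilt in batch), and the class names and merge logic are hypothetical:

```python
import numpy as np

class Tier:
    """Brute-force stand-in for an ANN index over (id, vector) pairs."""
    def __init__(self):
        self.ids, self.vecs = [], []
    def add(self, item_id, vec):
        self.ids.append(item_id)
        self.vecs.append(vec)
    def search(self, q, k):
        if not self.ids:
            return []
        d2 = ((np.array(self.vecs) - q) ** 2).sum(1)
        order = np.argsort(d2)[:k]
        return [(self.ids[i], d2[i]) for i in order]

class TieredIndex:
    def __init__(self):
        self.hot, self.cold = Tier(), Tier()
    def insert(self, item_id, vec):
        self.hot.add(item_id, vec)      # online inserts land in the hot tier
    def compact(self):
        # periodic (e.g. nightly) rebuild: migrate hot items into the cold tier
        for item_id, vec in zip(self.hot.ids, self.hot.vecs):
            self.cold.add(item_id, vec)
        self.hot = Tier()
    def search(self, q, k):
        # query both tiers, merge by distance
        merged = self.hot.search(q, k) + self.cold.search(q, k)
        return sorted(merged, key=lambda t: t[1])[:k]

rng = np.random.default_rng(1)
idx = TieredIndex()
for i in range(100):
    idx.insert(i, rng.standard_normal(8))
idx.compact()                              # batch rebuild of the stable corpus
for i in range(100, 110):
    idx.insert(i, rng.standard_normal(8))  # fresh items, searchable immediately
q = rng.standard_normal(8)
print(idx.search(q, 5))
```

The point of the pattern is that freshly inserted items are searchable at once while the bulk of the corpus stays in the memory-efficient batch-built tier.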
CPU versus GPU and model generality versus domain adaptation are also key. GPUs excel at batched dot products and can handle brute-force search for a few million vectors, or serve as rerankers processing hundreds of candidates in parallel. CPUs with SIMD and cache-friendly indices like HNSW deliver lower latency on single-query streams and are more cost effective at a few thousand QPS. Mixed deployments are common: CPU for the ANN index, GPU for the cross-encoder reranking stage. On the modeling side, out-of-the-box sentence encoders get you started quickly, but fine-tuning on in-domain click logs or labeled pairs typically improves offline NDCG by 10 to 20 percent. This requires building a label pipeline and an offline evaluation harness, and you must monitor for bias amplification. The trade-off is development cost versus quality.
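The mixed deployment pattern is retrieve-then-rerank: a cheap first stage produces a candidate pool, and an expensive scorer reorders only that pool. A minimal sketch, with a full-corpus dot product standing in for the CPU-side ANN index and a hypothetical `cross_encoder_score` placeholder standing in for the GPU cross-encoder:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_docs = 64, 5_000
docs = rng.standard_normal((n_docs, d)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # unit-normalize

def ann_retrieve(q, pool_size=100):
    # first stage: cheap dot-product retrieval (stand-in for an ANN index)
    return np.argsort(docs @ q)[::-1][:pool_size]

def cross_encoder_score(q, doc_ids):
    # hypothetical placeholder for a real cross-encoder model;
    # here just a noisy dot product over the candidate pool
    return docs[doc_ids] @ q + 0.05 * rng.standard_normal(len(doc_ids))

def search(q, k=10, pool_size=100):
    pool = ann_retrieve(q, pool_size)           # cheap over millions of docs
    scores = cross_encoder_score(q, pool)       # expensive, but only ~100 docs
    return pool[np.argsort(scores)[::-1][:k]]

q = rng.standard_normal(d).astype(np.float32)
q /= np.linalg.norm(q)
top = search(q)
print(top)
```

The division of labor mirrors the text: per-query cost of the second stage is bounded by the pool size, not the corpus size, which is what makes a heavyweight reranker affordable.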
💡 Key Takeaways
• Accuracy vs latency: Increasing IVF nprobe from 1 to 5 percent of partitions raises recall from 88 to 95 percent but increases CPU time 5 times; HNSW efSearch 100 to 500 triples latency for a 5 percent recall gain
• Memory vs quality: Product Quantization compresses 768d float32 from 3,072 bytes to 32 bytes, enabling 200M vectors in 10 to 20 GB at a 1 to 5 percent recall loss; HNSW uses full precision and roughly 2 times the memory for 2 to 5 percent better recall
• Freshness vs index: IVF PQ needs batch rebuilds for optimal cluster balance, suiting nightly updates; HNSW supports online inserts for 1 percent per hour churn but benefits from periodic compaction
• CPU vs GPU: CPUs with HNSW deliver 1 to 5 millisecond single-query latency; GPUs handle batched dot products at tens of thousands of QPS or rerank the top 100 candidates in parallel in 20 to 40 milliseconds
• Model generality vs domain adaptation: Fine-tuning on in-domain click logs improves NDCG by 10 to 20 percent but requires label pipelines and bias monitoring; generic encoders have faster time to production
• Business metric alignment: Tie recall targets to downstream CTR or conversion; recall at 50 above 0.9 may suffice if a learned ranker reorders candidates
📌 Examples
E-commerce search: Use IVF PQ for 100 million products with nightly catalog updates, nprobe tuned to 95 percent recall at 100; rerank the top 100 with a cross-encoder for 50 millisecond p95 end-to-end latency
Social media feed: Use HNSW for 20 million recent posts with continuous ingestion of 100K posts per hour; dual-index strategy with HNSW for the last 7 days and IVF PQ for the historical archive
News search: Fine-tune the embedding model on 1 million click pairs from the last 30 days, improving NDCG from 0.72 to 0.81; retrain weekly to capture trending topics and reduce model staleness