Implementation Details: Sharding, Monitoring, and Optimization
Implementing a production semantic search system requires careful design of the embedding layer, the sharding strategy, and the monitoring that closes the feedback loop. These details often determine whether your system scales and stays healthy under real-world load.
Design the embedding layer first. Choose a dimension and model that fit your latency and memory budget. For interactive search, keep query embedding under 10 milliseconds on CPU. Common production choices are 256, 384, or 768 dimensions. Always normalize vectors to unit length if using cosine similarity; this makes cosine equivalent to dot product and simplifies index logic. For long documents, chunk text into overlapping windows of 300 tokens with a 50-token overlap. This prevents a single noisy segment from dominating the embedding and stabilizes recall. Store vectors alongside document IDs, chunk offsets, and filterable metadata such as language, tenant ID, and access control tags. This metadata enables filtering and snippet extraction.
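A minimal sketch of the chunk-and-normalize pipeline described above; embed_fn is a hypothetical embedding call standing in for whatever model you deploy, and the record layout is one plausible shape for the metadata:

```python
import numpy as np

def chunk_tokens(tokens, window=300, overlap=50):
    """Split a token list into overlapping windows (300/50 per the text)."""
    step = window - overlap
    return [tokens[i:i + window] for i in range(0, max(len(tokens) - overlap, 1), step)]

def normalize(vectors):
    """L2-normalize rows so cosine similarity reduces to a dot product."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def make_records(doc_id, tokens, embed_fn, lang, tenant_id, acl_tags):
    """Build one indexable record per chunk, carrying the filterable metadata."""
    records = []
    for offset, chunk in enumerate(chunk_tokens(tokens)):
        vec = normalize(embed_fn(chunk).reshape(1, -1))[0]
        records.append({
            "doc_id": doc_id,
            "chunk_offset": offset,
            "vector": vec,
            "lang": lang,
            "tenant_id": tenant_id,
            "acl_tags": acl_tags,
        })
    return records
```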
Sharding and replication are essential for scale and availability. Use hash-based sharding by document ID for uniform load distribution. Alternatively, route by IVF coarse centroid to collocate vectors with their partitions, but this risks skew if centroids are imbalanced. Keep at least two replicas per shard for availability and to handle node failures without downtime. For memory planning, compute raw size as number of vectors times dimensions times 4 bytes for float32. For example, 100 million vectors at 384 dimensions times 4 bytes is roughly 154 GB (about 143 GiB). With 32-byte Product Quantization (PQ) codes and index overhead (graph links, posting lists, metadata), plan 6 to 10 GB per 100 million vectors per replica. A node with 64 GB RAM can comfortably serve two replicas of a 50 million vector shard. Use consistent hashing for membership changes: when adding or removing shards, only a fraction of documents need to move. Warm new shards by preloading centroids and hot posting lists before serving traffic.
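The memory arithmetic is easy to script. A back-of-envelope planner follows; the 48-byte per-vector overhead is an assumed ballpark chosen to land in the 6 to 10 GB range quoted above, and the modulo hash routing is a simplification of the consistent hashing you would use in production:

```python
import hashlib

def raw_vector_gib(n_vectors, dim, bytes_per_float=4):
    """Raw float32 footprint: n * d * 4 bytes."""
    return n_vectors * dim * bytes_per_float / 2**30

def pq_index_gib(n_vectors, code_bytes=32, overhead_bytes=48):
    """PQ codes plus assumed per-vector overhead (graph links, posting lists, metadata)."""
    return n_vectors * (code_bytes + overhead_bytes) / 2**30

def shard_for(doc_id, num_shards):
    """Hash-based routing; consistent hashing would limit movement on membership changes."""
    return int(hashlib.md5(doc_id.encode()).hexdigest(), 16) % num_shards

print(f"{raw_vector_gib(100_000_000, 384):.1f} GiB raw")    # ~143.1
print(f"{pq_index_gib(100_000_000):.1f} GiB compressed")    # ~7.5
print(shard_for("doc-42", num_shards=4))
```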
Hybrid retrieval improves robustness and is widely deployed. Use a lexical prefilter to enforce must-have keywords, then run ANN on the filtered candidate set. Or retrieve the top K from both BM25 and dense ANN, then blend the scores. Calibrate the blend weight with a small supervised model over offline labels, or use isotonic regression on click data. A cross-encoder reranker over the top 50 to 200 candidates typically improves precision at 10 by 5 to 15 percent, at a 20 to 50 millisecond latency cost on CPU. Cache query embeddings and top K document IDs for head queries to save 20 to 50 percent of compute. Use short time-to-live (TTL) values, for example 5 to 30 minutes, to balance freshness and cache hit rate.
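One simple way to blend the two retrievers is min-max normalization of each score list followed by a weighted sum. The sketch below assumes each retriever's hits arrive as a doc-to-score dict; the weight of 0.4 is a placeholder you would calibrate as described above:

```python
import numpy as np

def blend_scores(bm25, dense, weight=0.4):
    """Min-max normalize each list, then blend; weight is the dense-side weight."""
    def norm(scores):
        s = np.asarray(scores, dtype=float)
        span = s.max() - s.min()
        return (s - s.min()) / span if span > 0 else np.zeros_like(s)
    return (1 - weight) * norm(bm25) + weight * norm(dense)

def hybrid_merge(bm25_hits, dense_hits, weight=0.4):
    """Union the candidate sets; docs missing from one retriever get its floor score."""
    docs = sorted(set(bm25_hits) | set(dense_hits))
    b_floor = min(bm25_hits.values(), default=0.0)
    d_floor = min(dense_hits.values(), default=0.0)
    b = [bm25_hits.get(d, b_floor) for d in docs]
    v = [dense_hits.get(d, d_floor) for d in docs]
    blended = blend_scores(b, v, weight)
    return sorted(zip(docs, blended), key=lambda x: -x[1])
```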
Monitoring closes the loop and detects degradation. Track recall at K against a periodic brute-force baseline on a small held-out sample; run this hourly or daily. Track p50, p95, and p99 latency split by stage: query embedding, ANN search, reranking, and metadata fetch. Monitor index health metrics: for HNSW, track the graph degree distribution and ensure it stays close to the configured M parameter; for IVF, monitor centroid load balance and the posting list length distribution to detect skew; for quantization, track distortion (mean squared error between original and quantized vectors). Monitor the distribution of nearest-neighbor distances over time; a shift indicates embedding drift or a change in the data distribution. When drift or quality regressions appear, retrain the embedding model or the quantizer, backfill the corpus, and rebuild indices.
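A sketch of the periodic recall check, assuming unit-normalized vectors so exact search is a plain dot product over the held-out sample:

```python
import numpy as np

def brute_force_topk(queries, corpus, k=100):
    """Exact top-k on unit-normalized float32 arrays: cosine == dot product."""
    sims = queries @ corpus.T
    return np.argsort(-sims, axis=1)[:, :k]

def recall_at_k(ann_ids, exact_ids, k=100):
    """Fraction of brute-force top-k neighbors the ANN index also returned,
    averaged over queries; alert if this dips below your target."""
    hits = [len(set(a[:k]) & set(e[:k])) / k for a, e in zip(ann_ids, exact_ids)]
    return float(np.mean(hits))
```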
For updates, batch inserts and use compaction windows to avoid fragmentation. HNSW handles online inserts well, but batching 1,000 to 10,000 inserts reduces overhead. For deletes, use tombstones and periodically compact or rebuild to reclaim memory. IVF-based indices benefit from periodic full rebuilds (for example, weekly or monthly) to rebalance centroids and posting lists. Shadow new indices and A/B test before switching traffic to ensure quality and latency remain within bounds.
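A minimal sketch of buffered inserts with tombstoned deletes, using faiss (assumed installed as faiss-cpu); a production system would bound the over-fetch and reclaim tombstoned memory during scheduled compaction or rebuilds:

```python
import numpy as np
import faiss  # assumed available: pip install faiss-cpu

d, M, FLUSH_AT = 384, 32, 1000
index = faiss.IndexIDMap(faiss.IndexHNSWFlat(d, M))
buffer_vecs, buffer_ids, tombstones = [], [], set()

def insert(vec, doc_id):
    """Buffer inserts and flush in batches to cut per-insert overhead."""
    buffer_vecs.append(vec.astype(np.float32))
    buffer_ids.append(doc_id)
    if len(buffer_ids) >= FLUSH_AT:
        flush()

def flush():
    if buffer_ids:
        index.add_with_ids(np.vstack(buffer_vecs), np.array(buffer_ids, dtype=np.int64))
        buffer_vecs.clear()
        buffer_ids.clear()

def delete(doc_id):
    # HNSW cannot cheaply remove nodes, so mark the ID and filter at query
    # time until the next compaction or rebuild reclaims the memory.
    tombstones.add(doc_id)

def search(query, k=10):
    # Over-fetch to compensate for tombstoned hits, then filter them out.
    D, I = index.search(query.reshape(1, -1).astype(np.float32), k + len(tombstones))
    return [(i, s) for i, s in zip(I[0], D[0]) if i >= 0 and i not in tombstones][:k]
```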
💡 Key Takeaways
• Embed queries in under 10 milliseconds on CPU; common dimensions are 256, 384, or 768; normalize to unit length for cosine similarity; chunk long docs into 300-token windows with 50-token overlap
• Sharding: hash by document ID for uniform load; 100M vectors at 384d is about 143 GiB raw, compressing to 6 to 10 GB per replica with PQ; plan 2 replicas per shard for availability
• Hybrid retrieval: BM25 prefilter plus ANN, or blend BM25 and dense scores with learned weights; a cross-encoder reranks the top 50 to 200 for a 5 to 15 percent precision gain at 20 to 50 milliseconds of cost
• Monitoring: track recall at K against a brute-force baseline hourly; monitor p95 latency by stage; track HNSW degree distribution, IVF centroid balance, and nearest-neighbor distance distribution for drift
• Caching: cache query embeddings and top K IDs for head queries with a 5 to 30 minute TTL, saving 20 to 50 percent of compute; use consistent hashing for shard membership changes
• Updates: batch 1,000 to 10,000 inserts; use tombstones for deletes with periodic compaction; rebuild IVF indices weekly or monthly to rebalance centroids
📌 Examples
Amazon product search: 200 million products split into 4 shards of 50 million each, each shard carrying an 8 GB PQ index replicated twice; nodes with 64 GB RAM serve 1 replica; queries fan out to all shards and the top 100 per shard are merged
Google monitors recall at 100 by running brute force on a 10,000-query sample daily; it alerts if recall drops below 0.92 and retrains IVF centroids monthly to maintain balance as the corpus grows
Microsoft Bing caches embeddings for the top 1 million queries with a 15 minute TTL, achieving a 40 percent cache hit rate and reducing embedding compute by 40 percent; the cache is updated asynchronously as query trends shift