
Implementation Details: Sharding, Monitoring, and Optimization

Index Management

Vector indices need maintenance. New documents require embedding and insertion. Deleted documents leave dead entries. Updated documents need old vectors removed and new ones added. Index building is expensive - a million vectors with HNSW takes 10-30 minutes. Some systems support online updates while serving queries; others require downtime.
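The update semantics above can be sketched with a small, hypothetical index wrapper. This is a brute-force stand-in, not a real HNSW implementation: deletes and updates leave dead rows behind (as many ANN indices do) until an expensive `rebuild()` compacts them. All class and method names here are illustrative.

```python
import numpy as np

class VectorIndex:
    """Illustrative brute-force stand-in for an ANN index.

    Deletes and updates leave dead rows behind until rebuild(),
    mirroring how many real vector indices accumulate dead entries.
    """

    def __init__(self, dim):
        self.dim = dim
        self.rows = np.empty((0, dim), dtype=np.float32)
        self.row_ids = []          # doc id stored at each row
        self.live = {}             # doc_id -> row index of current vector

    def insert(self, doc_id, vec):
        vec = np.asarray(vec, dtype=np.float32).reshape(1, self.dim)
        self.rows = np.vstack([self.rows, vec])
        self.row_ids.append(doc_id)
        self.live[doc_id] = len(self.row_ids) - 1

    def delete(self, doc_id):
        self.live.pop(doc_id, None)   # row stays behind as a dead entry

    def update(self, doc_id, vec):
        self.delete(doc_id)           # old vector becomes a dead entry
        self.insert(doc_id, vec)      # new vector appended

    def rebuild(self):
        """Expensive offline step: compact away dead entries."""
        keep = sorted(self.live.values())
        self.rows = self.rows[keep]
        self.row_ids = [self.row_ids[i] for i in keep]
        self.live = {d: i for i, d in enumerate(self.row_ids)}

    def search(self, query, k=5):
        if not self.live:
            return []
        q = np.asarray(query, dtype=np.float32)
        sims = self.rows @ q / (
            np.linalg.norm(self.rows, axis=1) * np.linalg.norm(q) + 1e-9)
        live_rows = set(self.live.values())
        order = [i for i in np.argsort(-sims) if i in live_rows]
        return [(self.row_ids[i], float(sims[i])) for i in order[:k]]
```

A real system would time `rebuild()` against the serving schedule: either rebuild a fresh index in the background and swap it in, or accept downtime.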

Filtering and Metadata

Pure semantic search ignores structure, but users often want semantic search within constraints: "find similar products in electronics under a given price." Efficient filtering requires metadata support before or during ANN search. Pre-filtering (restricting candidates before the search) works for selective filters; post-filtering (discarding non-matching results afterward) works when most documents pass the filter.

💡 Key Insight: Filtering to 1% of docs before ANN may hurt recall - the index was built on the full corpus and may not navigate efficiently to filtered subsets. Test filtered query performance explicitly.
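The two strategies can be contrasted in a minimal sketch. Brute-force cosine search stands in for the ANN index, and the corpus, categories, and function names are all hypothetical: pre-filtering searches exactly within the filtered subset, while post-filtering over-retrieves from the full corpus and then drops non-matching hits (so it may return fewer than `k` results).

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 10_000, 64
vectors = rng.normal(size=(N, D)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)   # unit vectors
category = rng.choice(["electronics", "books", "toys"], size=N)

def top_k(q, idx, k):
    """Exact cosine top-k over the rows listed in idx."""
    sims = vectors[idx] @ q
    best = np.argsort(-sims)[:k]
    return idx[best], sims[best]

def pre_filter_search(q, cat, k=10):
    """Pre-filter: brute-force over the (small) filtered subset.
    Exact within the subset, so no recall loss from index navigation."""
    idx = np.where(category == cat)[0]
    return top_k(q, idx, k)

def post_filter_search(q, cat, k=10, overfetch=5):
    """Post-filter: take the global top overfetch*k, then drop
    non-matching docs. May return fewer than k hits."""
    ids, _ = top_k(q, np.arange(N), k * overfetch)
    keep = ids[category[ids] == cat][:k]
    return keep, vectors[keep] @ q
```

The `overfetch` factor is the key tuning knob for post-filtering: too low and selective filters starve the result list, too high and latency approaches an unfiltered exhaustive search.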

Monitoring and Debugging

Semantic search failures are hard to debug - no obvious "wrong answer." Monitor query latency (P50, P99), click-through rates, and no-click rates. Log query and result vectors to diagnose poor matches. Visualizing embeddings with dimensionality reduction (t-SNE, UMAP) reveals clustering problems.
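The latency and click metrics above reduce to a few lines of aggregation over a query log. The log values below are made up for illustration; the point is the shape of the computation (percentile latencies plus a no-click rate as a proxy for "results looked wrong"):

```python
import numpy as np

# Hypothetical query log: latency in ms and whether any result was clicked.
latencies_ms = np.array([12, 15, 14, 90, 13, 16, 210, 15, 14, 13], dtype=float)
clicked = np.array([1, 1, 0, 0, 1, 1, 0, 1, 0, 1], dtype=bool)

p50, p99 = np.percentile(latencies_ms, [50, 99])
ctr = clicked.mean()                  # click-through rate
no_click_rate = 1.0 - ctr             # proxy for poor result quality

print(f"P50={p50:.1f}ms  P99={p99:.1f}ms  CTR={ctr:.0%}  no-click={no_click_rate:.0%}")
```

For the qualitative side, the logged query and result vectors can be projected to 2D with t-SNE or UMAP (e.g., `sklearn.manifold.TSNE` or the `umap-learn` package); queries that land far from their clicked results usually indicate an embedding-space mismatch.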

Scaling Patterns

Beyond single-node capacity, shard vectors across machines by partition (e.g., by document ID range). Query all shards in parallel and merge the results. This adds latency but enables arbitrary scale. Some workloads benefit from replicas: multiple copies of the same shard serving read traffic for higher throughput.
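The scatter-gather pattern can be sketched as follows. Each shard holds a contiguous document-ID range (an assumption for this example), every shard computes its local top-k, and the coordinator merges the partial results into a global top-k. Brute-force cosine search again stands in for a per-shard ANN index, and threads stand in for network calls to shard servers.

```python
import heapq
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(1)
D, K = 32, 5

# Hypothetical deployment: vectors partitioned into 4 shards by doc-ID range.
shards = []   # each shard: (id_offset, normalized vector matrix)
for s in range(4):
    vecs = rng.normal(size=(1000, D)).astype(np.float32)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    shards.append((s * 1000, vecs))

def search_shard(shard, q, k=K):
    """Local top-k on one shard; returns (similarity, global_doc_id) pairs."""
    offset, vecs = shard
    sims = vecs @ q
    top = np.argsort(-sims)[:k]
    return [(float(sims[i]), offset + int(i)) for i in top]

def search_all(q, k=K):
    # Scatter: query every shard in parallel; gather: merge local top-ks.
    with ThreadPoolExecutor() as pool:
        partials = pool.map(search_shard, shards, [q] * len(shards))
        merged = [hit for part in partials for hit in part]
    return heapq.nlargest(k, merged)   # global top-k by similarity
```

The merge is cheap (each shard returns only k candidates), so end-to-end latency is dominated by the slowest shard - which is why tail latency per shard, not average latency, governs this design.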

💡 Key Takeaways
- Index building is expensive (10-30 min for 1M vectors with HNSW) - understand update semantics and plan migrations.
- Metadata filtering affects recall: filtering to 1% of docs before ANN may miss results, since the index was built on the full corpus.
- Debug with click-through and no-click rates; visualize embeddings with t-SNE/UMAP to reveal clustering problems.
- Scale by sharding vectors across machines, querying all shards in parallel, and merging results - this adds latency but enables arbitrary scale.
📌 Interview Tips
1. Explain pre-filter vs post-filter: with selective filters (1% of docs), post-filtering avoids the ANN navigation issues that pre-filtering can cause.
2. For debugging, recommend logging query and result vectors. Visualizing with UMAP can show if queries land in wrong neighborhoods.
3. Describe the sharding pattern: partition by document ID ranges, parallel query, merge. Trade latency for scale.