
Production Dense Retrieval Pipeline: Embedding, Indexing, and Serving

A production dense retrieval system operates across three planes: offline embedding and indexing, online query serving, and continuous monitoring. Each plane has distinct performance requirements and failure modes that shape the system architecture.

Offline embedding processes documents in large batches. Chunking comes first because transformer models have context-window limits. Production systems typically use 200-to-300-token passages with a 32-to-64-token overlap between chunks to handle concepts that span boundaries. A single modern GPU can encode 2,000 to 5,000 passages per second for BERT-base-sized models at batch size 128, so a 10-GPU cluster completes a full embedding pass over 100 million documents in 6 to 12 hours. Each 768-dimensional float32 vector consumes about 3 kilobytes, so 100 million passages require roughly 300 gigabytes of raw vector storage before compression.

ANN index construction balances memory, latency, and recall. Hierarchical Navigable Small World (HNSW) graphs offer strong recall with query latencies under 10 milliseconds at 10 million vectors per shard, but consume 1.5 to 2 times the raw vector memory. Inverted file with product quantization (IVF-PQ) compresses vectors to 64 to 128 bytes each, shrinking 100 million float16 vectors from 147 gigabytes to under 13 gigabytes and enabling single-machine deployment; the trade-off is a typical 2-to-5-point drop in recall@100. Sharding keeps per-shard sizes under 30 million vectors to bound tail latency, with stable hash partitioning by document ID.

Online serving has a tight latency budget. Query encoding takes 2 to 5 milliseconds on GPU with micro-batching, or 10 to 30 milliseconds on an optimized CPU with vector instructions. ANN search across 10 shards, taking the top 100 per shard, runs 5 to 20 milliseconds per shard in parallel. A broker merges and deduplicates results within 5 to 10 milliseconds. Cross-encoder re-ranking of the top 50 to 200 candidates adds 20 to 80 milliseconds on a T4-class GPU. Total P95 latency budgets are typically under 150 milliseconds. This multi-stage cascade is standard: retrieve 1,000 candidates with ANN, re-rank 200 with a fast transformer, return the top 20.

Capacity planning requires concrete math. At 5,000 QPS with 10 shards, each shard serves 500 QPS (this assumes queries are spread across shards; under full fan-out, every shard would see the total query rate). With 15-millisecond P95 ANN search and a concurrency of 20, a single server per shard handles that load; add 2x replication for availability. For query encoding, batching 10 queries at 3 milliseconds per batch yields roughly 3,000 QPS per GPU, so three GPUs with headroom serve 5,000 QPS at the target P95.
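A minimal sketch of the overlapping chunker described above. The 256-token window and 48-token overlap sit inside the recommended ranges but are otherwise arbitrary, and the integer list stands in for the token IDs a real tokenizer would produce:

```python
def chunk_tokens(tokens, window=256, overlap=48):
    """Split a token sequence into overlapping passages.

    window/overlap follow the 200-300 token and 32-64 token guidance
    above; the exact values here are illustrative, not prescribed.
    """
    if window <= overlap:
        raise ValueError("window must exceed overlap")
    stride = window - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), stride):
        chunks.append(tokens[start:start + window])
    return chunks

# A 600-token document yields chunks starting at positions 0, 208, 416,
# so any concept within 48 tokens of a boundary appears in two passages.
doc = list(range(600))  # stand-in for real token IDs
for i, chunk in enumerate(chunk_tokens(doc)):
    print(i, chunk[0], len(chunk))
```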
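An IVF-PQ index of the kind described can be built with FAISS. A sketch, assuming 768-dimensional embeddings and 64-byte PQ codes (m = 64 subquantizers at 8 bits each); the nlist and nprobe values are illustrative tuning knobs, and the random training data stands in for real passage embeddings:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, nlist = 768, 1024          # vector dim, number of inverted lists
m, nbits = 64, 8              # 64 subquantizers x 8 bits = 64-byte codes

quantizer = faiss.IndexFlatL2(d)                       # coarse quantizer
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

xb = np.random.rand(200_000, d).astype("float32")      # stand-in embeddings
index.train(xb)                                        # learn lists + PQ codebooks
index.add(xb)

index.nprobe = 32                  # lists probed per query: recall vs latency knob
scores, ids = index.search(xb[:1], 100)                # top-100 (the recall@100 regime)
```

Raising nprobe recovers some of the 2-to-5-point recall@100 loss at the cost of latency, which is the central tuning decision for a compressed index.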
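Stable hash partitioning by document ID can be as simple as the following; the MD5-based routing function is one illustrative choice, not a prescribed one:

```python
import hashlib

NUM_SHARDS = 10

def shard_for(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    # A stable hash (unlike Python's per-process randomized hash())
    # guarantees the same document always routes to the same shard,
    # across machines and restarts.
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

assert shard_for("doc-42") == shard_for("doc-42")  # deterministic routing
```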
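The retrieve-then-re-rank cascade, sketched end to end. `cross_encoder_score` is a hypothetical placeholder for a real cross-encoder forward pass, and the broker merge across shards is elided; the stage sizes match the 1,000 → 200 → 20 cascade above:

```python
import numpy as np

def cross_encoder_score(query: str, passage: str) -> float:
    # Placeholder scorer: in production this is a transformer forward
    # pass over the concatenated (query, passage) pair on a GPU.
    return float(len(set(query.split()) & set(passage.split())))

def serve_query(query_vec, query_text, index, passages,
                n_ann=1000, n_rerank=200, n_final=20):
    # Stage 1: cheap ANN recall (here a single FAISS-style index
    # stands in for the sharded fan-out plus broker merge).
    q = np.asarray(query_vec, dtype="float32").reshape(1, -1)
    _, ids = index.search(q, n_ann)
    candidates = [int(i) for i in ids[0] if i != -1]

    # Stage 2: expensive cross-encoder precision pass over the head
    # of the candidate list only, bounding the GPU cost per query.
    head = candidates[:n_rerank]
    head.sort(key=lambda pid: cross_encoder_score(query_text, passages[pid]),
              reverse=True)

    # Stage 3: return the final page of results to the caller.
    return head[:n_final]
```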
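Finally, the capacity arithmetic written out. Every input is taken from the figures in this section except the five-passages-per-document expansion, which is an added illustrative assumption needed to reconcile the document count with the embedding time:

```python
# Storage: 100M passages at 768 float32 dims (~3 KB each).
passages = 100_000_000
raw_gb   = passages * 768 * 4 / 1e9            # ~307 GB uncompressed
pq_gb    = passages * 64 / 1e9                 # ~6.4 GB at 64-byte PQ codes

# Embedding time: 100M docs, assumed ~5 passages/doc, 10 GPUs at the
# low end of the 2,000-5,000 passages/s/GPU range.
docs, per_doc, rate, gpus = 100_000_000, 5, 2_000, 10
embed_hours = docs * per_doc / (rate * gpus) / 3600   # ~6.9 hours

# Query encoding: batches of 10 at 3 ms per batch per GPU.
qps_per_gpu = 10 / 0.003                       # ~3,333 QPS; 3 GPUs cover 5,000

print(f"{raw_gb:.0f} GB raw, {pq_gb:.1f} GB PQ, "
      f"{embed_hours:.1f} h embed, {qps_per_gpu:.0f} QPS/GPU")
```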
💡 Key Takeaways
Chunking at 200 to 300 tokens with a 32-to-64-token overlap handles transformer context limits while preserving boundary-spanning concepts, creating manageable passage sizes
Offline embedding throughput of 2,000 to 5,000 passages per second per GPU lets a 10-GPU cluster embed 100 million documents in 6 to 12 hours
Product quantization compresses 768-dimensional float16 vectors from 1.5 kilobytes to 64-128 bytes, shrinking a 100-million-vector index from 147 gigabytes to under 13 gigabytes at a 2-to-5-point recall@100 cost
Multi-stage cascade retrieves 1,000 candidates with fast ANN, then re-ranks the top 200 with an expensive cross-encoder, optimizing the cost-to-quality trade-off
Sharding by document-ID hash keeps per-shard size under 30 million vectors, bounding tail latency and enabling horizontal scaling
📌 Examples
Amazon product search uses 10 shards serving 500 QPS each at 5,000 total QPS, with 15-millisecond P95 ANN search and 2x replication for 99.9% availability
Meta's FAISS-based retrieval over 21 million passages uses IVF with product quantization, fitting the entire index in 3 gigabytes for fast single-machine serving
Microsoft Bing encodes queries on CPU with AVX-512 vector instructions at 10 to 30 milliseconds per query, avoiding GPU dependency for cost efficiency at high QPS