Production Dense Retrieval Pipeline: Embedding, Indexing, and Serving
Offline Indexing Pipeline
Step 1: Encode all documents into embeddings with the document encoder. For 10M documents at 768 dimensions this produces ~30GB of float32 vectors; at 100 docs/second, encoding takes ~28 hours on a single GPU, so parallelize across multiple GPUs or machines.
Step 2: Build an ANN (Approximate Nearest Neighbor) index for fast retrieval. HNSW indices offer the best accuracy (95-99% recall); IVF indices offer better memory efficiency.
Step 3: Deploy the index to serving infrastructure with health checks and a fallback to the stale index on failures.
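The sizing and throughput numbers above can be checked, and Step 1 sketched, in a few lines. This is a minimal sketch: the encoder here is a hypothetical stand-in that returns random unit vectors, and the batching loop is the shape a real encoding job would take, not a production implementation.

```python
import numpy as np

DIM = 768
N_DOCS = 10_000_000

# Sizing check from the text: 10M float32 vectors at 768 dimensions.
bytes_total = N_DOCS * DIM * 4        # 4 bytes per float32
gb_total = bytes_total / 1e9          # ~30.7 GB
hours = N_DOCS / 100 / 3600           # ~27.8 hours at 100 docs/second

def encode_batch(texts):
    """Hypothetical stand-in for a real document encoder."""
    vecs = np.random.default_rng(0).standard_normal(
        (len(texts), DIM)).astype(np.float32)
    # L2-normalize so dot product equals cosine similarity.
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def build_embeddings(corpus, batch_size=256):
    """Step 1: encode the corpus in batches, stack into one matrix."""
    parts = [encode_batch(corpus[i:i + batch_size])
             for i in range(0, len(corpus), batch_size)]
    return np.vstack(parts)

emb = build_embeddings([f"doc {i}" for i in range(1000)])
print(round(gb_total, 1), round(hours, 1), emb.shape)  # 30.7 27.8 (1000, 768)
```

In production, the resulting matrix would be fed into an ANN library (e.g. an HNSW or IVF index) rather than searched directly.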
Online Serving Pipeline
At query time: encode the query with the query encoder (10-50ms depending on model size), search the ANN index for the top-K candidates (1-10ms), and optionally re-rank with a cross-encoder for higher precision. Total latency is 20-100ms depending on model size and index configuration; throughput is 100-1000 QPS per replica. GPU encoding is faster but more expensive; CPU suffices for lower-traffic applications. Cache embeddings for frequent queries to skip encoding entirely.
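The serving path above (encode, search, cache) can be sketched as follows. This is a simplified illustration with assumed pieces: the query encoder is a hypothetical hash-seeded stub, the search is exact brute-force standing in for an ANN index, and `functools.lru_cache` plays the role of the query-embedding cache.

```python
from functools import lru_cache
import numpy as np

DIM = 64

# Toy document index: 10k random unit vectors (stand-in for a real ANN index).
rng = np.random.default_rng(1)
doc_emb = rng.standard_normal((10_000, DIM)).astype(np.float32)
doc_emb /= np.linalg.norm(doc_emb, axis=1, keepdims=True)

@lru_cache(maxsize=10_000)
def encode_query(text: str) -> bytes:
    """Hypothetical query encoder; lru_cache skips re-encoding hot queries.
    Returns bytes so the cached value is hashable and immutable."""
    seed = abs(hash(text)) % (2**32)
    v = np.random.default_rng(seed).standard_normal(DIM).astype(np.float32)
    return (v / np.linalg.norm(v)).tobytes()

def search(text: str, k: int = 10) -> list[int]:
    """Encode the query, then retrieve top-k by dot product.
    Exact search here; production would query an ANN index instead."""
    q = np.frombuffer(encode_query(text), dtype=np.float32)
    scores = doc_emb @ q
    return np.argsort(-scores)[:k].tolist()

hits = search("how to deploy an index")
print(len(hits))  # 10
```

A cross-encoder re-ranking stage would take these top-K candidates and re-score each (query, document) pair jointly before returning final results.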
Index Update Strategies
Full rebuild: Re-encode all documents and rebuild the index from scratch. Simple but slow (hours); good for daily or weekly updates.
Incremental: Add new document embeddings to the existing index without a full rebuild. Faster, but index quality degrades over time, so periodic full rebuilds are still needed.
Streaming: Real-time updates with specialized indices. Higher complexity, but enables minute-level freshness for time-sensitive content.
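The incremental-plus-periodic-rebuild pattern above can be sketched with a main/delta split: new vectors land in a cheap append-only buffer that is searched alongside the main index, and a rebuild folds the buffer back in. The class and its API are hypothetical illustrations, not a real library; a production system would wrap an ANN index rather than raw matrices.

```python
import numpy as np

class IncrementalIndex:
    """Sketch (hypothetical API): incremental adds via a delta buffer,
    with periodic full rebuilds merging the delta into the main index."""

    def __init__(self, dim: int):
        self.dim = dim
        self.main = np.empty((0, dim), dtype=np.float32)
        self.delta = []

    def add(self, vec: np.ndarray) -> None:
        # Incremental path: cheap append, no rebuild.
        self.delta.append(vec.astype(np.float32))

    def rebuild(self) -> None:
        # Full-rebuild path: merge the delta back into the main matrix.
        if self.delta:
            self.main = np.vstack([self.main, np.stack(self.delta)])
            self.delta = []

    def search(self, q: np.ndarray, k: int = 5) -> list[int]:
        # Search main + delta so fresh documents are visible pre-rebuild.
        mats = [self.main] + ([np.stack(self.delta)] if self.delta else [])
        scores = np.vstack(mats) @ q.astype(np.float32)
        return np.argsort(-scores)[:k].tolist()

idx = IncrementalIndex(dim=8)
rng = np.random.default_rng(0)
for _ in range(20):
    idx.add(rng.standard_normal(8))
idx.rebuild()
idx.add(rng.standard_normal(8))           # visible immediately, pre-rebuild
print(idx.main.shape[0], len(idx.delta))  # 20 1
```

The streaming strategy pushes this further: instead of a buffer flushed on rebuild, updates are applied to the live index continuously, which is what specialized real-time ANN systems provide.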