
Production Dense Retrieval Pipeline: Embedding, Indexing, and Serving

A production dense retrieval system operates across three planes: offline embedding and indexing, online query serving, and continuous monitoring. Each plane has distinct performance requirements and failure modes that shape the system architecture.

Offline embedding processes documents in large batches. Chunking comes first because transformer models have context-window limits. Production systems typically use 200-to-300-token passages with a 32-to-64-token overlap between chunks to handle concepts that span boundaries. A single modern GPU can encode 2,000 to 5,000 passages per second for BERT-base-sized models at batch size 128, so a 10-GPU cluster completes a full embedding pass over 100 million documents in 6 to 12 hours. Each 768-dimensional float32 vector consumes about 3 kilobytes, so 100 million passages require roughly 300 gigabytes of raw vector storage before compression.

ANN index construction balances memory, latency, and recall. Hierarchical Navigable Small World (HNSW) graphs offer strong recall with query latencies under 10 milliseconds at 10 million vectors per shard, but consume 1.5 to 2 times the raw vector memory. Inverted file with product quantization (IVF-PQ) compresses vectors to 64 to 128 bytes each, shrinking 100 million float16 vectors from 147 gigabytes to under 13 gigabytes and enabling single-machine deployment; the trade-off is a typical 2-to-5-point drop in recall@100. Sharding keeps per-shard sizes under 30 million vectors to bound tail latency, with stable hash partitioning by document ID.

Online serving has a tight latency budget. Query encoding takes 2 to 5 milliseconds on GPU with micro-batching, or 10 to 30 milliseconds on an optimized CPU with vector instructions. ANN search across 10 shards, taking the top 100 per shard, runs 5 to 20 milliseconds per shard in parallel. A broker merges and deduplicates results within 5 to 10 milliseconds. Cross-encoder re-ranking of the top 50 to 200 candidates adds 20 to 80 milliseconds on a T4-class GPU. Total P95 latency budgets are typically under 150 milliseconds. This multi-stage cascade is standard: retrieve 1,000 candidates with ANN, re-rank 200 with a fast transformer, return the top 20.

Capacity planning requires concrete math. At 5,000 QPS with 10 shards, each shard serves 500 QPS (this assumes queries are spread across shards; under full fan-out, every shard would see the total query rate). With 15-millisecond P95 ANN search and a concurrency of 20, a single server per shard handles that load; add 2x replication for availability. For query encoding, batching 10 queries at 3 milliseconds per batch yields roughly 3,000 QPS per GPU, so three GPUs with headroom serve 5,000 QPS at the target P95.
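A minimal sketch of the overlapping chunker described above. The 256-token window and 48-token overlap sit inside the recommended ranges but are otherwise arbitrary, and the integer list stands in for the token IDs a real tokenizer would produce:

```python
def chunk_tokens(tokens, window=256, overlap=48):
    """Split a token sequence into overlapping passages.

    window/overlap follow the 200-300 token and 32-64 token guidance
    above; the exact values here are illustrative, not prescribed.
    """
    if window <= overlap:
        raise ValueError("window must exceed overlap")
    stride = window - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), stride):
        chunks.append(tokens[start:start + window])
    return chunks

# A 600-token document yields chunks starting at positions 0, 208, 416,
# so any concept within 48 tokens of a boundary appears in two passages.
doc = list(range(600))  # stand-in for real token IDs
for i, chunk in enumerate(chunk_tokens(doc)):
    print(i, chunk[0], len(chunk))
```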
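An IVF-PQ index of the kind described can be built with FAISS. A sketch, assuming 768-dimensional embeddings and 64-byte PQ codes (m = 64 subquantizers at 8 bits each); the nlist and nprobe values are illustrative tuning knobs, and the random training data stands in for real passage embeddings:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, nlist = 768, 1024          # vector dim, number of inverted lists
m, nbits = 64, 8              # 64 subquantizers x 8 bits = 64-byte codes

quantizer = faiss.IndexFlatL2(d)                       # coarse quantizer
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

xb = np.random.rand(200_000, d).astype("float32")      # stand-in embeddings
index.train(xb)                                        # learn lists + PQ codebooks
index.add(xb)

index.nprobe = 32                  # lists probed per query: recall vs latency knob
scores, ids = index.search(xb[:1], 100)                # top-100 (the recall@100 regime)
```

Raising nprobe recovers some of the 2-to-5-point recall@100 loss at the cost of latency, which is the central tuning decision for a compressed index.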
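Stable hash partitioning by document ID can be as simple as the following; the MD5-based routing function is one illustrative choice, not a prescribed one:

```python
import hashlib

NUM_SHARDS = 10

def shard_for(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    # A stable hash (unlike Python's per-process randomized hash())
    # guarantees the same document always routes to the same shard,
    # across machines and restarts.
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

assert shard_for("doc-42") == shard_for("doc-42")  # deterministic routing
```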
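The retrieve-then-re-rank cascade, sketched end to end. `cross_encoder_score` is a hypothetical placeholder for a real cross-encoder forward pass, and the broker merge across shards is elided; the stage sizes match the 1,000 → 200 → 20 cascade above:

```python
import numpy as np

def cross_encoder_score(query: str, passage: str) -> float:
    # Placeholder scorer: in production this is a transformer forward
    # pass over the concatenated (query, passage) pair on a GPU.
    return float(len(set(query.split()) & set(passage.split())))

def serve_query(query_vec, query_text, index, passages,
                n_ann=1000, n_rerank=200, n_final=20):
    # Stage 1: cheap ANN recall (here a single FAISS-style index
    # stands in for the sharded fan-out plus broker merge).
    q = np.asarray(query_vec, dtype="float32").reshape(1, -1)
    _, ids = index.search(q, n_ann)
    candidates = [int(i) for i in ids[0] if i != -1]

    # Stage 2: expensive cross-encoder precision pass over the head
    # of the candidate list only, bounding the GPU cost per query.
    head = candidates[:n_rerank]
    head.sort(key=lambda pid: cross_encoder_score(query_text, passages[pid]),
              reverse=True)

    # Stage 3: return the final page of results to the caller.
    return head[:n_final]
```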
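Finally, the capacity arithmetic written out. Every input is taken from the figures in this section except the five-passages-per-document expansion, which is an added illustrative assumption needed to reconcile the document count with the embedding time:

```python
# Storage: 100M passages at 768 float32 dims (~3 KB each).
passages = 100_000_000
raw_gb   = passages * 768 * 4 / 1e9            # ~307 GB uncompressed
pq_gb    = passages * 64 / 1e9                 # ~6.4 GB at 64-byte PQ codes

# Embedding time: 100M docs, assumed ~5 passages/doc, 10 GPUs at the
# low end of the 2,000-5,000 passages/s/GPU range.
docs, per_doc, rate, gpus = 100_000_000, 5, 2_000, 10
embed_hours = docs * per_doc / (rate * gpus) / 3600   # ~6.9 hours

# Query encoding: batches of 10 at 3 ms per batch per GPU.
qps_per_gpu = 10 / 0.003                       # ~3,333 QPS; 3 GPUs cover 5,000

print(f"{raw_gb:.0f} GB raw, {pq_gb:.1f} GB PQ, "
      f"{embed_hours:.1f} h embed, {qps_per_gpu:.0f} QPS/GPU")
```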
💡 Key Takeaways
Chunking at 200 to 300 tokens with a 32-to-64-token overlap handles transformer context limits while preserving boundary-spanning concepts, creating manageable passage sizes
Offline embedding throughput of 2,000 to 5,000 passages per second per GPU lets a 10-GPU cluster embed 100 million documents in 6 to 12 hours
Product quantization compresses 768-dimensional float16 vectors from 1.5 kilobytes to 64-128 bytes, shrinking a 100-million-vector index from 147 gigabytes to under 13 gigabytes at a 2-to-5-point recall@100 cost
Multi-stage cascade retrieves 1,000 candidates with fast ANN, then re-ranks the top 200 with an expensive cross-encoder, optimizing the cost-to-quality trade-off
Sharding by document-ID hash keeps per-shard size under 30 million vectors, bounding tail latency and enabling horizontal scaling
📌 Examples
Amazon product search uses 10 shards serving 500 QPS each at 5,000 total QPS, with 15-millisecond P95 ANN search and 2x replication for 99.9% availability
Meta's FAISS-based retrieval over 21 million passages uses IVF with product quantization, fitting the entire index in 3 gigabytes for fast single-machine serving
Microsoft Bing encodes queries on CPU with AVX-512 vector instructions at 10 to 30 milliseconds per query, avoiding GPU dependency for cost efficiency at high QPS