ML-Powered Search & Ranking: Dense Retrieval (BERT-based Embeddings)

What is Dense Retrieval with BERT-Based Embeddings?

Dense retrieval turns search from keyword matching into a geometric problem. Instead of counting word overlaps, a transformer encoder such as Bidirectional Encoder Representations from Transformers (BERT) converts both queries and documents into dense vectors in a continuous space, typically 768 dimensions for base models. Retrieval then becomes nearest-neighbor search: encode the query once, then compute similarity scores against precomputed document vectors using dot product or cosine similarity.

The core advantage is semantic understanding. Dense retrieval excels when users express intent differently than documents describe content: synonyms, paraphrases, and natural-language variation become geometric proximity rather than exact string matches. A query like "laptop overheating solutions" can match documents about "cooling notebook computers" because the embeddings capture semantic similarity that keyword matching misses.

Production systems use a two-tower architecture with separate encoders for queries and documents. The encoders may share weights initially but can specialize during training. Document vectors are precomputed offline in batch jobs, so online latency includes only query encoding (2 to 5 milliseconds on GPU) plus approximate nearest neighbor (ANN) search (5 to 20 milliseconds per shard). This dual-encoder approach trades away the accuracy of token-level cross attention for sublinear retrieval speed at scale.

The geometric intuition matters for system design. When vectors are L2-normalized to unit length, dot product and cosine similarity are mathematically equivalent, which simplifies implementation. Training uses contrastive learning: relevant query-document pairs should have high similarity scores (close in vector space) while irrelevant pairs should have low scores (far apart). This learned metric space enables retrieval over millions of candidates within the tight latency budgets that keep modern search responsive.
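The sketch below illustrates this flow with the sentence-transformers library: precompute normalized document vectors offline, encode the query online, and score by dot product. The model name, toy documents, and use of a single shared encoder for both towers are illustrative assumptions; a production system would fine-tune dedicated query and document encoders and search an ANN index rather than scoring the full corpus by brute force.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Example 768-dimensional BERT-base retriever; the model name is an
# illustrative assumption, not a prescribed choice.
encoder = SentenceTransformer("msmarco-bert-base-dot-v5")

documents = [
    "Cooling a notebook computer: clean the fans and reapply thermal paste.",
    "How to export spreadsheets to CSV format.",
    "Why laptops throttle under sustained load and how to reduce heat.",
]

# Offline: precompute L2-normalized document vectors once, shape (3, 768).
doc_vecs = encoder.encode(documents, normalize_embeddings=True)

# Online: encode the query, then score by dot product
# (equivalent to cosine similarity after normalization).
query_vec = encoder.encode(["laptop overheating solutions"], normalize_embeddings=True)[0]
scores = doc_vecs @ query_vec

for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```

Note that the "overheating" query should rank the cooling and throttling documents above the CSV one despite sharing no keywords with them, which is exactly the semantic-matching behavior described above.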
💡 Key Takeaways
BERT base models produce 768-dimensional embeddings representing text semantically, with query encoding taking 2 to 5 milliseconds on GPU when batched
Two-tower architecture separates query and document encoders, allowing document vectors to be precomputed offline and stored in an index for fast retrieval
Approximate nearest neighbor search over tens of millions of vectors completes in 5 to 20 milliseconds per shard, enabling sublinear retrieval scaling
L2-normalizing vectors makes dot product equivalent to cosine similarity, simplifying implementation and enabling efficient similarity computation (see the short check after this list)
Dense retrieval excels at semantic matching across paraphrases and synonyms but can fail on exact keyword matches like product SKUs without hybrid augmentation
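A quick numeric check of the normalization point above; the random 768-dimensional vectors are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
q, d = rng.normal(size=768), rng.normal(size=768)  # toy query/document vectors

# Cosine similarity on the raw vectors.
cosine = q @ d / (np.linalg.norm(q) * np.linalg.norm(d))

# L2-normalize both vectors to unit length, then take a plain dot product.
q_hat, d_hat = q / np.linalg.norm(q), d / np.linalg.norm(d)
dot_of_normalized = q_hat @ d_hat

# Identical up to floating-point error.
assert np.isclose(cosine, dot_of_normalized)
```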
📌 Examples
Google uses learned dual encoders with ScaNN-style ANN acceleration for semantic retrieval in web search and internal products, serving at scale with tens of milliseconds of latency
Meta open-sourced Dense Passage Retrieval (DPR), which retrieves from 21 million Wikipedia passages using FAISS-based ANN indices on GPU-backed services (a minimal index sketch follows this list)
Amazon e-commerce uses two-tower models for query-to-product matching at tens of thousands of queries per second (QPS) with sub-100-millisecond P95 latency
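As a minimal sketch of the kind of FAISS index referenced above, the snippet below builds an IVF (inverted file) index over precomputed document vectors and probes a handful of clusters per query. The corpus size, nlist, and nprobe values are illustrative assumptions; real deployments tune them per shard and often add product quantization to reduce memory.

```python
import faiss
import numpy as np

dim, n_docs = 768, 20_000
rng = np.random.default_rng(0)

# Stand-in corpus: in practice these are the precomputed document embeddings.
doc_vecs = rng.normal(size=(n_docs, dim)).astype("float32")
faiss.normalize_L2(doc_vecs)  # unit length so inner product equals cosine

# IVF index: cluster vectors into nlist cells, probe a few cells at query time.
nlist = 256
quantizer = faiss.IndexFlatIP(dim)
index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(doc_vecs)
index.add(doc_vecs)
index.nprobe = 16  # cells searched per query; trades recall against latency

# Online: encode and normalize the query, then take the top-10 approximate neighbors.
query = rng.normal(size=(1, dim)).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)
print(ids[0], scores[0])
```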