
BERT vs Sentence BERT: Token Context vs Sentence Similarity

BERT excels at token-level understanding but struggles with sentence similarity tasks. Each token receives a contextual embedding by attending to all other tokens in both directions through the transformer layers. This makes BERT powerful for named entity recognition, question answering, and span extraction, where precise token representations matter. However, obtaining a single sentence vector requires a pooling strategy, and naive approaches like using the [CLS] classification token or mean pooling often produce vectors that align poorly with semantic similarity.

Sentence BERT (SBERT) solves this by training a twin network architecture with contrastive objectives specifically for sentence embeddings. The model processes two sentences independently through shared BERT weights, then applies a pooling layer to produce fixed-size vectors. During training, the network learns to place semantically similar sentences close together and dissimilar ones far apart in vector space, using triplet loss or cosine similarity objectives. This alignment is critical: after SBERT training, cosine similarity between sentence vectors correlates directly with human judgments of semantic similarity.

The performance difference is dramatic in production retrieval scenarios. With raw BERT, comparing two sentences requires concatenating them and running a forward pass through the full model, which takes 50 to 150 milliseconds per pair on CPU. For 100 million documents, this approach is computationally infeasible. SBERT computes each sentence embedding once in 2 to 10 milliseconds on GPU, stores it, and later compares via dot product in microseconds. This enables sub-20-millisecond retrieval at scale.

The choice between the two shapes system architecture. Use BERT when you need token-level outputs, have a downstream cross-encoder that processes pairs jointly, or perform tasks like slot filling where precise token context matters. Use SBERT for semantic search, clustering, deduplication, and candidate retrieval, where you need fast vector comparison across large corpora. Many production systems use both: SBERT retrieves 500 to 2000 candidates in under 20 milliseconds, then a BERT-based cross-encoder re-ranks the top 100 to 200 in 30 to 80 milliseconds for higher precision.
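As a minimal sketch of this contrast, assuming the open source sentence-transformers library and two of its public pretrained checkpoints (all-MiniLM-L6-v2 as the bi-encoder, cross-encoder/ms-marco-MiniLM-L-6-v2 as the cross-encoder); the example sentences are made up for illustration:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Bi-encoder (SBERT style): each sentence is embedded once, independently;
# comparison afterwards is a cheap vector operation.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")   # small distilled SBERT variant
emb_a = bi_encoder.encode("How do I reset my password?")
emb_b = bi_encoder.encode("Steps to recover a forgotten password")
print(util.cos_sim(emb_a, emb_b))                      # cosine similarity of the two vectors

# Cross-encoder (BERT-style pair scoring): both sentences pass through the
# model together, so every comparison costs a full forward pass.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
score = cross_encoder.predict([
    ("How do I reset my password?", "Steps to recover a forgotten password"),
])
print(score)                                           # relevance score for the pair
```

The bi-encoder path lets document embeddings be computed and stored ahead of time; the cross-encoder path must run the full model again for every new pair.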
💡 Key Takeaways
BERT produces token embeddings with bidirectional context but requires pooling to get sentence vectors, and naive pooling underperforms on semantic similarity by 15 to 30 percent compared to SBERT
Sentence BERT uses twin networks and contrastive loss (triplet or cosine objectives) to align vector space with semantic similarity, enabling precomputation and fast retrieval
SBERT enables 10,000x faster comparison: compute embeddings once at 2 to 10 milliseconds per sentence on GPU, then compare via microsecond dot products instead of 50 to 150 millisecond BERT forward passes per pair
Production systems use two-stage pipelines: SBERT retrieves 500 to 2000 candidates in under 20 milliseconds, then BERT-based cross-encoders re-rank the top 100 to 200 in 30 to 80 milliseconds (see the sketch after this list)
MiniLM and distilled SBERT variants reduce parameters from 110 million to 22 to 33 million, achieving 2 to 5 millisecond latency on GPU with minimal quality loss for sentence tasks
Choose BERT for token-level tasks like named entity recognition and question-answering span extraction, and SBERT for semantic search, clustering, and deduplication at scale
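To make the two-stage pattern concrete, here is a rough sketch that retrieves candidates with a bi-encoder and re-ranks a shortlist with a cross-encoder. It again assumes the sentence-transformers library; the four-document corpus, the model names, and the shortlist size of 2 are placeholders for the 500 to 2000 candidates and 100 to 200 re-ranked documents a production system would use.

```python
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

corpus = [
    "Resetting a forgotten password via the account settings page",
    "Quarterly revenue grew 12 percent year over year",
    "Password recovery requires email verification",
    "The new office opens in Berlin next spring",
]

# Stage 1: bi-encoder retrieval. Corpus embeddings are computed once and reused.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = bi_encoder.encode(corpus, normalize_embeddings=True)

query = "how to recover my password"
query_emb = bi_encoder.encode(query, normalize_embeddings=True)
scores = corpus_emb @ query_emb             # dot product == cosine on normalized vectors
candidate_ids = np.argsort(-scores)[:2]     # keep a small shortlist of candidates

# Stage 2: cross-encoder re-ranking of the shortlist only.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, corpus[i]) for i in candidate_ids]
rerank_scores = cross_encoder.predict(pairs)
for i, s in sorted(zip(candidate_ids, rerank_scores), key=lambda x: -x[1]):
    print(f"{s:.3f}  {corpus[i]}")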
📌 Examples
Semantic search system: SBERT encodes 100 million documents offline into 384-dimensional vectors stored in a Hierarchical Navigable Small World (HNSW) index, the query embedding is computed in 5 milliseconds, and retrieval returns the top 1000 in 15 milliseconds (a minimal indexing sketch follows these examples)
Question answering pipeline: SBERT retrieves relevant passages in 20 milliseconds, then a BERT cross-encoder scores 200 candidates in 60 milliseconds to extract precise answer spans
Google uses two-tower retrieval, where a query tower and a document tower produce SBERT-style embeddings independently, enabling precomputation over billions of documents with real-time query encoding
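The first example relies on an approximate nearest neighbor index. Below is a minimal sketch of building and querying an HNSW index over 384-dimensional vectors, assuming the hnswlib package; the corpus size, index parameters, and the random vectors standing in for real SBERT embeddings are illustrative only.

```python
import numpy as np
import hnswlib

dim, num_docs = 384, 10_000            # toy scale; production corpora reach hundreds of millions
doc_embeddings = np.random.rand(num_docs, dim).astype(np.float32)  # stand-in for SBERT vectors

# Build the HNSW index offline; cosine space matches SBERT-style similarity.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_docs, ef_construction=200, M=16)
index.add_items(doc_embeddings, np.arange(num_docs))
index.set_ef(100)                      # query-time accuracy/latency trade-off

# At query time: encode the query with SBERT, then search the index in milliseconds.
query_embedding = np.random.rand(1, dim).astype(np.float32)
labels, distances = index.knn_query(query_embedding, k=10)
print(labels[0], distances[0])         # ids and cosine distances of the nearest documents
```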