Embeddings & Similarity Search · Embedding Generation (BERT, Sentence-BERT, Graph Embeddings)
Difficulty: Medium · ⏱️ ~2 min

BERT vs Sentence BERT: Token Context vs Sentence Similarity

BERT: TOKEN-LEVEL UNDERSTANDING

BERT (Bidirectional Encoder Representations from Transformers) processes text by looking at each word in context of all other words. For the word "bank" in "river bank" vs "bank account," BERT produces different vector representations because the surrounding words differ. This context-awareness enables nuanced language understanding.

Problem for similarity search: BERT does not naturally produce sentence-level embeddings. The common workaround—averaging all token embeddings—loses sentence meaning. Two unrelated sentences might have similar averages by chance because averaging destroys ordering and emphasis.
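A tiny sketch makes the ordering problem concrete. The token vectors below are made-up 2-D values standing in for static (non-contextual) embeddings; with real contextual BERT vectors the averages would not be exactly identical, but averaging still discards word order and emphasis.

```python
import numpy as np

# Toy vocabulary of static token vectors -- made-up 2-D values
# purely for illustration.
vectors = {
    "man":   np.array([1.0, 0.0]),
    "bites": np.array([0.0, 1.0]),
    "dog":   np.array([1.0, 1.0]),
}

def average_embedding(sentence):
    """Mean-pool the token vectors of a whitespace-tokenized sentence."""
    return np.mean([vectors[tok] for tok in sentence.split()], axis=0)

a = average_embedding("man bites dog")
b = average_embedding("dog bites man")

# Averaging ignores word order: two sentences with opposite meanings
# collapse to the exact same vector.
print(np.allclose(a, b))  # True
```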

SENTENCE-BERT: OPTIMIZED FOR SIMILARITY

Sentence-BERT (SBERT) fine-tunes BERT specifically for sentence similarity. During training, a siamese network sees pairs of sentences labeled for similarity (e.g., natural language inference data) and learns to produce vectors that are close together for similar pairs. A pooling step, typically mean pooling over token embeddings (the [CLS] token is an alternative), produces a single sentence embedding optimized for cosine similarity comparison.
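The pooling step can be sketched in a few lines. This is a minimal masked mean-pooling sketch, assuming the model has already produced per-token contextual embeddings; the random array below is a placeholder for those, and the attention mask marks which positions are real tokens rather than padding.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings, counting only non-padding positions.

    token_embeddings: (seq_len, dim) array of contextual token vectors
    attention_mask:   (seq_len,) array of 1s for real tokens, 0s for padding
    """
    mask = attention_mask[:, None].astype(float)    # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)  # sum over real tokens only
    count = mask.sum()                              # number of real tokens
    return summed / count

# Placeholder "contextual" embeddings: 5 positions, last 2 are padding.
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 4))
mask = np.array([1, 1, 1, 0, 0])

sentence_vec = mean_pool(emb, mask)
print(sentence_vec.shape)  # (4,)
```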

Key difference: SBERT embeddings are directly comparable via cosine similarity. Two semantically similar sentences score near 1.0; unrelated sentences score near 0.0. No cross-attention between sentences is needed.
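The comparison itself is just the cosine of the angle between the two vectors. The 3-D vectors below are hypothetical embeddings chosen only to illustrate a near-1.0 and a near-0.0 case.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-D embeddings purely for illustration.
similar_pair = cosine_similarity(np.array([0.9, 0.1, 0.0]),
                                 np.array([0.8, 0.2, 0.1]))
unrelated = cosine_similarity(np.array([1.0, 0.0, 0.0]),
                              np.array([0.0, 1.0, 0.0]))

print(similar_pair)  # close to 1.0
print(unrelated)     # 0.0 (orthogonal vectors)
```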

PERFORMANCE COMPARISON

Cross-encoder (BERT): Feed both sentences together, model attends across both. Most accurate but O(N²) for N documents—1000 sentences = 500K pair evaluations. Unusable for search at scale.

Bi-encoder (SBERT): Embed each sentence independently, compare via dot product. O(N) embeddings + O(1) per comparison. 1000 sentences = 1000 embeddings, then use approximate nearest neighbor (ANN) search for sub-linear retrieval. Slightly less accurate but 10,000x faster.
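The work counts above follow from simple arithmetic, under the text's cost model of one model forward pass per unit of work:

```python
# Cost of scoring all pairs among N sentences vs. embedding each once.
N = 1000

cross_encoder_evals = N * (N - 1) // 2  # every unordered pair
bi_encoder_evals = N                    # one embedding per sentence

print(cross_encoder_evals)  # 499500 (~500K pair evaluations)
print(bi_encoder_evals)     # 1000
```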

⚠️ Key Trade-off: SBERT is faster but 2-5% less accurate than cross-encoder BERT. Common pattern: SBERT retrieves 100 candidates fast (5ms), cross-encoder re-ranks top 10 for maximum precision (50ms).
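The two-stage pattern can be sketched end to end. Random unit vectors stand in for SBERT embeddings, and a plain dot product stands in for the cross-encoder's relevance score; in practice both would be real model calls (e.g., a sentence-transformers bi-encoder plus a cross-encoder).

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in corpus: 10,000 random unit vectors playing the role of
# precomputed SBERT embeddings.
doc_embs = rng.normal(size=(10_000, 64))
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)

# Hypothetical query: a noisy copy of document 123, so we know the
# right answer in advance.
query_emb = doc_embs[123] + 0.1 * rng.normal(size=64)

def cross_encoder_score(query_vec, doc_vec):
    """Hypothetical stand-in for an expensive cross-encoder score."""
    return float(np.dot(query_vec, doc_vec))

# Stage 1: cheap bi-encoder retrieval of 100 candidates via dot product.
scores = doc_embs @ query_emb
candidates = np.argsort(scores)[::-1][:100]

# Stage 2: expensive re-ranking of only those 100 candidates.
reranked = sorted(candidates,
                  key=lambda i: cross_encoder_score(query_emb, doc_embs[i]),
                  reverse=True)
print(reranked[0])  # document 123 surfaces at the top
```

The design point is that the expensive scorer runs 100 times instead of 10,000 times, which is where the latency budget in the trade-off note comes from.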
💡 Key Takeaways
BERT: token-level context but no native sentence embedding
SBERT: fine-tuned for sentence similarity, directly comparable vectors
SBERT is O(N) for embedding; cross-encoder BERT is O(N²) for comparison
📌 Interview Tips
1. Explain the bi-encoder vs cross-encoder trade-off: speed vs accuracy, and how two-stage retrieval uses both.
2. Describe when to use each: SBERT for candidate generation, cross-encoder for re-ranking.