
BERT vs Sentence BERT: Token Context vs Sentence Similarity

BERT excels at token-level understanding but struggles with sentence similarity tasks. Each token receives a contextual embedding by attending to all other tokens in both directions through the transformer layers. This makes BERT powerful for named entity recognition, question answering, and span extraction, where precise token representations matter. However, obtaining a single sentence vector requires a pooling strategy, and naive approaches like using the [CLS] classification token or mean pooling often produce vectors that align poorly with semantic similarity.

Sentence BERT (SBERT) solves this by training a twin network architecture with contrastive objectives specifically for sentence embeddings. The model processes two sentences independently through shared BERT weights, then applies a pooling layer to produce fixed-size vectors. During training, the network learns to place semantically similar sentences close together and dissimilar ones far apart in vector space, using triplet loss or cosine similarity objectives. This alignment is critical: after SBERT training, cosine similarity between sentence vectors correlates directly with human judgments of semantic similarity.

The performance difference is dramatic in production retrieval scenarios. With raw BERT, comparing two sentences requires concatenating them and running a forward pass through the full model, which takes 50 to 150 milliseconds per pair on CPU. For 100 million documents, this approach is computationally infeasible. SBERT computes each sentence embedding once in 2 to 10 milliseconds on GPU, stores it, and later compares via dot product in microseconds. This enables sub-20-millisecond retrieval at scale.

The choice between the two shapes system architecture. Use BERT when you need token-level outputs, have a downstream cross-encoder that processes pairs jointly, or perform tasks like slot filling where precise token context matters. Use SBERT for semantic search, clustering, deduplication, and candidate retrieval, where you need fast vector comparison across large corpora. Many production systems use both: SBERT retrieves 500 to 2000 candidates in under 20 milliseconds, then a BERT-based cross-encoder re-ranks the top 100 to 200 in 30 to 80 milliseconds for higher precision.
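As a minimal sketch of this contrast, assuming the open source sentence-transformers library and two of its public pretrained checkpoints (all-MiniLM-L6-v2 as the bi-encoder, cross-encoder/ms-marco-MiniLM-L-6-v2 as the cross-encoder); the example sentences are made up for illustration:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Bi-encoder (SBERT style): each sentence is embedded once, independently;
# comparison afterwards is a cheap vector operation.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")   # small distilled SBERT variant
emb_a = bi_encoder.encode("How do I reset my password?")
emb_b = bi_encoder.encode("Steps to recover a forgotten password")
print(util.cos_sim(emb_a, emb_b))                      # cosine similarity of the two vectors

# Cross-encoder (BERT-style pair scoring): both sentences pass through the
# model together, so every comparison costs a full forward pass.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
score = cross_encoder.predict([
    ("How do I reset my password?", "Steps to recover a forgotten password"),
])
print(score)                                           # relevance score for the pair
```

The bi-encoder path lets document embeddings be computed and stored ahead of time; the cross-encoder path must run the full model again for every new pair.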
💡 Key Takeaways
BERT produces token embeddings with bidirectional context but requires pooling to get sentence vectors, and naive pooling underperforms on semantic similarity by 15 to 30 percent compared to SBERT
Sentence BERT uses twin networks and contrastive loss (triplet or cosine objectives) to align vector space with semantic similarity, enabling precomputation and fast retrieval
SBERT enables 10,000x faster comparison: compute embeddings once at 2 to 10 milliseconds per sentence on GPU, then compare via microsecond dot products instead of 50 to 150 millisecond BERT forward passes per pair
Production systems use two-stage pipelines: SBERT retrieves 500 to 2000 candidates in under 20 milliseconds, then BERT-based cross-encoders re-rank the top 100 to 200 in 30 to 80 milliseconds (see the sketch after this list)
MiniLM and distilled SBERT variants reduce parameters from 110 million to 22 to 33 million, achieving 2 to 5 millisecond latency on GPU with minimal quality loss for sentence tasks
Choose BERT for token-level tasks like named entity recognition and question-answering span extraction, and SBERT for semantic search, clustering, and deduplication at scale
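To make the two-stage pattern concrete, here is a rough sketch that retrieves candidates with a bi-encoder and re-ranks a shortlist with a cross-encoder. It again assumes the sentence-transformers library; the four-document corpus, the model names, and the shortlist size of 2 are placeholders for the 500 to 2000 candidates and 100 to 200 re-ranked documents a production system would use.

```python
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

corpus = [
    "Resetting a forgotten password via the account settings page",
    "Quarterly revenue grew 12 percent year over year",
    "Password recovery requires email verification",
    "The new office opens in Berlin next spring",
]

# Stage 1: bi-encoder retrieval. Corpus embeddings are computed once and reused.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = bi_encoder.encode(corpus, normalize_embeddings=True)

query = "how to recover my password"
query_emb = bi_encoder.encode(query, normalize_embeddings=True)
scores = corpus_emb @ query_emb             # dot product == cosine on normalized vectors
candidate_ids = np.argsort(-scores)[:2]     # keep a small shortlist of candidates

# Stage 2: cross-encoder re-ranking of the shortlist only.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, corpus[i]) for i in candidate_ids]
rerank_scores = cross_encoder.predict(pairs)
for i, s in sorted(zip(candidate_ids, rerank_scores), key=lambda x: -x[1]):
    print(f"{s:.3f}  {corpus[i]}")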
📌 Examples
Semantic search system: SBERT encodes 100 million documents offline into 384-dimensional vectors stored in a Hierarchical Navigable Small World (HNSW) index, the query embedding is computed in 5 milliseconds, and retrieval returns the top 1000 in 15 milliseconds (a minimal indexing sketch follows these examples)
Question answering pipeline: SBERT retrieves relevant passages in 20 milliseconds, then a BERT cross-encoder scores 200 candidates in 60 milliseconds to extract precise answer spans
Google uses two-tower retrieval, where a query tower and a document tower produce SBERT-style embeddings independently, enabling precomputation over billions of documents with real-time query encoding
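The first example relies on an approximate nearest neighbor index. Below is a minimal sketch of building and querying an HNSW index over 384-dimensional vectors, assuming the hnswlib package; the corpus size, index parameters, and the random vectors standing in for real SBERT embeddings are illustrative only.

```python
import numpy as np
import hnswlib

dim, num_docs = 384, 10_000            # toy scale; production corpora reach hundreds of millions
doc_embeddings = np.random.rand(num_docs, dim).astype(np.float32)  # stand-in for SBERT vectors

# Build the HNSW index offline; cosine space matches SBERT-style similarity.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_docs, ef_construction=200, M=16)
index.add_items(doc_embeddings, np.arange(num_docs))
index.set_ef(100)                      # query-time accuracy/latency trade-off

# At query time: encode the query with SBERT, then search the index in milliseconds.
query_embedding = np.random.rand(1, dim).astype(np.float32)
labels, distances = index.knn_query(query_embedding, k=10)
print(labels[0], distances[0])         # ids and cosine distances of the nearest documents
```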