BERT vs Sentence BERT: Token Context vs Sentence Similarity
BERT: TOKEN-LEVEL UNDERSTANDING
BERT (Bidirectional Encoder Representations from Transformers) processes text by looking at each word in the context of all other words. For the word "bank" in "river bank" vs "bank account," BERT produces different vector representations because the surrounding words differ. This context-awareness enables nuanced language understanding.
Problem for similarity search: BERT does not naturally produce sentence-level embeddings. The common workarounds—averaging all token embeddings, or taking the [CLS] token's vector—yield poor sentence representations out of the box. Two unrelated sentences might have similar averages by chance, because averaging destroys ordering and emphasis.
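To see why averaging loses sentence meaning, here is a minimal toy sketch. The three "token embeddings" below are hypothetical static vectors, not real BERT outputs (real BERT vectors are contextual and would differ somewhat by position); the point is that the pooling step itself is order-invariant, so reordered sentences collapse to the same point.

```python
# Toy illustration with hypothetical 3-d "token embeddings": mean pooling
# ignores word order, so reordered sentences average to an identical vector.

def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

# Hypothetical static embeddings for three words.
dog   = [1.0, 0.0, 0.0]
bites = [0.0, 1.0, 0.0]
man   = [0.0, 0.0, 1.0]

s1 = mean_pool([dog, bites, man])   # "dog bites man"
s2 = mean_pool([man, bites, dog])   # "man bites dog"
print(s1 == s2)  # True: averaging destroys ordering
```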
SENTENCE-BERT: OPTIMIZED FOR SIMILARITY
Sentence-BERT (SBERT) fine-tunes BERT specifically for sentence similarity. During training, the model processes sentence pairs through a siamese (twin) network and learns to produce vectors that are close together for similar sentences. A pooling layer—typically mean pooling over token embeddings, though the [CLS] token can also be used—produces a single sentence embedding optimized for cosine similarity comparison.
Key difference: SBERT embeddings are directly comparable via cosine similarity. Two semantically similar sentences score near 1.0; unrelated sentences score near 0.0. No cross-attention between sentences is needed at comparison time.
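The comparison step is just a cosine similarity over precomputed vectors. A minimal stdlib-only sketch, using hypothetical 3-d sentence embeddings in place of real SBERT outputs (which would typically be 384- or 768-dimensional):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical sentence embeddings: the first two point in similar
# directions (similar sentences); the third is orthogonal (unrelated).
query     = [0.9, 0.1, 0.0]
similar   = [0.8, 0.2, 0.0]
unrelated = [0.0, 0.0, 1.0]

print(cosine_similarity(query, similar))    # close to 1.0
print(cosine_similarity(query, unrelated))  # 0.0
```

In practice a library such as sentence-transformers computes the embeddings; the similarity arithmetic is exactly this.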
PERFORMANCE COMPARISON
Cross-encoder (BERT): Feed both sentences together so the model attends across both. Most accurate, but comparing all pairs of N documents costs O(N²) forward passes—1000 sentences means roughly 500K pair evaluations. Unusable for search at scale.
Bi-encoder (SBERT): Embed each sentence independently, then compare via dot product. O(N) forward passes for the embeddings, with each comparison a cheap vector operation. 1000 sentences = 1000 embeddings, then approximate nearest neighbor (ANN) indexing gives sub-linear query time. Slightly less accurate than a cross-encoder, but orders of magnitude faster.
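The cost gap above is simple combinatorics. A back-of-envelope sketch counting expensive model forward passes for a corpus-wide similarity task (all pairs for the cross-encoder, one embedding per document for the bi-encoder):

```python
def cross_encoder_evals(n):
    """Full forward passes for a cross-encoder: one per unordered pair."""
    return n * (n - 1) // 2

def bi_encoder_evals(n):
    """Full forward passes for a bi-encoder: one embedding per document.
    Subsequent comparisons are cheap dot products (or sub-linear via ANN)."""
    return n

n = 1000
print(cross_encoder_evals(n))  # 499500 -- the "roughly 500K" pair evaluations
print(bi_encoder_evals(n))     # 1000
```

The gap widens quadratically: at 10,000 documents the cross-encoder needs about 50 million forward passes versus 10,000 for the bi-encoder.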