
MTEB and BEIR Benchmark Evaluation

WHAT ARE EMBEDDING BENCHMARKS

MTEB (Massive Text Embedding Benchmark) and BEIR (Benchmarking Information Retrieval) are standardized test suites that evaluate embedding models across many tasks. Instead of testing on one dataset, they aggregate performance across 56+ tasks (MTEB) or 18+ retrieval datasets (BEIR).

Why benchmarks matter: a model that excels on one dataset might fail on others. Benchmarks reveal how well embeddings generalize across domains, languages, and task types.

MTEB STRUCTURE

MTEB evaluates embeddings on 8 task categories: bitext mining, classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS), and summarization. Each category has multiple datasets with different characteristics.

The aggregate MTEB score is an average across all tasks. A model scoring 65/100 overall might score 75 on retrieval but only 55 on classification. Check the individual task scores relevant to your use case—the aggregate can hide weaknesses.
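To make this concrete, here is a minimal sketch (with hypothetical per-category scores, not real leaderboard numbers) showing how a single aggregate number can mask a weak category:

```python
# Hypothetical per-category scores for one embedding model (0-100 scale).
scores = {
    "retrieval": 75.0,
    "classification": 55.0,
    "clustering": 62.0,
    "sts": 80.0,
}

# A simple mean collapses the spread into one number.
aggregate = sum(scores.values()) / len(scores)
print(f"aggregate: {aggregate:.1f}")  # looks fine in isolation

# The per-category view reveals where the model actually struggles.
weakest = min(scores, key=scores.get)
print(f"weakest category: {weakest} ({scores[weakest]:.1f})")
```

If your use case is classification, the 68.0 aggregate here would badly overstate the model's fitness—which is exactly why you should read the per-task breakdown.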

BEIR FOR RETRIEVAL

BEIR focuses specifically on zero-shot retrieval—can an embedding model retrieve relevant documents from domains it has never seen during training? Datasets span scientific papers, financial documents, COVID-19 research, and more.

Key metrics: NDCG@10 (normalized discounted cumulative gain at rank 10) measures how well the model ranks relevant documents at the top. An NDCG@10 of 0.45 on BEIR is considered competitive; state-of-the-art models reach 0.50-0.55.
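The metric itself is straightforward to implement. A minimal sketch of NDCG@k for a single query, given the relevance labels of the ranked results (this is the standard formula, not BEIR-specific code):

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: each relevant result contributes
    # rel / log2(rank + 1), so hits near the top count more.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    # Normalize by the DCG of the ideal ordering (all relevant docs first),
    # so the score lands in [0, 1].
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Ranked relevance labels for one query (1 = relevant, 0 = not):
# relevant docs at ranks 1 and 3, so the ranking is good but not perfect.
print(round(ndcg_at_k([1, 0, 1, 0, 0]), 4))  # ~0.92
```

A benchmark score like BEIR's NDCG@10 is this value averaged over all queries in a dataset, then averaged over datasets.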

BENCHMARK LIMITATIONS

Benchmark datasets may not represent your production data. A model that tops the MTEB leaderboard might underperform on your specific domain. Always validate on your own held-out data after benchmark screening.
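Domain validation can be as simple as measuring recall@k on a small held-out set of query-document pairs from your own data. A minimal sketch, assuming you already have embedding vectors for queries and documents (toy 2-d vectors here stand in for real embeddings):

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def recall_at_k(query_vecs, doc_vecs, relevant_doc_ids, k=1):
    # Fraction of queries whose known-relevant doc appears in the top-k
    # results when ranking all docs by cosine similarity.
    hits = 0
    for q, rel_id in zip(query_vecs, relevant_doc_ids):
        ranked = sorted(range(len(doc_vecs)),
                        key=lambda i: -cosine(q, doc_vecs[i]))
        if rel_id in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)

# Toy held-out set: 3 docs, 2 queries; relevant[i] is the doc id
# that query i should retrieve.
docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = [[0.9, 0.1], [0.1, 0.9]]
relevant = [0, 1]
print(recall_at_k(queries, docs, relevant, k=1))  # 1.0 on this toy set
```

Running the same loop with two candidate models' embeddings gives a direct, domain-specific comparison that no public leaderboard can.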

⚠️ Key Trade-off: Benchmark-optimized models may overfit to benchmark distributions. Use benchmarks for initial model selection, then fine-tune and validate on your domain data.
💡 Key Takeaways
MTEB: 56+ tasks across 8 categories; aggregate score hides per-task weaknesses
BEIR: zero-shot retrieval across 18+ domains; NDCG@10 of 0.45+ is competitive
Benchmarks screen models but do not replace domain-specific validation
📌 Interview Tips
1. Explain why the aggregate MTEB score can be misleading—a model scoring 65 overall might score only 55 on your specific task.
2. Describe the key limitation—benchmark leaders may underperform on your domain, so always validate locally.