
MTEB and BEIR Benchmark Evaluation

Benchmarks bring comparability across tasks and domains, letting teams evaluate embedding models before committing to an expensive production integration. The Massive Text Embedding Benchmark (MTEB) evaluates models on eight task types: retrieval, clustering, semantic textual similarity (STS), classification, pair classification, reranking, bitext mining, and summarization. It aggregates results with task-appropriate metrics: Spearman correlation for similarity tasks, Recall@K and nDCG@10 for retrieval, and V-measure (a normalized mutual information variant) for clustering. With 58 datasets spanning 112 languages, MTEB provides broad coverage.

BEIR (Benchmarking Information Retrieval) addresses a critical gap: domain robustness. It covers 18 diverse retrieval datasets that stress different query types, domains, and document characteristics, including scientific papers (SCIDOCS), question answering (Natural Questions), fact verification (FEVER), news (TREC-NEWS), biomedical search (TREC-COVID), and argument retrieval. The datasets deliberately exclude common training corpora to test true generalization: a model might score an nDCG@10 of 85 on standard web search yet drop to 45 on biomedical queries, revealing brittleness that a single global score would hide.

Strong practice therefore evaluates across multiple axes rather than chasing one aggregate number. Google and Meta stratify by slice: query length (short queries under 3 tokens vs. long queries over 12 tokens), language, domain (news vs. shopping vs. reference), and time (queries from last month vs. last year, to detect drift). Spotify segments by track popularity (head vs. tail) and user tenure (new vs. established listeners). Slicing reveals where models fail and guides targeted improvements.

Real production decisions layer benchmarks with domain-specific evaluation. A team might use MTEB as a fast screening tool, filtering out models that score below 60 on retrieval tasks, then run BEIR to check robustness, and finally evaluate on 50,000 labeled query-document pairs from production logs with editorial judgments before committing to a backfill and A/B testing.
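As a concrete starting point, the sketch below shows what such a fast screening pass might look like with the open-source mteb package and sentence-transformers. The model name, the two retrieval tasks, and the output folder are illustrative assumptions, not recommendations, and the task-selection API shown is the one documented by the mteb project.

```python
# A minimal MTEB screening sketch, assuming the `mteb` and
# `sentence-transformers` packages; the candidate model and the two
# retrieval tasks are illustrative choices only.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # candidate embedding model

# A small retrieval-only subset for a quick first-pass screen.
tasks = mteb.get_tasks(tasks=["SciFact", "NFCorpus"])
evaluation = mteb.MTEB(tasks=tasks)

# Writes per-task scores (nDCG@10, Recall@K, MAP, ...) as JSON to the folder.
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
```

A candidate that clears the screening threshold on these public tasks would then move on to the robustness and in-domain stages described above.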
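Slice-level reporting itself needs no framework support. The sketch below computes mean nDCG@10 per query-length bucket from ranked relevance judgments; the input shape and the short/medium/long boundaries are assumptions chosen to mirror the split described above.

```python
# Slice-level evaluation sketch: mean nDCG@10 per query-length bucket, so a
# strong global average cannot hide a weak slice. Data shapes and bucket
# boundaries are illustrative assumptions.
import math
from collections import defaultdict

def ndcg_at_k(relevances, k=10):
    """relevances: graded relevance of results in ranked order.
    IDCG is computed over the same list (a common simplification)."""
    dcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def length_bucket(query):
    n = len(query.split())
    return "short (<3)" if n < 3 else "long (>12)" if n > 12 else "medium (3-12)"

def ndcg_by_slice(judged_queries):
    """judged_queries: list of (query_text, [relevance of each retrieved doc, in rank order])."""
    per_slice = defaultdict(list)
    for query, relevances in judged_queries:
        per_slice[length_bucket(query)].append(ndcg_at_k(relevances))
    return {name: sum(scores) / len(scores) for name, scores in per_slice.items()}

# Toy example with graded judgments (3 = perfect, 0 = irrelevant).
print(ndcg_by_slice([
    ("jaguar", [0, 3, 1]),
    ("how do transformer embeddings handle negation in long compositional legal queries about indemnification", [3, 2, 0]),
]))
```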
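Finally, the decision funnel itself can be made explicit. The gate below strings the three stages together; every threshold is an assumption standing in for whatever a team's own baselines dictate, with scores on a 0-100 scale.

```python
# Illustrative gate for the layered evaluation funnel. The 60 MTEB screen,
# the 40 BEIR robustness floor, and the requirement of lift over the
# production baseline are assumptions, not published standards.
def passes_funnel(mteb_retrieval_avg, beir_ndcg_by_domain, in_domain_ndcg, baseline_ndcg):
    if mteb_retrieval_avg < 60:
        return False, "failed MTEB retrieval screen"
    if min(beir_ndcg_by_domain.values()) < 40:
        return False, "brittle on at least one BEIR domain"
    if in_domain_ndcg <= baseline_ndcg:
        return False, "no lift on in-domain labeled query-document pairs"
    return True, "proceed to backfill and A/B test"

# Example: strong averages but a weak biomedical slice still fails the gate.
print(passes_funnel(68.5, {"web": 85, "news": 72, "trec-covid": 38}, 61.0, 58.3))
```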
💡 Key Takeaways
MTEB evaluates eight task types (retrieval, clustering, STS, classification, and more) across 58 datasets and 112 languages with task-specific metrics like Spearman correlation and nDCG
BEIR tests domain robustness across 18 retrieval datasets (scientific, medical, news, QA), deliberately excluding common training data to measure true generalization
Models can show 40-point nDCG@10 gaps between domains: 85 on web search but 45 on biomedical queries, revealing brittleness that overall scores hide
Production teams stratify by slice (query length, language, domain, time) to detect localized failures and guide targeted improvements
Layered evaluation uses MTEB for fast screening (filtering out models below 60 on retrieval), BEIR for a robustness check, then domain-specific sets with 50k labeled pairs before rollout
📌 Examples
A model scoring 82 overall on MTEB might show 88 on web search, 76 on scientific papers, and 52 on legal documents, guiding domain-specific fine-tuning
BEIR reveals that a model trained on MS MARCO drops 25 nDCG points on TREC COVID biomedical queries, prompting addition of domain data
Google stratifies search evaluation by query length: short queries under 3 tokens, medium 3 to 12 tokens, long over 12 tokens, revealing that models underperform on compositional long queries