MTEB and BEIR Benchmark Evaluation
WHAT ARE EMBEDDING BENCHMARKS
MTEB (Massive Text Embedding Benchmark) and BEIR (Benchmarking Information Retrieval) are standardized test suites that evaluate embedding models across many tasks. Instead of testing on one dataset, they aggregate performance across 56+ tasks (MTEB) or 18+ retrieval datasets (BEIR).
Why benchmarks matter: a model that excels on one dataset might fail on others. Benchmarks reveal how well embeddings generalize across domains, languages, and task types.
MTEB STRUCTURE
MTEB evaluates embeddings on 8 task categories: bitext mining, classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS), and summarization. Each category has multiple datasets with different characteristics.
The aggregate MTEB score is an unweighted average across all tasks. A model scoring 65/100 overall might score 75 on retrieval but only 55 on classification. Check the individual task scores relevant to your use case: the aggregate can hide weaknesses.
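As an illustration of how an aggregate mean can mask a weak category, here is a minimal sketch using made-up per-category scores (the numbers are illustrative, not from any real leaderboard):

```python
# Hypothetical per-category scores for one model (illustrative numbers only).
scores = {"retrieval": 75.0, "classification": 55.0, "clustering": 62.0, "sts": 68.0}

def aggregate(scores):
    # Leaderboard-style unweighted mean across task categories.
    return sum(scores.values()) / len(scores)

print(aggregate(scores))  # 65.0 overall, despite a 20-point spread across categories
```

A model picked on the 65.0 aggregate alone could be the wrong choice for a classification-heavy workload.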
BEIR FOR RETRIEVAL
BEIR focuses specifically on zero-shot retrieval—can an embedding model retrieve relevant documents from domains it has never seen during training? Datasets span scientific papers, financial documents, COVID-19 research, and more.
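The retrieval step being evaluated is itself simple: embed the query and every document, then rank documents by similarity. A minimal sketch, assuming toy 2-dimensional vectors in place of real model embeddings:

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy corpus embeddings; in practice these come from the embedding model.
corpus = {"doc_a": [0.9, 0.1], "doc_b": [0.1, 0.9]}
query = [0.8, 0.2]

# Rank documents by similarity to the query, best first.
ranked = sorted(corpus, key=lambda d: cosine(query, corpus[d]), reverse=True)
```

Zero-shot evaluation asks whether this ranking stays accurate when the corpus comes from a domain the model never saw in training.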
Key metrics: NDCG@10 (normalized discounted cumulative gain at rank 10) measures how well the model ranks relevant documents within the top 10 results. An average NDCG@10 of 0.45 across BEIR datasets is considered competitive; state-of-the-art models reach 0.50-0.55.
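NDCG@10 can be computed directly from the relevance labels of a ranked result list: each relevant hit is discounted by the log of its rank, then normalized against the ideal ordering. A minimal sketch:

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain: relevance at rank i discounted by log2(i + 1),
    # with ranks starting at 1 (hence i + 2 for a 0-based index).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of the ideal (relevance-descending) ordering.
    ideal = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# A relevant document at rank 1 scores a perfect 1.0; pushing it to rank 3 halves it.
print(ndcg_at_k([1, 0, 0], 10))  # 1.0
print(ndcg_at_k([0, 0, 1], 10))  # 0.5
```

Libraries such as scikit-learn (`ndcg_score`) and the BEIR toolkit compute this for you, but the metric itself is this small.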
BENCHMARK LIMITATIONS
Benchmark datasets may not represent your production data. A model that tops the MTEB leaderboard might underperform on your specific domain. Always validate on your own held-out data after benchmark screening.
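One lightweight way to run that validation is to label a small set of query-document pairs from your own data and score each candidate model with a simple metric such as recall@k. A sketch (the function and its inputs are illustrative, not part of any benchmark toolkit):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    # Fraction of the known-relevant documents that appear in the top-k results.
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# Example: the model ranked doc "a" first but buried relevant doc "c" at rank 3.
ranking = ["a", "b", "c"]
relevant = {"a", "c"}
print(recall_at_k(ranking, relevant, k=2))  # 0.5
```

A few hundred labeled pairs scored this way often reveals domain gaps that aggregate benchmark numbers conceal.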