What Is Embedding Quality Evaluation?
WHY EVALUATION MATTERS
Embeddings can look reasonable in visualization tools but fail in production. Two items might be close in embedding space but completely unrelated for your use case. Evaluation catches these failures before they affect users.
The core question: do similar items (as defined by your labels, clicks, or purchases) have similar embeddings? If the correlation is weak, your retrieval will surface irrelevant results regardless of how sophisticated your ANN index is.
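A minimal sanity check of that core question, using synthetic vectors (all names and data here are illustrative, not from any specific library): a pair labeled "same category" should score higher cosine similarity than a random pair.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(42)
base = rng.normal(size=16)
similar = base + rng.normal(scale=0.1, size=16)  # labeled "same category"
unrelated = rng.normal(size=16)                  # labeled "different"

# A pair your labels call similar should beat a random pair.
print(cosine(base, similar) > cosine(base, unrelated))  # expect True
```

Run this over many labeled pairs rather than one: if label-similar pairs don't consistently beat random pairs, retrieval quality will suffer no matter how the index is tuned.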
INTRINSIC VS EXTRINSIC METRICS
Intrinsic metrics: Measure embedding properties directly. Clustering quality, alignment with known similarity labels, distance statistics. Fast to compute, useful for debugging. Examples: silhouette score, alignment@k.
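A sketch of one intrinsic check, silhouette score, using scikit-learn on synthetic embeddings (the two-cluster data here is made up for illustration):

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two synthetic "categories" of embeddings, well separated in space.
emb_a = rng.normal(loc=0.0, scale=0.1, size=(50, 8))
emb_b = rng.normal(loc=1.0, scale=0.1, size=(50, 8))
embeddings = np.vstack([emb_a, emb_b])
labels = np.array([0] * 50 + [1] * 50)

# Near +1: embeddings cluster cleanly by label.
# Near 0 or negative: labels and embedding geometry disagree.
score = silhouette_score(embeddings, labels)
print(score > 0.5)
```

Because it needs only embeddings and labels (no retrieval system in the loop), a check like this runs in seconds and fits in a training-iteration loop.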
Extrinsic metrics: Measure performance on actual downstream tasks. Retrieval recall, classification accuracy, recommendation CTR. Slower to compute, but directly measures what you care about. Examples: NDCG@10, recall@100.
Rule of thumb: intrinsic metrics for fast iteration during development, extrinsic metrics for final decisions and production monitoring.
KEY METRICS FOR RETRIEVAL
Recall@K: What fraction of the ground-truth relevant items appear in the top K results? recall@100 = 0.90 means 90% of relevant items are in the first 100 candidates. Critical for two-stage retrieval, where Stage 2 cannot recover items Stage 1 misses.
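Recall@K reduces to a set intersection; a minimal implementation (the helper name and sample IDs are illustrative):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k ranked results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# 3 relevant items, 2 of them retrieved in the top 5 -> 2/3.
print(recall_at_k([7, 3, 9, 1, 4, 2], {3, 4, 2}, k=5))
```

Note that order within the top K does not matter here; recall only asks whether relevant items made the candidate set at all, which is exactly the Stage 1 question.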
NDCG@K: Measures ranking quality: are relevant items ranked near the top? It accounts for position, so a relevant item at rank 1 contributes more than the same item at rank 100. As a rough benchmark, NDCG@10 around 0.85 is good; below 0.7 usually signals ranking problems.
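The position discounting can be sketched for binary relevance (the function names are illustrative; `rels` is a 1/0 relevance flag per ranked position):

```python
import math

def dcg_at_k(rels, k):
    """Discounted cumulative gain: each hit is discounted by log2(rank + 1)."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    """DCG normalized by the ideal ordering, so 1.0 is a perfect ranking."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Two relevant items, one at rank 1 and one at rank 4: penalized
# relative to the ideal ranking [1, 1, 0, 0].
print(round(ndcg_at_k([1, 0, 0, 1], k=4), 3))  # 0.877
```

The log2 discount is why a miss at rank 1 hurts far more than a miss at rank 100: the normalization compares the observed ranking against the best achievable one for the same relevance set.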