Intrinsic vs Extrinsic Evaluation Trade-offs
INTRINSIC EVALUATION METHODS
Intrinsic evaluation measures embedding properties without running the full downstream task. It is fast and useful for debugging but can be misleading about real performance.
Alignment: Measure the correlation between embedding distances and known similarity labels. If experts label pairs as similar or dissimilar, do the embeddings agree? A Spearman correlation of 0.7 or higher indicates reasonable alignment.
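A minimal sketch of the alignment check in NumPy, assuming you have two arrays of paired embeddings and a list of hypothetical expert similarity scores; Spearman correlation is computed here by ranking and applying Pearson, which is valid when there are no ties:

```python
import numpy as np

def alignment_score(emb_a, emb_b, human_scores):
    """Spearman rank correlation between per-pair cosine similarity
    and human similarity judgments (ranks + Pearson; assumes no ties)."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    cos_sim = np.sum(a * b, axis=1)              # per-pair cosine similarity
    rank = lambda x: np.argsort(np.argsort(x))   # rank transform
    return float(np.corrcoef(rank(cos_sim), rank(human_scores))[0, 1])

# toy pairs: identical vectors rank highest, orthogonal ones lowest
a = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
b = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 0.0]])
labels = np.array([1.0, 0.6, 0.1])   # hypothetical expert scores
print(alignment_score(a, b, labels))  # → 1.0: distances track the labels
```

In this toy case the cosine similarities and the expert scores have identical rank order, so the correlation is exactly 1.0.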
Uniformity: Are embeddings spread evenly across the unit sphere, or do they collapse into a small region? Collapsed embeddings lack expressiveness: every item looks similar to every other. Measure via average pairwise distance; a uniform distribution yields a higher average than a collapsed one.
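The average-pairwise-distance check can be sketched as follows, using synthetic data as a stand-in for real embeddings; a collapsed set (one base vector plus tiny noise) should score far lower than an isotropic Gaussian sample, which is approximately uniform on the sphere after normalization:

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_pairwise_distance(X):
    """Mean Euclidean distance over all distinct pairs of L2-normalized
    embeddings; values near 0 signal collapse."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    n = len(X)
    d = [np.linalg.norm(X[i] - X[j]) for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(d))

collapsed = rng.normal(size=(1, 8)) + 0.01 * rng.normal(size=(50, 8))  # tight blob
spread = rng.normal(size=(50, 8))  # isotropic Gaussian ≈ uniform on the sphere
print(avg_pairwise_distance(collapsed), avg_pairwise_distance(spread))
```

For points uniform on a unit sphere the average pairwise distance approaches sqrt(2) ≈ 1.41, while the collapsed set scores near zero.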
Cluster quality: If ground-truth clusters exist (categories, topics), do the embedding clusters match them? The silhouette score measures cluster coherence; a score above 0.3 indicates meaningful clustering.
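A self-contained sketch of the silhouette score, implemented directly from its definition rather than via a library call, applied to a toy dataset with two obvious clusters (assumes every cluster has at least two points):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient: (b - a) / max(a, b), where a is a
    point's mean intra-cluster distance and b is its mean distance to
    the nearest other cluster. Requires every cluster size >= 2."""
    X, labels = np.asarray(X), np.asarray(labels)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                       # exclude the point itself
        a = D[i][same].mean()
        b = min(D[i][labels == c].mean() for c in set(labels) - {labels[i]})
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]])
y = np.array([0, 0, 1, 1])
print(silhouette(X, y))  # close to 1: tight, well-separated clusters
```

Two tight, well-separated clusters score near 1, comfortably above the 0.3 threshold; overlapping clusters drive the score toward 0 or below.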
EXTRINSIC EVALUATION METHODS
Extrinsic evaluation runs the actual downstream task on held-out data. It is slower but directly measures what you care about.
Retrieval: Given a query, can the embeddings retrieve the correct documents? Measure recall@K and NDCG@K. This is the ground truth for search use cases.
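A minimal sketch of both retrieval metrics, assuming a precomputed query-by-document similarity matrix and exactly one relevant document per query (in which case NDCG@K reduces to 1 / log2(rank + 2)):

```python
import numpy as np

def recall_at_k(sim, relevant, k):
    """Fraction of queries whose relevant doc appears in the top-k results.
    sim: (queries x docs) similarity matrix; relevant: correct doc index
    for each query."""
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = [relevant[q] in topk[q] for q in range(len(relevant))]
    return float(np.mean(hits))

def ndcg_at_k(sim, relevant, k):
    """NDCG@k for the single-relevant-document case: gain is
    1 / log2(rank + 2) if the doc ranks within the top k, else 0."""
    order = np.argsort(-sim, axis=1)
    gains = []
    for q in range(len(relevant)):
        rank = int(np.where(order[q] == relevant[q])[0][0])
        gains.append(1.0 / np.log2(rank + 2) if rank < k else 0.0)
    return float(np.mean(gains))

# toy setup: 2 queries, 3 docs; each query's relevant doc scores highest
sim = np.array([[0.9, 0.1, 0.2],
                [0.2, 0.3, 0.8]])
relevant = [0, 2]
print(recall_at_k(sim, relevant, 1), ndcg_at_k(sim, relevant, 1))  # → 1.0 1.0
```

Real evaluations typically allow multiple relevant documents per query and graded relevance, but the single-relevant case above captures the mechanics.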
Classification: Train a simple classifier (linear probe) on top of frozen embeddings. If embeddings capture class-relevant information, the probe achieves high accuracy. Accuracy below 70% suggests embeddings miss important signals.
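A linear probe can be sketched in pure NumPy; this version uses closed-form ridge regression on one-hot targets rather than logistic regression (a common, cheaper stand-in), on synthetic "embeddings" with two linearly separable classes:

```python
import numpy as np

def linear_probe_accuracy(train_X, train_y, test_X, test_y, reg=1e-3):
    """Fit a ridge-regression probe on frozen embeddings (closed form,
    one-hot targets, no bias term) and report test accuracy."""
    classes = np.unique(train_y)
    Y = (train_y[:, None] == classes[None, :]).astype(float)  # one-hot targets
    d = train_X.shape[1]
    W = np.linalg.solve(train_X.T @ train_X + reg * np.eye(d), train_X.T @ Y)
    preds = classes[np.argmax(test_X @ W, axis=1)]
    return float(np.mean(preds == test_y))

# synthetic embeddings: two tight clusters standing in for two classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2.0, 0.0], 0.1, size=(20, 2)),
               rng.normal([0.0, 2.0], 0.1, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(linear_probe_accuracy(X, y, X, y))  # separable classes → accuracy 1.0
```

In practice you would fit on a training split and score a held-out split; the point of freezing the embeddings is that any accuracy the probe achieves must come from structure already present in them.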
WHEN INTRINSIC AND EXTRINSIC DISAGREE
Sometimes intrinsic metrics look good but extrinsic fails. This happens when embeddings capture structure but not the structure relevant to your task. A model might cluster by topic (good alignment) but miss user intent (bad retrieval).
Resolution: trust extrinsic metrics for final decisions. Use intrinsic metrics only for debugging why extrinsic is failing.