Failure Modes: False Negatives and Label Noise
FALSE NEGATIVES
The biggest risk in hard negative mining is treating actual positives as negatives. If item B is truly relevant to query A but the pair is unlabeled, mining will surface B as a hard negative (it scores high precisely because it is relevant), and the model learns to push B away. This directly harms recall.
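A toy sketch of the mechanism, using a standard triplet loss (the embeddings and margin here are illustrative, not from any real model): when an unlabeled-but-relevant item B sits closer to the anchor than the labeled positive P, the loss is large, and minimizing it pushes B away.

```python
import numpy as np

# Toy embeddings: query anchor A, its labeled positive P, and item B,
# which is truly relevant to A but unlabeled (a false negative when mined).
a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])
b = np.array([0.95, 0.05])  # very close to the anchor, because it IS relevant

def triplet_loss(anchor, pos, neg, margin=0.2):
    """Standard triplet loss; minimizing it pushes `neg` away from `anchor`."""
    d_pos = np.linalg.norm(anchor - pos)
    d_neg = np.linalg.norm(anchor - neg)
    return max(0.0, d_pos - d_neg + margin)

# B is closer to A than P is, so the loss is well above zero: training
# actively pushes the relevant item B away from the query, hurting recall.
loss = triplet_loss(a, p, b)
print(round(loss, 3))  # 0.271
```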
Symptoms: recall drops after adding hard negative mining. Model confidently ranks some relevant items at the bottom. Users complain about obvious matches not appearing.
Causes: incomplete labeling (most relevant pairs are not explicitly labeled), label noise (human annotators disagree or make errors), distribution shift (new items have no labels).
Detection: sample hard negatives and manually review. If >5% are actually relevant, your false negative rate is too high. Also monitor recall on a clean held-out test set—if it drops after mining, investigate.
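The audit step can be scripted. A minimal sketch (function names and the 200-pair sample size are illustrative): draw a random sample of mined pairs, have reviewers mark which are actually relevant, and compare the estimated false negative rate to the 5% threshold.

```python
import random

def sample_for_review(mined_negatives, k=200, seed=0):
    """Draw a reproducible random audit sample of mined (query, item) pairs."""
    rng = random.Random(seed)
    return rng.sample(mined_negatives, min(k, len(mined_negatives)))

def false_negative_rate(reviewed):
    """reviewed: list of (pair, is_actually_relevant) from manual review."""
    flagged = sum(1 for _, relevant in reviewed if relevant)
    return flagged / len(reviewed)

# Hypothetical outcome: reviewers mark 14 of 200 sampled pairs as relevant.
mined = [("q%d" % i, "doc%d" % i) for i in range(1000)]
audit = sample_for_review(mined, k=200)
reviewed = [(pair, i < 14) for i, pair in enumerate(audit)]
print(false_negative_rate(reviewed))  # 0.07 -> above 5%, investigate
```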
MITIGATION STRATEGIES
Confidence filtering: Only use negatives the model is confident about (sufficiently distant from the anchor). Skip the very hardest negatives, which are the most likely to be false negatives.
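A minimal sketch of this filter (the distance floor `d_min` and `top_k` are illustrative hyperparameters, not recommended values): rank candidates by distance to the anchor, then take the hardest ones that still clear the floor.

```python
import numpy as np

def mine_with_floor(anchor, candidates, d_min=2.0, top_k=20):
    """Select hard negatives (closest to the anchor) but drop anything
    closer than d_min: the very hardest candidates are the most likely
    to be unlabeled positives, i.e. false negatives."""
    dists = np.linalg.norm(candidates - anchor, axis=1)
    order = np.argsort(dists)                      # hardest (closest) first
    kept = [int(i) for i in order if dists[i] >= d_min]
    return kept[:top_k]

# Toy pool: 500 random 8-d candidate embeddings around a zero anchor.
anchor = np.zeros(8)
candidates = np.random.default_rng(1).normal(size=(500, 8))
idx = mine_with_floor(anchor, candidates, d_min=2.0, top_k=20)
print(len(idx))  # 20 moderately hard negatives, none closer than 2.0
```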
Cross-validation: Train multiple models and only use negatives that all of them agree are negative. Ensemble agreement reduces single-model errors.
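A sketch of the agreement filter, assuming a hypothetical interface where each model is a distance function `model(anchor, candidate) -> float` (the threshold and toy distance tables below are illustrative):

```python
def agreed_negatives(anchor, candidate_ids, models, d_min=0.3):
    """Keep a mined negative only if EVERY model in the ensemble places it
    at least d_min from the anchor; any disagreement suggests a possible
    false negative, so the candidate is dropped."""
    return [c for c in candidate_ids
            if all(model(anchor, c) >= d_min for model in models)]

# Toy ensemble: three models with slightly different distance estimates.
tables = [
    {"b1": 0.9, "b2": 0.1, "b3": 0.8},
    {"b1": 0.8, "b2": 0.7, "b3": 0.9},  # disagrees with the others on b2
    {"b1": 0.7, "b2": 0.2, "b3": 0.6},
]
models = [lambda a, c, t=t: t[c] for t in tables]

# b2 is dropped: two models place it very close to the anchor.
print(agreed_negatives("query", ["b1", "b2", "b3"], models))  # ['b1', 'b3']
```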
Soft labels: Instead of a binary negative label, use a continuous label based on distance. Very close items get a weak negative signal; distant items get a strong one.
LABEL NOISE AMPLIFICATION
Hard negative mining amplifies label noise. If 5% of negatives are mislabeled, random sampling sees 5% noise. But mining specifically selects the hardest examples, which are disproportionately mislabeled: they look hard precisely because they are actually positives. The noise rate in the mined set can reach 20-30%.
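The amplification effect can be demonstrated with a small simulation (the pool size and distance ranges are illustrative assumptions): false negatives sit close to the anchor because they are genuinely relevant, so distance-based mining over-selects them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# 5% of the candidate pool are unlabeled positives (false negatives).
# Being genuinely relevant, they sit close to the anchor; true negatives
# are spread farther out.
is_false_neg = np.zeros(n, dtype=bool)
is_false_neg[:50] = True
dist = np.where(is_false_neg,
                rng.uniform(0.0, 0.3, n),   # unlabeled positives: close
                rng.uniform(0.1, 1.0, n))   # true negatives: mostly farther

# Random negative sampling reflects the base noise rate (around 5%)...
random_idx = rng.choice(n, 100, replace=False)
print("random sample noise:", is_false_neg[random_idx].mean())

# ...but mining the 100 hardest (closest) candidates concentrates the
# mislabels, inflating the noise rate several-fold.
mined_idx = np.argsort(dist)[:100]
print("mined sample noise:", is_false_neg[mined_idx].mean())
```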