Hard Negative Mining (Triplet Loss, Contrastive Learning)

Triplet Loss and Contrastive Loss Formulations

Triplet loss is the foundational formulation for hard negative mining. It operates on triplets consisting of an anchor, a positive that should be similar, and a negative that should be dissimilar. The loss enforces that the distance from anchor to positive, plus a margin, is less than the distance from anchor to negative: loss = max(0, d(anchor, positive) - d(anchor, negative) + margin). When this constraint is satisfied, the loss is zero. When it is violated, the gradient pushes the negative away while pulling the positive closer.

The margin hyperparameter controls the geometric separation between classes in embedding space; a typical margin is 0.2 to 0.5 for L2-normalized embeddings. Monitoring the fraction of active triplets, those contributing non-zero loss, helps tune the margin. A healthy training run maintains 20 to 40% active triplets. Below 10%, negatives are too easy and gradients vanish; above 60%, the margin is too large or mining is too aggressive, risking instability.

Contrastive losses generalize triplets to handle many positives and many negatives simultaneously. InfoNCE (Information Noise Contrastive Estimation) treats one positive and N negatives as a classification problem over N + 1 choices: the loss is the negative log of the probability assigned to the positive, computed via a softmax over similarities. Temperature scaling controls the sharpness of that softmax, with lower temperatures (0.05 to 0.1) creating harder distinctions. Supervised contrastive loss extends this to support multiple positives per anchor, averaging the loss over all of an anchor's positives rather than relying on a single one. This reduces sensitivity to outlier positives and stabilizes training when positives have high variance.

Production systems often prefer contrastive losses because they compute all pairwise similarities once per batch in O(batch size squared) complexity, then apply efficient masking to select valid pairs. Triplet loss requires explicit triplet construction, which is sampling-inefficient. With a batch of 1024 and cross-device gathering, each anchor sees 1023 in-batch negatives under a contrastive loss, whereas triplet loss would need careful sampling to achieve similar coverage.
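A minimal PyTorch-style sketch of the triplet loss and the active-triplet monitoring described above; the function name, the use of plain Euclidean distance, and the random stand-in embeddings are illustrative assumptions rather than a reference implementation.

```python
import torch
import torch.nn.functional as F

def triplet_loss_with_stats(anchor, positive, negative, margin=0.3):
    """Triplet loss on L2-normalized embeddings of shape [batch, dim].

    Per-triplet loss: max(0, d(anchor, positive) - d(anchor, negative) + margin).
    Also returns the fraction of active (non-zero-loss) triplets, the quantity
    to watch when tuning the margin (healthy range is roughly 20-40%).
    """
    d_pos = F.pairwise_distance(anchor, positive)   # d(anchor, positive), shape [batch]
    d_neg = F.pairwise_distance(anchor, negative)   # d(anchor, negative), shape [batch]
    losses = F.relu(d_pos - d_neg + margin)         # zero when the margin constraint holds
    active_fraction = (losses > 0).float().mean()   # <10%: negatives too easy; >60%: too aggressive
    return losses.mean(), active_fraction

# Toy usage with random stand-in embeddings (an encoder would normally produce these).
a, p, n = (F.normalize(torch.randn(32, 128), dim=1) for _ in range(3))
loss, active = triplet_loss_with_stats(a, p, n, margin=0.3)
```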
💡 Key Takeaways
Triplet loss margin of 0.2 to 0.5 controls geometric separation. Monitor active triplet fraction at 20 to 40% to detect if negatives are too easy or too hard
InfoNCE temperature scaling (0.05 to 0.1) sharpens the distribution over negatives. Lower temperatures create harder distinctions but risk numerical instability (see the in-batch InfoNCE sketch after this list)
Supervised contrastive loss averages over multiple positives per anchor, reducing sensitivity to outlier positives and stabilizing gradients when positive variance is high
Contrastive losses compute O(batch size squared) similarities once per batch, enabling efficient mining. Triplet loss requires explicit triplet sampling, reducing coverage
With cross-device gathering and batch size 1024, each anchor sees 1023 in-batch negatives for contrastive loss without extra memory or passes
Loss choice depends on task: triplet loss with careful mining for small batches, InfoNCE for large batch systems, supervised contrastive when multiple positives exist
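As a companion to the takeaways above, here is a hedged sketch of InfoNCE with in-batch negatives. It assumes L2-normalized embeddings where queries[i] and keys[i] form the positive pair; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(queries, keys, temperature=0.07):
    """InfoNCE over in-batch negatives.

    queries, keys: [batch, dim], L2-normalized. Row i of each forms the
    positive pair; the remaining batch - 1 keys act as negatives, so each
    anchor's softmax runs over batch terms (1 positive + batch - 1 negatives).
    """
    logits = queries @ keys.t() / temperature              # [batch, batch] scaled cosine similarities
    labels = torch.arange(queries.size(0), device=queries.device)
    return F.cross_entropy(logits, labels)                 # -log(softmax prob assigned to the positive)

# With batch size 1024 (optionally after cross-device gathering of keys),
# every query is scored against 1023 in-batch negatives in a single matmul.
q = F.normalize(torch.randn(1024, 128), dim=1)
k = F.normalize(torch.randn(1024, 128), dim=1)
loss = info_nce(q, k, temperature=0.07)
```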
📌 Examples
Triplet loss training with margin 0.3: if the anchor-to-positive distance is 0.15 and the anchor-to-negative distance is 0.4, the loss is max(0, 0.15 - 0.4 + 0.3) = 0.05. The triplet is active and contributes gradient.
InfoNCE with temperature 0.07 and 2047 in-batch negatives: the softmax denominator sums over 2048 terms. The low temperature amplifies similarity differences, so a negative with cosine similarity 0.9 versus 0.85 creates a larger probability gap than it would at temperature 0.2.
Supervised contrastive loss for product images: each anchor product has 3 positives (different angles of the same item) and 2045 negatives in the batch. Averaging over the 3 positives prevents overfitting to one specific camera angle (a minimal sketch of this loss follows below).
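To make the multi-positive example concrete, here is a sketch of a supervised-contrastive-style loss that averages over each anchor's positives. The masking details and names are assumptions for illustration, not the exact formulation of any particular paper or system.

```python
import torch
import torch.nn.functional as F

def sup_con_loss(embeddings, labels, temperature=0.07):
    """Supervised contrastive loss with multiple positives per anchor.

    embeddings: [batch, dim], L2-normalized; labels: [batch] integer ids
    (e.g. product ids, so different camera angles of one item share a label).
    """
    sim = embeddings @ embeddings.t() / temperature                  # all pairwise similarities
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(self_mask, float('-inf'))                  # drop self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)       # log softmax per anchor row
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    # Keep only the positives' log-probabilities and average them per anchor
    # (e.g. the 3 extra views of the same product), so no single positive dominates.
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0)
    per_anchor = -pos_log_prob.sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return per_anchor.mean()
```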