
Failure Modes: Negative Transfer and Data Drift

Negative transfer is the central risk. When tasks are misaligned or loss balancing fails, shared features become a compromise that hurts every task instead of helping. The telltale symptom is a minority task whose validation metrics stay flat or decline while the dominant task improves. For example, a safety task with a 0.1% positive rate will be overwhelmed by a CTR task with a 10% positive rate unless you upsample, reweight, or apply focal loss; the model learns to optimize CTR and ignores safety. In production, this manifests as policy violations or user complaints that metrics did not predict.

Label misalignment and feature leakage create silent bugs. Conversion labels arrive hours or days after the impression, and if training windows are aligned incorrectly, the model may consume features computed after the conversion decision time. This inflates offline AUC by 5 to 10% but produces useless predictions in serving. Strict feature time-travel checks are required: every feature must carry a timestamp, and training must enforce that the feature timestamp precedes the label timestamp (see the first sketch below). Replay windows for delayed labels must be configured carefully. For example, train daily with a 7-day replay so conversion labels from days 1 through 6 update parameters, while features always come from impression time.

Calibration drift across heads breaks downstream business logic. A ranker combines predicted CTR and predicted CVR to compute expected value; if multi-task training shifts the scale of one head, the combined score distribution changes, and bid strategies or budget pacing break. Monitor per-head calibration weekly, and apply isotonic regression or Platt scaling per head on holdout data if calibration degrades (see the second sketch below). Some teams keep separate calibration layers per head that are retrained more frequently than the main model.

Data imbalance makes rare-event tasks fragile. A fraud detection head with one positive per 10,000 samples will drown in a sea of click labels. Mitigate with focal loss that down-weights easy negatives, upsample the minority class by 10 to 100 times, or train that head on a separate sampling schedule (see the third sketch below). Evaluation windows must be long enough to collect hundreds of positives for rare tasks before making rollout decisions. For example, a 1% experiment slice might need a week to collect enough fraud labels, while click labels are sufficient within hours.
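A time-travel check can be as simple as asserting, row by row, that no feature timestamp exceeds the decision time. A minimal sketch in pandas, assuming a hypothetical training frame where each feature column has a companion `<feature>_ts` timestamp column and `impression_ts` marks decision time (the schema is illustrative, not from the source):

```python
import pandas as pd

def assert_no_time_travel(examples: pd.DataFrame) -> None:
    """Fail fast if any feature was computed after the decision (impression) time.

    Assumes a hypothetical schema: one `<feature>_ts` timestamp column per
    feature, plus an `impression_ts` column for decision time.
    """
    feature_ts_cols = [
        c for c in examples.columns if c.endswith("_ts") and c != "impression_ts"
    ]
    for col in feature_ts_cols:
        leaked = examples[examples[col] > examples["impression_ts"]]
        if not leaked.empty:
            raise ValueError(
                f"{len(leaked)} rows leak future data via {col!r}: "
                "feature timestamp exceeds impression time"
            )
```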
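Per-head recalibration can live in a thin layer outside the main model. A minimal sketch using scikit-learn's IsotonicRegression, assuming hypothetical dicts of raw holdout scores and labels keyed by head name:

```python
from sklearn.isotonic import IsotonicRegression

class PerHeadCalibrator:
    """One isotonic calibrator per head, refit on fresh holdout data
    more often than the main model is retrained."""

    def __init__(self, head_names):
        self.calibrators = {
            h: IsotonicRegression(out_of_bounds="clip") for h in head_names
        }

    def fit(self, raw_scores, labels):
        # raw_scores, labels: dicts mapping head name -> 1-D arrays
        # collected from holdout traffic.
        for head, iso in self.calibrators.items():
            iso.fit(raw_scores[head], labels[head])

    def transform(self, head, scores):
        # Map raw head scores to calibrated probabilities.
        return self.calibrators[head].predict(scores)
```

Because the calibrators are decoupled from the network, they can be refit weekly on fresh holdout traffic without touching the model weights.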
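Both rare-event mitigations are short in PyTorch. A sketch of a standard binary focal loss plus a sampler that oversamples positives; the gamma and upsampling factor match the values mentioned in the text, while alpha is an illustrative assumption:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import WeightedRandomSampler

def binary_focal_loss(logits: torch.Tensor, targets: torch.Tensor,
                      gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
    """Focal loss: the (1 - p_t)^gamma term down-weights easy negatives so
    rare positives still contribute meaningful gradient. Targets are
    floats in {0, 1}."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # model's probability for the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()

def minority_upsampler(labels: torch.Tensor,
                       factor: float = 100.0) -> WeightedRandomSampler:
    """Sampler that draws positives ~`factor` times more often
    (the 10-100x range from the text)."""
    weights = torch.ones(len(labels))
    weights[labels == 1] = factor
    return WeightedRandomSampler(weights, num_samples=len(labels),
                                 replacement=True)
```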
💡 Key Takeaways
Negative transfer happens when tasks fight, producing shared features that are suboptimal for all tasks; minority tasks degrade while the dominant task improves
Label misalignment causes feature leakage: using post-decision features inflates offline AUC by 5 to 10% but fails in production serving
Calibration drift per head breaks downstream logic: predicted probabilities must remain calibrated or combined scores and bidding break
Rare-event tasks (fraud 0.01%, safety violations 0.1%) need focal loss, 10 to 100x upsampling, or separate sampling schedules
Rollout decisions for rare tasks require long evaluation windows: days to weeks to collect enough positive labels for statistical significance
📌 Examples
Meta safety classifier: Trained with 100x upsampling for policy violations (0.1% base rate) and focal loss (gamma 2.0); still required a 2-week holdout for stable AUC measurement
Uber fraud detection: Feature leakage from driver cancel reason (available only after the trip) inflated offline AUC from 0.78 to 0.92; caught during a serving A/B test with flat precision
Google ads calibration: Weekly isotonic regression per head on 1% holdout traffic; CVR head calibration drifted 5% after a training data shift, and recalibration restored bid accuracy