Loss Balancing and Gradient Interference
The Loss Balancing Problem
Each task has its own loss function. Classification uses cross-entropy. Regression uses mean squared error. Detection uses a combination of localization and classification losses. These losses operate on different scales and produce gradients of different magnitudes.
The problem: If detection loss is 100x larger than classification loss, the model optimizes almost entirely for detection. Classification performance suffers because its gradients get overwhelmed.
Manual Loss Weighting
The simplest approach: multiply each loss by a weight. Total loss = w1 × loss1 + w2 × loss2 + w3 × loss3. Tune weights manually until all tasks perform acceptably.
Practical approach: Start with weights that normalize loss magnitudes. If one loss averages 10 and another averages 0.1, use weights of 0.01 and 1.0 respectively. Then adjust based on validation performance.
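A minimal sketch of this idea in plain Python. One common convention, used here, is to set each weight to the inverse of that task's average loss magnitude so every weighted term starts near 1.0 (the text's 0.01/1.0 example uses the same ratio, just scaled). All loss values below are made-up numbers, not measurements from a real model.

```python
def normalizing_weights(avg_losses):
    """Weight each loss by the inverse of its average magnitude,
    so every weighted loss term starts near 1.0."""
    return {name: 1.0 / avg for name, avg in avg_losses.items()}

def total_loss(losses, weights):
    """Weighted sum: total = w1 * loss1 + w2 * loss2 + ..."""
    return sum(weights[name] * value for name, value in losses.items())

# Average loss magnitudes observed early in training (assumed numbers).
avg = {"detection": 10.0, "classification": 0.1, "regression": 1.0}
weights = normalizing_weights(avg)   # detection: 0.1, classification: 10.0

# A later training step: each weighted term is now on a comparable scale.
step_losses = {"detection": 8.0, "classification": 0.12, "regression": 0.9}
print(round(total_loss(step_losses, weights), 6))  # 0.8 + 1.2 + 0.9 = 2.9
```

In practice these would be tensors from a framework like PyTorch, but the arithmetic is the same; the normalizing weights are only a starting point before tuning on validation performance.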
Gradient Interference
Even with balanced losses, task gradients can conflict. Task A wants to increase a shared weight; Task B wants to decrease it. The conflicting components cancel, the net update is small, and both tasks stall. This is called destructive interference.
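A tiny numeric illustration of a conflict on shared weights. The gradient vectors here are hypothetical stand-ins for the per-task gradients of the shared layers; a negative cosine similarity between them is the usual signal that the tasks are pulling in opposing directions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity between two gradient vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical gradients of two tasks w.r.t. the same shared weights.
g_a = [1.0, 0.5]
g_b = [-0.9, 0.4]

net = [a + b for a, b in zip(g_a, g_b)]
print(cosine(g_a, g_b) < 0)  # True: the tasks conflict
print(net)                   # first component nearly cancels to ~0.1
```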
Diagnosis: Monitor individual task losses during training. If one task improves while another degrades, gradient interference is likely occurring in shared layers.
Mitigation: Gradient surgery techniques modify conflicting gradients before applying them. Project each task gradient to remove components that conflict with other tasks. This preserves beneficial updates while eliminating destructive ones.
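One published instance of this projection idea is PCGrad (Yu et al., 2020). The sketch below shows the core step on plain-Python vectors standing in for real parameter gradients: when two task gradients conflict (negative dot product), remove the offending component by projecting onto the normal plane of the other gradient.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_away_conflict(g, g_other):
    """If g conflicts with g_other (dot < 0), remove g's component
    along g_other; otherwise return g unchanged."""
    d = dot(g, g_other)
    if d >= 0:
        return list(g)                    # no conflict: keep as-is
    scale = d / dot(g_other, g_other)     # projection coefficient
    return [gi - scale * oi for gi, oi in zip(g, g_other)]

g_a = [1.0, 0.0]
g_b = [-1.0, 1.0]                         # conflicts with g_a (dot = -1)
g_a_fixed = project_away_conflict(g_a, g_b)
print(g_a_fixed)                          # [0.5, 0.5]
print(dot(g_a_fixed, g_b))                # 0.0: conflicting component gone
```

In a full implementation each task's gradient is projected against the others (often in random order) before the results are summed and applied; the non-conflicting part of each update survives intact.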
Dynamic Loss Weighting
Instead of fixed weights, adjust weights during training based on task difficulty or progress. Tasks that are learning slowly get higher weights; tasks that have converged get lower weights. This keeps all tasks improving throughout training.
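A sketch in the spirit of Dynamic Weight Averaging (Liu et al., 2019): each task's weight grows with the ratio of its current loss to its previous loss, so tasks whose losses have stopped dropping get more weight. The loss histories and temperature below are illustrative assumptions.

```python
import math

def dynamic_weights(prev_losses, curr_losses, temperature=2.0):
    """Weight_i proportional to exp(r_i / T), where r_i = curr_i / prev_i.
    r close to 1 means the task is improving slowly and gets more weight."""
    ratios = {k: curr_losses[k] / prev_losses[k] for k in curr_losses}
    scores = {k: math.exp(r / temperature) for k, r in ratios.items()}
    z = sum(scores.values())
    n = len(scores)
    # Normalize so the weights sum to the number of tasks.
    return {k: n * s / z for k, s in scores.items()}

prev = {"cls": 1.0, "det": 1.0}
curr = {"cls": 0.5, "det": 0.95}   # cls improving fast, det stalled
w = dynamic_weights(prev, curr)
print(w["det"] > w["cls"])          # True: the stalled task gets more weight
```

The temperature controls how aggressively weights shift: a large value keeps them near uniform, a small one reacts sharply to differences in progress.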