
Class Weighting and Focal Loss: Reweighting the Loss Function

Class weighting and focal loss address imbalance by changing how much each training example contributes to the loss function, rather than by altering the data distribution. Class weighting multiplies the loss for each class by a fixed weight, typically the inverse of class frequency. For fraud at 0.2% prevalence, you might weight positive examples by 500 and negatives by 1, so the 2,000 fraud cases contribute roughly as much total loss as the 998,000 legitimate transactions. The model then feels equal pressure to fit both classes, counteracting the gradient flood from easy negatives.

Focal loss goes further by distinguishing easy from hard examples within each class. It multiplies the per-example cross-entropy by a modulating factor, (1 - p_t)^gamma, where p_t is the predicted probability of the true class and gamma is a tunable parameter (typically 1 to 3). When gamma equals 2 and the model predicts 0.9 probability for a true positive, the loss is scaled down by a factor of (1 - 0.9)^2 = 0.01, nearly ignoring that easy positive. Hard examples where the model is uncertain keep close to full loss. This focuses gradient updates on the examples the model struggles with, whether positive or negative. Focal loss also supports an alpha class weight for additional asymmetry.

Class weighting is simple, fast, and preserves all training data without synthetic generation or memory overhead. It is the safe default at scale, works across tabular, text, and image domains, and keeps training time linear in data size. Its limitation is that it treats all negatives equally: a trivial legitimate transaction gets the same weight as a sophisticated edge case that resembles fraud. Focal loss addresses this by de-emphasizing easy negatives, which becomes critical when positives fall below 0.5%. Ad platforms with 0.5% click-through rate (CTR) report that focal loss or hard negative mining is necessary to achieve useful recall at high precision.

The main trade-off with focal loss is calibration. By compressing the loss on confident predictions, focal loss often produces poorly calibrated probability estimates: scores may rank correctly but do not reflect true likelihoods. If your business logic needs accurate probabilities to allocate human review capacity or set pricing, you must apply post-hoc calibration on a holdout set with the natural base rate. For systems that only need to rank and threshold (such as flagging the top 1% riskiest transactions), calibration matters less. Teams at Stripe and PayPal favor class weighting for interpretable, calibrated probabilities, and reserve focal loss for extreme imbalance or dense detection tasks like image-based content moderation.
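As a minimal sketch of both loss-reweighting approaches in PyTorch (the example tensors, the focal_loss helper, and the alpha/gamma defaults are illustrative choices, not a reference implementation): class weighting can be expressed through the pos_weight argument of BCEWithLogitsLoss, while focal loss applies the (1 - p_t)^gamma modulating factor on top of per-example cross-entropy.

```python
import torch
import torch.nn.functional as F

# Class weighting: with 998,000 legitimate and 2,000 fraud examples,
# inverse-frequency weighting gives positives a weight of 998_000 / 2_000 = 499
# (~500), so both classes contribute roughly equal total loss.
pos_weight = torch.tensor([998_000 / 2_000])
weighted_bce = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss sketch: scales per-example cross-entropy by (1 - p_t)**gamma."""
    # Unreduced cross-entropy so each example can be reweighted individually.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    # p_t is the predicted probability of the true class for each example.
    p_t = p * targets + (1 - p) * (1 - targets)
    # Optional alpha class weight for additional asymmetry (illustrative default).
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # Easy examples are suppressed: at p_t = 0.9 and gamma = 2, the factor is 0.01.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Illustrative usage with dummy logits and labels (1 = fraud, 0 = legitimate).
logits = torch.tensor([2.2, -1.0, 0.3])
labels = torch.tensor([1.0, 0.0, 1.0])
print(weighted_bce(logits, labels).item(), focal_loss(logits, labels).item())
```

Note that with gamma set to 0, the modulating factor in this sketch is 1 and focal loss reduces to plain class-weighted cross-entropy, so the two techniques sit on a continuum rather than being mutually exclusive.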
💡 Key Takeaways
Class weighting sets per-class loss multipliers, typically inverse to frequency: 0.2% fraud gets weight 500, legitimate gets weight 1 to balance gradient contributions
Focal loss uses a modulating factor with gamma parameter (1 to 3) that down-weights easy examples: confident prediction at 0.9 probability gets loss scaled by 0.01 when gamma equals 2
Class weighting is simpler and preserves calibration, making it the default choice for tabular and text models at scale with no memory or training time overhead
Focal loss is valuable for extreme imbalance below 0.5% and when easy negatives flood training, common in ad Click Through Rate prediction and dense image detection
Focal loss often produces poorly calibrated probabilities due to loss compression on confident examples, requiring post-hoc calibration on holdout sets for accurate risk estimates
Production systems combine both: class weights match business costs and priors, focal loss targets hard examples when base rate is below 0.1%
📌 Examples
Stripe fraud detection: class weighting with inverse frequency and cost-sensitive thresholds for calibrated probabilities and interpretable decisions
Ad platform CTR prediction at 0.5%: focal loss with gamma equals 2 to fight easy negative flood and achieve recall at high precision serving thresholds
Meta content moderation with hate speech below 0.1%: focal loss for training on billions of posts, followed by a calibration stage (sketched below) before routing to human review queues
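As a rough sketch of that post-hoc calibration stage (the array names and values are made up for illustration, and Platt scaling via scikit-learn is one common choice rather than any of these companies' actual pipelines): fit a one-feature logistic regression on held-out scores collected at the natural base rate, then use it to map raw scores to calibrated probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Held-out model scores and true labels collected at the natural base rate
# (illustrative values; in practice this holdout is much larger).
holdout_scores = np.array([0.92, 0.15, 0.88, 0.05, 0.40, 0.71, 0.63, 0.02])
holdout_labels = np.array([1, 0, 1, 0, 0, 1, 0, 0])

# Platt scaling: a one-dimensional logistic regression that learns a monotone
# sigmoid mapping from raw (possibly miscalibrated) scores to probabilities.
calibrator = LogisticRegression()
calibrator.fit(holdout_scores.reshape(-1, 1), holdout_labels)

# Calibrated probabilities for new scores, usable for allocating review capacity.
new_scores = np.array([0.80, 0.10])
print(calibrator.predict_proba(new_scores.reshape(-1, 1))[:, 1])
```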