Training Strategies for Extreme Class Imbalance: Resampling vs Weighting
The Training Problem
Standard training minimizes overall error. With a 1000:1 class ratio, the model sees 1000 normal examples for every fraud example. Gradient updates from the majority class dominate, so the model learns to predict "normal": that alone minimizes loss on 99.9% of the training data.
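A quick sanity check makes this failure mode concrete: a "model" that never flags fraud already scores ~99.9% accuracy while catching nothing. The dataset sizes below are illustrative, matching the 1000:1 ratio above.

```python
import numpy as np

# illustrative 1000:1 dataset: 0 = normal, 1 = fraud
y = np.array([0] * 1_000_000 + [1] * 1_000)

always_normal = np.zeros_like(y)            # predicts "normal" for everything
accuracy = (always_normal == y).mean()      # ~0.9990
fraud_recall = always_normal[y == 1].mean() # 0.0 -- not one fraud caught

print(f"accuracy={accuracy:.4f}, fraud recall={fraud_recall:.1f}")
```

High accuracy here is meaningless; it is exactly the degenerate solution the loss rewards.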
Two main strategies fix this: change the data distribution (resampling) or change how errors are weighted (class weighting).
Resampling Strategies
Undersampling: Remove majority examples until balanced. 1M normal + 1K fraud becomes 1K + 1K. Fast training but discards 99.9% of normal data, losing patterns.
Oversampling: Duplicate minority examples. 1M + 1K becomes 1M + 1M. Risk: model memorizes duplicates instead of learning patterns.
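Both schemes reduce to index sampling and can be sketched in a few lines of NumPy (toy data and function names are illustrative; libraries such as imbalanced-learn provide production versions):

```python
import numpy as np

rng = np.random.default_rng(0)

def undersample(X_maj, X_min):
    """Randomly keep only len(X_min) majority rows (without replacement)."""
    idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
    return X_maj[idx], X_min

def oversample(X_maj, X_min):
    """Duplicate minority rows (with replacement) up to len(X_maj)."""
    idx = rng.choice(len(X_min), size=len(X_maj), replace=True)
    return X_maj, X_min[idx]

# toy data: 1000 normal rows, 10 fraud rows, 3 features each
normal = rng.normal(size=(1000, 3))
fraud = rng.normal(size=(10, 3))

maj_u, min_u = undersample(normal, fraud)   # 10 + 10 rows
maj_o, min_o = oversample(normal, fraud)    # 1000 + 1000 rows
```

The oversampled minority set contains exact duplicates, which is precisely the memorization risk noted above.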
SMOTE: Creates synthetic minority examples by interpolating between existing ones. For each fraud example, find its k (typically 5) nearest fraud neighbors and create a new example along the line to a randomly chosen neighbor. Reduces memorization but can create unrealistic examples.
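A minimal SMOTE-style interpolation can be sketched as follows. This is a simplification (brute-force neighbor search, no edge-case handling); real implementations such as imbalanced-learn's SMOTE are more careful.

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority rows: pick a minority row, then
    interpolate toward one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Euclidean distance from row i to every minority row
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]   # skip the row itself
        j = rng.choice(neighbors)
        t = rng.random()                     # interpolation factor in [0, 1)
        out.append(X_min[i] + t * (X_min[j] - X_min[i]))
    return np.array(out)

fraud = np.random.default_rng(1).normal(size=(20, 3))
synthetic = smote_sample(fraud, n_new=50)    # shape (50, 3)
```

Because each synthetic row lies between two real fraud rows, it stays inside the minority class's bounding box, but it may still fall in a region where no real fraud could occur.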
Class Weighting
Instead of changing data, change the loss function. Multiply minority class loss by the imbalance ratio. If fraud is 0.1% of data, multiply fraud losses by 1000. One misclassified fraud then hurts as much as 1000 misclassified normal examples.
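Weighting is a one-line change to the loss. The sketch below is an illustrative weighted binary cross-entropy in plain NumPy (the function name and toy predictions are hypothetical):

```python
import numpy as np

def weighted_bce(y_true, p_pred, pos_weight):
    """Binary cross-entropy where positive (fraud) errors count pos_weight times."""
    eps = 1e-12
    p = np.clip(p_pred, eps, 1 - eps)      # avoid log(0)
    loss = -(pos_weight * y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return loss.mean()

y = np.array([1.0, 0.0, 0.0, 0.0])         # one fraud among normals
p = np.array([0.1, 0.1, 0.1, 0.1])         # model badly misses the fraud

plain = weighted_bce(y, p, pos_weight=1)
weighted = weighted_bce(y, p, pos_weight=1000)
# with pos_weight=1000 the single missed fraud dominates the total loss
```

Most frameworks expose this directly (e.g. a `pos_weight`-style argument or per-class weights), so no custom loss is usually needed.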
A common heuristic (scikit-learn's "balanced" mode): weight = total / (num_classes × class_count). For 1M total with 1K fraud: fraud weight = 1M / (2 × 1K) = 500. The majority class gets a weight of about 0.5 under the same formula, so the effective ratio between the classes is still ~1000.
When to Use Each
Class weighting: Simpler, preserves all data. Use as default for most problems.
Undersampling: When data is huge and speed matters. Accept information loss.
SMOTE: When minority class is under 100 examples and you need diversity. Validate that synthetic examples are realistic.