Training Strategies for Extreme Class Imbalance: Resampling vs Weighting
Training supervised models under extreme class imbalance (0.01% to 1% anomaly rate) requires special techniques because standard algorithms optimize for the majority class and ignore rare events. The two main approaches are resampling the training data and applying class weights during optimization. Each trades off different aspects of model quality and training efficiency.
Resampling changes the class distribution in the training data. Undersampling removes 90% to 99% of normal examples to create a more balanced dataset, for example, reducing 1 million normal transactions and 100 fraud cases to 10,000 normal and 100 fraud. This dramatically reduces training time and memory, making it practical to train gradient boosted trees on commodity hardware. The downside is distorted probability calibration: the model learns at a 1% fraud rate but deploys at a 0.01% fraud rate. You must recalibrate probabilities using the true prior, and you lose whatever information the discarded majority examples carried. PayPal uses 10x to 50x undersampling followed by isotonic regression calibration.
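A minimal sketch of this pattern, assuming NumPy arrays and scikit-learn (the helper names `undersample` and `correct_undersampled_prob` are illustrative, not PayPal's pipeline). Instead of isotonic regression, it applies the standard closed-form prior correction for undersampled negatives, p_true = β·p / (β·p + 1 − p), where β is the fraction of normals kept:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

def undersample(X, y, keep_frac=0.01, seed=0):
    """Keep every fraud example (y == 1) and a random keep_frac of normals."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    neg_kept = rng.choice(neg, size=int(len(neg) * keep_frac), replace=False)
    idx = np.concatenate([pos, neg_kept])
    return X[idx], y[idx]

def correct_undersampled_prob(p, keep_frac):
    """Map scores learned at the inflated training fraud rate back to the
    deployment rate: p_true = beta*p / (beta*p + 1 - p), beta = keep_frac."""
    return keep_frac * p / (keep_frac * p + 1.0 - p)

# Usage: train on the balanced sample, then correct scores before thresholding.
# X_small, y_small = undersample(X_train, y_train, keep_frac=0.01)
# model = HistGradientBoostingClassifier().fit(X_small, y_small)
# p_deploy = correct_undersampled_prob(model.predict_proba(X_test)[:, 1], 0.01)
```

The correction matters whenever downstream systems consume the score as a probability; if you only rank and threshold, the monotone distortion from undersampling is harmless.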
Class weighting keeps all data but assigns higher loss to minority errors. Setting the fraud weight to 100x and the normal weight to 1x makes the optimizer care 100 times more about each fraud case. This preserves the full data distribution and produces better-calibrated probabilities. Gradient boosted trees and neural networks support native class weights. The challenge is training stability: extreme weights (over 1000x) can cause overfitting to noise in minority examples. Focal loss is an alternative that automatically emphasizes hard examples without manual weight tuning.
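Both ideas fit in a few lines. The sketch below assumes scikit-learn for the weighted model (most libraries expose an equivalent per-class or per-sample weight) and implements binary focal loss in NumPy for illustration; the synthetic data and the `focal_loss` helper are ours, not from any production system:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

# Synthetic stand-in data: 1,000 transactions, roughly 1% fraud.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.01).astype(int)

# Class weighting via per-sample weights: each fraud example contributes
# 100x the loss of a normal example during training.
weights = np.where(y == 1, 100.0, 1.0)
model = HistGradientBoostingClassifier().fit(X, y, sample_weight=weights)

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Binary focal loss: the (1 - p_t)**gamma factor shrinks the loss on
    easy, well-classified examples, so hard cases dominate the gradient."""
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(y == 1, p, 1.0 - p)  # model's probability of the true class
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))
```

With gamma=0, focal loss reduces to ordinary cross-entropy; gamma=2 (as in the Amazon example below) is the common default.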
Production systems often combine both. Stripe undersamples normals by 20x to control compute costs, applies 5x class weights to the remaining fraud cases for emphasis, and trains on 3 months of data (roughly 50 million transactions with 5,000 to 50,000 fraud labels). Models train overnight on CPU clusters. Hard negative mining adds a third technique: after initial training, replay high-scoring false positives back into training with extra weight to teach the model its mistakes.
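The mining loop itself is simple. A hedged sketch, again with scikit-learn; the round count, score cutoff, and extra weight below are placeholder parameters, not Stripe's or Uber's actual settings:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

def mine_hard_negatives(X, y, rounds=3, score_cut=0.9, extra_weight=3.0):
    """Retrain several times, upweighting normals the model scores as fraud
    (high-confidence false positives) so it learns from its own mistakes."""
    w = np.ones(len(y))
    model = HistGradientBoostingClassifier()
    for _ in range(rounds):
        model.fit(X, y, sample_weight=w)
        scores = model.predict_proba(X)[:, 1]
        hard_fp = (y == 0) & (scores > score_cut)
        w[hard_fp] *= extra_weight  # replay mistakes with extra weight
    return model
```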
💡 Key Takeaways
• Undersampling reduces the majority class by 10x to 100x to speed training and balance classes, but distorts calibration and discards information
• Class weighting (50x to 200x for fraud) preserves all data and calibration but can cause training instability with extreme ratios
• Focal loss automatically emphasizes hard examples without manual weight tuning, commonly used in neural network fraud models
• Hybrid approach: Stripe undersamples 20x plus 5x class weights on 3 months of data (50M transactions, 5K to 50K fraud labels)
• Hard negative mining replays high-confidence false positives back into training with extra weight to fix model mistakes
• SMOTE synthetic oversampling can create unrealistic interpolations in high dimensions and produce overly optimistic validation results (see the sketch after this list)
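On the SMOTE caveat: the optimistic-validation failure mode usually comes from oversampling before splitting, which leaks synthetic copies of validation minorities into training. A short sketch of the leakage-safe order (split first, then resample only the training split), assuming the imbalanced-learn package and synthetic data throughout:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 5,000 examples, roughly 2% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 8))
y = (rng.random(5000) < 0.02).astype(int)

# Split FIRST, then oversample only the training split; synthetic points
# derived from validation minorities would inflate validation metrics.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
model = HistGradientBoostingClassifier().fit(X_res, y_res)
print(average_precision_score(y_val, model.predict_proba(X_val)[:, 1]))
```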
📌 Examples
PayPal fraud model: 50x undersample of normal transactions, isotonic regression recalibration using the true 0.1% fraud prior, retrained weekly on a 100M transaction sample
Amazon account abuse detection: Focal loss with gamma=2 on a neural network, no resampling, trains on the full 6 months of signups (500M examples with 50K abuse cases)
Uber safety model: 10x undersample plus 10x class weights; hard negative mining adds 5,000 high-scoring false alarms per training iteration with 3x weight