End-to-End Production Architecture for Imbalanced Data Systems
Production systems for imbalanced data are not just training algorithms; they are end-to-end pipelines spanning data curation, training, evaluation, calibration, deployment, and operations. Consider a card network fraud detector handling 15 million transactions daily with 0.2% fraud. Online inference runs at 5,000 to 20,000 requests per second with p99 latency under 10 milliseconds. Labels arrive with a 7 to 30 day delay due to chargeback windows. The architecture starts with an events store that maintains features, labels, and timestamps with time-travel (point-in-time) query capability for reproducibility.
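To make the time-travel requirement concrete, here is a minimal point-in-time join sketch in pandas; the frame and column names (transactions, feature_snapshots, card_id, ts) are illustrative, not part of any specific events store. `pd.merge_asof` pairs each event with the latest feature values known at or before that event, which is the property an events store with time travel has to guarantee for reproducible training sets.

```python
import pandas as pd

# Minimal point-in-time ("as-of") join: each transaction gets the most recent
# feature snapshot computed at or before its timestamp, so no feature reflects
# information that was unavailable when the event occurred.
transactions = pd.DataFrame({
    "card_id": [1, 2, 1],
    "ts": pd.to_datetime(["2024-03-01 10:00", "2024-03-02 12:00", "2024-03-05 09:00"]),
    "label": [0, 0, 1],
}).sort_values("ts")

feature_snapshots = pd.DataFrame({
    "card_id": [1, 2, 1],
    "ts": pd.to_datetime(["2024-02-28 00:00", "2024-03-01 00:00", "2024-03-03 00:00"]),
    "txn_velocity_7d": [3.0, 1.0, 9.0],
}).sort_values("ts")

training_rows = pd.merge_asof(
    transactions,
    feature_snapshots,
    on="ts",
    by="card_id",
    direction="backward",   # only snapshots at or before the transaction time
)
print(training_rows)
```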
The training pipeline includes a resampling service that creates training shards with controlled class ratios. Teams often downsample negatives to a 1:20 positive-to-negative ratio for training efficiency while recording both the true prior of 0.2% and the resulting training prior of roughly 5%. The trainer applies class weighting to reweight the loss back to the true prior, ensuring the model learns the correct decision boundary despite seeing artificially balanced data. Validation and test sets are kept at the natural 0.2% prevalence to compute Precision Recall Area Under the Curve (PR AUC), cost curves, and policy thresholds that reflect real deployment conditions. Training on GPU clusters might take 4 to 8 hours for a model on 100 million transactions with 200 features.
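A minimal sketch of the prior-correcting weights, assuming the 0.2% true prior and the roughly 5% training prior from the downsampled shards. Weighting each class by P_true(c) / P_train(c) and passing the result as per-sample weights makes the weighted training loss an estimate of the loss under the true class distribution; the model and data below are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Prior-correcting class weights: negatives were downsampled, so the training
# prior (~5%) no longer matches the true prior (0.2%). Weighting each class by
# P_true(c) / P_train(c) reweights the loss back to the true prior.
TRUE_PRIOR, TRAIN_PRIOR = 0.002, 0.05

w_pos = TRUE_PRIOR / TRAIN_PRIOR                  # ~0.04: down-weight over-represented fraud
w_neg = (1 - TRUE_PRIOR) / (1 - TRAIN_PRIOR)      # ~1.05: restore weight of legit traffic

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 8))                    # placeholder features
y = (rng.random(5000) < TRAIN_PRIOR).astype(int)  # downsampled training shard

sample_weight = np.where(y == 1, w_pos, w_neg)
model = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=sample_weight)
```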
After training, a calibration stage runs if focal loss was used or if probability estimates are critical for downstream logic. Calibration uses isotonic regression or Platt scaling on a held-out set at natural prevalence. The calibrated model goes to a model registry with versioning, metadata, and A/B testing support. Deployment uses a multi-stage serving architecture. Stage one is a lightweight model with 50 to 100 features, optimized for sub-5-millisecond p99 latency. It uses class weighting and a threshold tuned for high recall at moderate precision, catching 95% of fraud while flagging 1% to 5% of all transactions. These flagged transactions route to stage two, a richer model with 500 features and possibly ensemble methods, running under 50 to 100 milliseconds p99. Stage two applies stricter thresholds and may escalate the top 0.1% to human review queues with finite capacity, such as 30,000 items per day.
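A sketch of the calibration step under these assumptions: raw scores from a focal-loss model on a held-out set at the natural 0.2% prevalence are mapped through scikit-learn's `IsotonicRegression`, and a stage-one threshold is then read off the calibrated scores at a 95% recall target. The synthetic scores and labels are purely illustrative.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import precision_recall_curve

# Post-hoc calibration on a held-out set drawn at natural prevalence, then
# threshold selection for the high-recall first serving stage.
rng = np.random.default_rng(1)
n = 50_000
y_holdout = (rng.random(n) < 0.002).astype(int)                   # natural 0.2% prevalence
raw_scores = np.clip(rng.beta(1, 8, n) + 0.4 * y_holdout, 0, 1)   # stand-in for focal-loss model scores

calibrator = IsotonicRegression(out_of_bounds="clip").fit(raw_scores, y_holdout)
p_holdout = calibrator.predict(raw_scores)

# Largest threshold that still achieves >= 95% recall on the calibrated holdout.
precision, recall, thresholds = precision_recall_curve(y_holdout, p_holdout)
stage_one_threshold = thresholds[recall[:-1] >= 0.95].max()
```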
Operational monitoring is continuous. The system tracks base rate estimates from delayed labels, computing rolling 7 day and 30 day fraud rates. It monitors score distribution drift using Kolmogorov-Smirnov tests or the Population Stability Index (PSI) on key features like transaction amount, merchant category, and user velocity. It measures precision, recall, and F1 at deployed thresholds on cohorts, and tracks calibration drift by bucketing predictions and comparing predicted versus observed rates. Alerts fire if the base rate shifts by more than a factor of 2, if precision drops below 20% for more than 60 minutes, or if the queue backlog exceeds capacity. These signals trigger retraining, threshold adjustments, or incident response.
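As an illustration of the drift checks, here is a small Population Stability Index sketch. The bin count, the 0.1/0.25 rules of thumb for alerting, and the synthetic transaction-amount data are assumptions for illustration, not values from the source.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference (e.g., training-window) distribution and the
    current production distribution of a feature or score.
    PSI = sum((p_actual - p_expected) * ln(p_actual / p_expected)).
    Rule of thumb (tune per system): <0.1 stable, 0.1-0.25 watch, >0.25 alert."""
    # Bin edges come from the reference quantiles, so production drift shows up
    # as probability mass moving between the same fixed bins.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    p_exp = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    p_act = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((p_act - p_exp) * np.log(p_act / p_exp)))

# Illustrative check on transaction amounts: a shifted distribution inflates PSI.
rng = np.random.default_rng(2)
baseline_amounts = rng.lognormal(mean=3.0, sigma=1.0, size=100_000)
todays_amounts = rng.lognormal(mean=3.3, sigma=1.0, size=100_000)
print(round(population_stability_index(baseline_amounts, todays_amounts), 3))
```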
Feedback loops close the system. User interactions, chargebacks, and manual review outcomes flow back into the events store as labels. A nightly batch job updates training datasets, retrains models, evaluates them on the latest validation window, and publishes new models if they beat existing production models on PR AUC and cost metrics. A/B testing frameworks route a small percentage of traffic to candidate models to measure online metrics like false decline rate and fraud catch rate before full rollout. This continuous learning loop adapts to evolving fraud tactics, seasonal traffic patterns, and merchant mix changes.
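The nightly promotion gate could look like the following sketch: candidate and production scores on the latest validation window are compared on PR AUC (via scikit-learn's `average_precision_score`) and on expected cost at the deployed threshold. The cost constants, threshold, and score arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Nightly promotion gate: the candidate replaces the production model only if
# it wins on PR AUC *and* on expected cost at the deployed threshold.
COST_FN, COST_FP = 50.0, 1.0      # missed fraud vs. false decline / review load
DEPLOYED_THRESHOLD = 0.7

def expected_cost(y_true, scores, threshold):
    flagged = scores >= threshold
    false_negatives = np.sum((y_true == 1) & ~flagged)
    false_positives = np.sum((y_true == 0) & flagged)
    return COST_FN * false_negatives + COST_FP * false_positives

def should_promote(y_val, scores_prod, scores_cand):
    better_pr_auc = (average_precision_score(y_val, scores_cand)
                     > average_precision_score(y_val, scores_prod))
    cheaper = (expected_cost(y_val, scores_cand, DEPLOYED_THRESHOLD)
               < expected_cost(y_val, scores_prod, DEPLOYED_THRESHOLD))
    return better_pr_auc and cheaper
```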
At scale, teams at Stripe, PayPal, and Amazon run dozens of models segmented by geography, merchant vertical, and transaction type. Each segment has its own prevalence and cost structure, requiring tailored thresholds and possibly distinct training recipes. Feature stores provide online and offline feature access with consistent computation logic to prevent training-serving skew. The entire system is orchestrated with workflow engines, monitored with dashboards showing real-time metrics, and governed by strict SLAs on latency, availability, and accuracy.
💡 Key Takeaways
• Production architecture spans an events store with time travel, a resampling service for controlled class ratios, training with class weighting, and validation on natural prevalence
• Multi-stage serving: stage one under 5ms p99 targets high recall (95%) and flags 1% to 5% of traffic, stage two under 50ms p99 applies strict thresholds, and human review has a 30,000 items per day capacity
• Training pipeline downsamples negatives to a 1:20 ratio, applies class weighting to reflect the true 0.2% prior, and trains 4 to 8 hours on 100 million transactions with 200 features on GPU clusters
• Calibration stage uses isotonic regression or Platt scaling on holdout sets with natural prevalence after focal loss training to ensure accurate probability estimates for queue allocation
• Monitoring tracks the base rate from delayed labels (7 to 30 days), score distribution drift via Population Stability Index (PSI), precision and recall at deployed thresholds, and queue backlog versus capacity
• Feedback loops close the system: user interactions and chargebacks update the events store; nightly retraining publishes new models if they beat production on PR AUC and cost metrics, gated by A/B testing
📌 Examples
Stripe fraud detection: events store with 30 day label delay, resampling to 1:20, stage one model under 5ms flags 2% of traffic, stage two under 50ms escalates 0.1% to human review
PayPal transaction monitoring: 5,000 to 20,000 QPS, class weighting with inverse frequency, validation on natural 0.2% prevalence, dynamic thresholds adjusted when base rate doubles
Meta content moderation: focal loss training on billions of posts, calibration stage, monitoring queue saturation at 30,000 items per day, alert if hate speech prevalence shifts by factor of 2