Label Delay and Feedback Loops: The Hidden Challenges of Fraud Detection
Label delay is the defining challenge of production fraud detection systems. Chargebacks arrive 30 to 90 days after the transaction, and sometimes more than 120 days for international cards. During that delay you are deploying models trained on incomplete data, and the decisions those models make determine which future labels you will ever see. This creates feedback loops in which model actions bias the training data.
The feedback loop works like this: if you auto-block transactions with scores above 0.15, you never learn the true outcomes of those transactions. The model drifts toward approving borderline fraud because only approved transactions yield ground-truth labels. Over time, precision at the block threshold appears artificially high because you are measuring only the few cases that slip through. PayPal observed this when a model showed 95% precision on auto-blocks in monitoring, but a manual review of a sample found that true precision was closer to 70%; the missing 25 percentage points were blocked fraud that never generated chargebacks.
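This selective-labels effect is easy to reproduce in a toy simulation. The sketch below is illustrative only (the 0.15 threshold comes from the text; the fraud rate and score distributions are invented): it blocks everything above the cutoff and counts how much fraud ever receives a ground-truth label.

```python
import numpy as np

# Toy simulation of the selective-labels feedback loop: blocked
# transactions never generate chargebacks, so their outcomes are
# invisible to monitoring.
rng = np.random.default_rng(0)
n = 100_000
is_fraud = rng.random(n) < 0.01  # ~1% fraud rate (assumed)

# Hypothetical model score: fraud tends to score high, but imperfectly.
score = np.where(is_fraud,
                 rng.normal(0.60, 0.20, n),
                 rng.normal(0.08, 0.03, n)).clip(0, 1)

BLOCK_THRESHOLD = 0.15
blocked = score > BLOCK_THRESHOLD

# What a manual review of blocked traffic would reveal.
true_precision = is_fraud[blocked].mean()

# Chargeback labels only ever arrive for approved transactions.
fraud_with_labels = is_fraud & ~blocked

print(f"true precision at block threshold: {true_precision:.2f}")
print(f"fraud that ever gets a label: {fraud_with_labels.sum()} of {is_fraud.sum()}")
```

Nearly all fraud is (correctly) blocked, so almost none of it ever shows up as a labeled positive; any metric computed from observed labels describes only the traffic that slipped through.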
Mitigation requires deliberate exploration and proxy labels. Exploration means approving a small random sample (0.1% to 1%) of transactions across all score ranges, including high-risk ones. Stripe runs exploration on 0.5% of traffic with strict loss caps per merchant. This yields unbiased labels but accumulates real fraud losses. Proxy labels provide faster feedback: network risk codes from Visa and Mastercard arrive within 24 hours, merchant disputes surface within 3 to 7 days, and account closures indicate historical fraud. These proxies are noisy, but they break the feedback loop. Models train on a blend, 80% weight on confirmed chargebacks plus 20% weight on proxies, then calibrate on chargeback-only validation sets.
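A minimal sketch of that blend, assuming a pandas DataFrame with hypothetical `chargeback` (confirmed but slow) and `proxy_fraud` (fast but noisy) columns; the 0.5% exploration rate and the 80/20 weights come from the text, everything else here is invented:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 20_000
txns = pd.DataFrame({
    "amount": rng.exponential(50.0, n),              # hypothetical features
    "velocity_30d": rng.poisson(3, n).astype(float),
})
# Confirmed chargebacks exist only for older traffic; proxies cover more.
txns["chargeback"] = np.where(rng.random(n) < 0.3,
                              (rng.random(n) < 0.02).astype(float), np.nan)
txns["proxy_fraud"] = np.where(rng.random(n) < 0.6,
                               (rng.random(n) < 0.05).astype(float), np.nan)

# Exploration: approve a small random slice across all score ranges to
# collect unbiased labels (0.5% of traffic, per the text).
txns["explore"] = rng.random(n) < 0.005

# Blend: 80% weight on confirmed chargebacks, 20% on proxy labels.
confirmed = txns[txns["chargeback"].notna()].copy()
confirmed["label"] = confirmed["chargeback"].astype(int)
confirmed["weight"] = 0.8

proxied = txns[txns["chargeback"].isna() & txns["proxy_fraud"].notna()].copy()
proxied["label"] = proxied["proxy_fraud"].astype(int)
proxied["weight"] = 0.2

train = pd.concat([confirmed, proxied], ignore_index=True)
features = ["amount", "velocity_30d"]
model = GradientBoostingClassifier()
model.fit(train[features], train["label"], sample_weight=train["weight"])
# Calibrate thresholds afterwards on a chargeback-only validation set.
```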
Time-based validation is critical. Split on event time, never randomly: train on January through March, validate on April, test on May. Compute features using only data available at decision time. A common error is using 30-day aggregates that include the future: for a transaction on March 15th, the 30-day window must end on March 15th, not March 31st. Leakage from temporal misalignment can inflate PR AUC by 2x to 10x. Resampling across time also causes leakage: if you undersample normal transactions, group by user or card to avoid putting the same entity in both train and test with different labels.
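A sketch of both guards on a synthetic frame (column names `event_time`, `card_id`, and `amount` are hypothetical): split strictly on event time, and compute the 30-day aggregate so its window ends at each transaction's own timestamp.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 50_000
minutes = rng.choice(150 * 24 * 60, size=n, replace=False)  # unique times
txns = pd.DataFrame({
    "event_time": pd.Timestamp("2024-01-01") + pd.to_timedelta(minutes, unit="m"),
    "card_id": rng.integers(0, 2_000, n),
    "amount": rng.exponential(50.0, n),
}).sort_values("event_time").set_index("event_time")

# Point-in-time feature: trailing 30-day spend per card. closed="left"
# ends the window at the transaction and excludes the transaction itself,
# so the feature uses strictly past data.
txns["spend_30d"] = (
    txns.groupby("card_id")["amount"]
        .transform(lambda s: s.rolling("30D", closed="left").sum())
        .fillna(0.0)  # cards with no prior history
)
txns = txns.reset_index()

# Time-based split: train Jan-Mar, validate April, test May.
train = txns[txns["event_time"] < "2024-04-01"]
valid = txns[(txns["event_time"] >= "2024-04-01") &
             (txns["event_time"] < "2024-05-01")]
test  = txns[txns["event_time"] >= "2024-05-01"]
```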
💡 Key Takeaways
• Chargebacks arrive 30 to 90 days after the transaction, making short-term model evaluation impossible without proxy labels
• Feedback loops occur when auto-block decisions prevent label collection: blocked fraud never generates chargebacks, biasing training data
• Exploration traffic (0.1% to 1%) approves random samples across all risk scores to collect unbiased labels, with strict loss caps per merchant
• Proxy labels provide fast feedback (network risk codes in 24 hours, merchant disputes in 3 to 7 days) but are noisier than true chargebacks
• Temporal leakage from random splits or future-looking features inflates validation PR AUC by 2x to 10x; use strict time-based splits
• Group by entity when resampling to prevent the same user or card appearing in both train and test with different labels (see the sketch after this list)
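A minimal sketch of that last point, using scikit-learn's GroupShuffleSplit so the same card never straddles the split; the feature matrix and the 10:1 undersampling ratio are placeholders.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 5))            # placeholder features
y = rng.random(n) < 0.01               # ~1% fraud
card_ids = rng.integers(0, 1_000, n)   # entity to group by

# Split by card so no entity appears on both sides.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=card_ids))

# Undersample the majority class inside the training fold only.
pos = train_idx[y[train_idx]]
neg = train_idx[~y[train_idx]]
neg_kept = rng.choice(neg, size=min(len(neg), 10 * len(pos)), replace=False)
train_idx = np.concatenate([pos, neg_kept])
```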
📌 Examples
PayPal feedback loop: model showed 95% precision on auto-blocks in monitoring, but manual sample review found 70% true precision; the 25-percentage-point gap came from missing blocked-fraud labels
Stripe exploration: 0.5% of transactions approved randomly across all scores, with a $10K monthly loss cap per merchant; yields unbiased labels for calibration
Uber safety model: combines 80% weight on confirmed safety incidents (14-day delay) with 20% weight on driver reports (same day) and cancellation patterns (24 hours)