Fraud Detection & Anomaly Detection • Supervised Anomaly Detection (Imbalanced Classification) • Medium • ⏱️ ~3 min
Threshold Tuning and Cost-Sensitive Decision Making
Supervised fraud models produce calibrated probability scores, but business value comes from converting scores into actions through thresholds. The key insight is that different errors have vastly different costs. Missing a $5,000 fraud case costs $5,000 plus a $25 chargeback fee. Blocking a legitimate $50 transaction costs customer frustration and potential churn. Sending a transaction to human review costs $2 to $5 in analyst time. Optimal thresholds balance these costs to maximize expected profit.
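The cost asymmetry above implies a break-even fraud probability at which paying for review beats auto-approving. A minimal sketch using the $5 review cost and $25 chargeback fee from the text; the break-even formula itself is an illustrative simplification (it ignores friction costs and assumes review catches the fraud):

```python
# Break-even point: review is worth its cost once the expected fraud loss
# p * (amount + chargeback_fee) exceeds the review cost.
def breakeven_review_probability(amount: float,
                                 review_cost: float = 5.0,
                                 chargeback_fee: float = 25.0) -> float:
    """Fraud probability above which a paid review beats auto-approval."""
    return review_cost / (amount + chargeback_fee)

# A $5,000 transaction justifies review at roughly a 0.1% fraud probability,
# which is why production review thresholds sit so low on the score scale.
```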
Most production systems use a three-action policy instead of binary classification: auto-approve below a threshold T_review (typically 0.01 to 0.03), route to human review between T_review and T_block (0.03 to 0.15), and auto-block above T_block. The review band is capacity-constrained because review teams are finite and expensive; payment companies target 1% to 2% of transactions in review to keep costs manageable. If fraud rates spike, you must raise T_review to avoid overwhelming the queue, which means approving riskier transactions.
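The three-action policy reduces to a small decision function. The threshold values here are the illustrative numbers from the text, not tuned figures:

```python
# Sketch of the three-action policy: calibrated fraud score in, action out.
T_REVIEW = 0.02  # below this: auto-approve
T_BLOCK = 0.15   # at or above this: auto-block

def decide(score: float) -> str:
    """Map a calibrated fraud probability to one of three actions."""
    if score < T_REVIEW:
        return "approve"
    if score < T_BLOCK:
        return "review"
    return "block"
```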
Threshold selection uses cost curves computed on validation data. For each candidate threshold, simulate decisions on historical transactions with known outcomes and compute expected profit: the sum over transactions of (transaction value minus fraud loss minus review cost minus false-positive friction cost). Stripe runs these simulations weekly on the most recent month of data, accounting for a $5 review cost, a $25 chargeback fee plus the transaction amount for fraud, and an estimated $10 to $30 customer-lifetime-value impact per false decline. Different merchant segments use different thresholds: high-value merchants tolerate more review (3% to 5% of transactions) because average transaction values are higher and review cost is proportionally smaller.
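A hedged sketch of this simulation, assuming a labeled validation set of (amount, is_fraud, score) tuples. The cost constants follow the text ($5 review, $25 chargeback fee plus amount for missed fraud, $20 false-decline friction as the midpoint of the $10 to $30 range); the 2% margin on approved legitimate transactions is an assumed take rate, and the assumption that reviewers catch all routed fraud is a simplification:

```python
REVIEW_COST = 5.0          # analyst time per reviewed case
CHARGEBACK_FEE = 25.0      # fee on top of the lost amount for missed fraud
FALSE_DECLINE_COST = 20.0  # assumed friction cost per blocked legit customer
MARGIN = 0.02              # assumed take rate on an approved legit transaction

def simulate_profit(transactions, t_review, t_block):
    """Expected profit of a (t_review, t_block) policy on historical data."""
    profit = 0.0
    for amount, is_fraud, score in transactions:
        if score < t_review:      # auto-approve
            profit += -amount - CHARGEBACK_FEE if is_fraud else MARGIN * amount
        elif score < t_block:     # human review; assume analysts catch fraud
            profit += -REVIEW_COST + (0.0 if is_fraud else MARGIN * amount)
        else:                     # auto-block
            profit += 0.0 if is_fraud else -FALSE_DECLINE_COST
    return profit

def best_thresholds(transactions, grid):
    """Grid-search the threshold pair that maximizes simulated profit."""
    pairs = [(tr, tb) for tr in grid for tb in grid if tr < tb]
    return max(pairs, key=lambda p: simulate_profit(transactions, *p))
```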
Thresholds must also adapt to drift. During a payment-fraud attack, the fraud rate can spike from 0.1% to 0.5% in hours; a fixed threshold overwhelms review capacity and misses fraud. Adaptive systems adjust T_review based on current queue depth and recompute T_block from rolling precision estimates over the last 24 hours of proxy labels. Some teams use separate weekend and weekday thresholds because fraud patterns differ; others apply per-country thresholds, since fraud rates vary from 0.05% in low-risk markets to over 1% in high-risk regions.
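One way to implement the capacity cap is to recompute T_review as a score quantile of recent traffic, so the review band never exceeds what the queue can absorb. The quantile approach and function names here are illustrative, not any vendor's API:

```python
def adaptive_review_threshold(recent_scores, review_capacity=0.02, t_block=0.15):
    """Pick T_review so that at most `review_capacity` of total traffic
    lands in the review band [T_review, t_block)."""
    n = len(recent_scores)
    below_block = sorted(s for s in recent_scores if s < t_block)
    k = int(review_capacity * n)   # number of cases the queue can absorb
    if k == 0 or not below_block:
        return t_block             # no capacity: the review band collapses
    if k >= len(below_block):
        return 0.0                 # ample capacity: review all sub-block traffic
    return below_block[-k]         # only the k riskiest scores get reviewed
```

During an attack, the score distribution shifts upward, more scores crowd the review band, and the same capacity cap automatically pushes T_review higher.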
💡 Key Takeaways
• Three-action policy: auto-approve (below 0.02), human review (0.02 to 0.15), auto-block (above 0.15), based on cost tradeoffs
• Review capacity constrained to 1% to 2% of traffic because each case costs $2 to $5 in analyst time
• Threshold tuning uses cost curves: simulate decisions on validation data, accounting for fraud loss ($25 fee plus amount), review cost, and false-positive friction
• Per-segment thresholds: high-value merchants tolerate a 3% to 5% review rate; low-value merchants stay under 1% due to proportional costs
• Adaptive thresholds adjust for drift: a fraud-rate spike from 0.1% to 0.5% requires raising the review threshold to avoid overwhelming the queue
• Separate thresholds by time (weekend vs. weekday) and geography (0.05% fraud in low-risk vs. over 1% in high-risk countries)
📌 Examples
• Stripe cost simulation: $5 review cost, $25 chargeback fee plus transaction amount for fraud, $20 estimated customer-lifetime-value impact per false decline; run weekly on one month of validation data
• PayPal adaptive thresholds: during an attack the fraud rate jumps from 0.1% to 0.4%, and the system raises the review threshold from 0.03 to 0.06 to cap review volume at 2% of traffic
• Amazon marketplace: the high-value electronics category uses a 0.01 review threshold (3% of traffic) because average order value is $500; apparel uses a 0.04 threshold (1% of traffic) at a $40 average