
Failure Modes: Label Leakage, Skew, and Adversarial Evasion

Temporal feature systems fail in subtle ways that inflate offline metrics but degrade production accuracy. The three most critical failure modes are label leakage, training-serving skew, and adversarial evasion.

Label leakage occurs when feature computation uses data from the future relative to prediction time. The classic mistake: computing a 24-hour transaction count that includes events after the label timestamp during training. Offline accuracy jumps to 95%, but production drops to 78% because future data is unavailable at inference. Prevent this with strict point-in-time joins: for each training example at timestamp T, query the feature store for values as of T minus 1 second. Validate with negative tests that prove the join respects time boundaries, for example by confirming that adding events after T leaves the feature at T unchanged (sketched in code below). Uber discovered label leakage when an ETA model trained on batch features showed 12% mean absolute error offline but 18% in production; the batch job used arrival time instead of request time, leaking future traffic information.

Training-serving skew happens when online and offline feature logic diverge. Causes include different time-zone handling, rounding, window-boundary alignment, or deduplication logic. If offline computes day of week in UTC but online uses local time, and the model relies on weekend patterns, accuracy drops 5 to 10% after deployment. If offline uses tumbling 1-hour windows aligned to the hour but online uses sliding 60-minute windows, counts can differ by up to 100% at window boundaries. Prevent skew with shared feature definitions in code, not documentation: use the same libraries and parameters in both paths. Deploy shadow mode: run online features alongside production, log both, and compare distributions daily. Stripe caught a skew where offline deduplicated by transaction ID but online deduplicated by idempotency key, causing online counts to be 8% lower and degrading precision by 3 points.

Adversarial evasion targets velocity and aggregation features. Attackers spread attempts across many devices or time windows to stay below thresholds. If a rule blocks cards with more than 10 attempts in 5 minutes, the attacker waits 6 minutes between bursts; if the system tracks only per-card velocity, the attacker uses 100 stolen cards with 2 attempts each. Defenses include multi-entity features (track per IP, per device, per billing address), multiple window lengths (1 minute, 5 minutes, 1 hour, 24 hours), and acceleration signals (a rising rate is suspicious even if the absolute count is moderate). Use model outputs instead of hard thresholds to reduce predictability, and add cross-entity ratios: if a device fingerprint is associated with 50 distinct cards in 1 hour, block regardless of per-card velocity. PayPal detects evasion by tracking the graph connectivity of cards, IPs, and emails; a tightly connected cluster with high velocity indicates a coordinated fraud ring even if individual node metrics look normal.

Cold start and regime shifts are secondary failure modes. New entities have no history, so velocity and lag features are missing. Backfill with cohort priors: for a new merchant, use the median transaction rate for its category and region, then decay toward entity-specific values as data accumulates. Holidays, pandemics, and product launches shift seasonality, and long-window aggregates anchor the model to outdated baselines. Add change-point detectors and reduce the weight of old windows during detected shifts.
Amazon adjusts seasonal features during Prime Day by shortening lookback windows from 7 days to 1 day to capture the surge, then reverts after the event.
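To make the point-in-time discipline concrete, here is a minimal Python sketch of a leakage-safe 24-hour count plus one variant of the negative test described above. The DataFrame layout and column names (card_id, ts) are illustrative assumptions, not a real feature-store API.

```python
import pandas as pd

def txn_count_24h(events: pd.DataFrame, card_id: str, t: pd.Timestamp) -> int:
    """24-hour transaction count as of t, using only events strictly before t."""
    as_of = t - pd.Timedelta(seconds=1)            # feature store query "as of T - 1s"
    window_start = as_of - pd.Timedelta(hours=24)
    mask = ((events["card_id"] == card_id)
            & (events["ts"] > window_start)
            & (events["ts"] <= as_of))
    return int(mask.sum())

def test_point_in_time_join(events: pd.DataFrame, card_id: str, t: pd.Timestamp) -> None:
    """Negative test: injecting an event after t must not change the feature at t."""
    before = txn_count_24h(events, card_id, t)
    future = pd.DataFrame({"card_id": [card_id], "ts": [t + pd.Timedelta(hours=1)]})
    after = txn_count_24h(pd.concat([events, future], ignore_index=True), card_id, t)
    assert before == after, "feature saw future data: possible label leakage"
```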
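Skew prevention via shared definitions and a shadow-mode distribution check might look like the sketch below. The single shared day-of-week function is the key idea; the KS test and its p-value threshold are one illustrative choice of comparison, not a prescribed monitoring stack.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def day_of_week(ts: pd.Timestamp) -> int:
    """Shared definition imported by BOTH batch and online paths.
    Expects a tz-aware timestamp; always computed in UTC so paths cannot diverge."""
    return ts.tz_convert("UTC").dayofweek

def shadow_skew_alert(offline_values: np.ndarray, online_values: np.ndarray,
                      p_threshold: float = 0.01) -> bool:
    """Daily shadow-mode check: flag if offline and online feature
    distributions diverge (e.g. the 8% count gap in the Stripe incident)."""
    result = ks_2samp(offline_values, online_values)
    return result.pvalue < p_threshold   # True => investigate divergent logic
```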
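The multi-entity, multi-window defense can be sketched with in-memory counters standing in for a real feature store. Entity names, the acceleration definition, and the device fan-out signal follow the prose above, but the data structures and unbounded card set are simplifications for illustration.

```python
import time
from collections import defaultdict, deque

WINDOWS_S = (60, 300, 3600, 86400)        # 1 min, 5 min, 1 hour, 24 hours

events = defaultdict(deque)               # (entity_type, entity_id) -> event timestamps
cards_per_device = defaultdict(set)       # device_id -> distinct cards seen (no expiry here)

def record(card_id: str, ip: str, device_id: str, billing_addr: str, now: float = None) -> None:
    """Record one attempt against every entity it touches."""
    now = now or time.time()
    for key in (("card", card_id), ("ip", ip), ("device", device_id), ("addr", billing_addr)):
        events[key].append(now)
    cards_per_device[device_id].add(card_id)

def velocity(entity_type: str, entity_id: str, window_s: int, now: float = None) -> int:
    """Attempts by this entity within the window, evicting beyond the longest window."""
    now = now or time.time()
    q = events[(entity_type, entity_id)]
    while q and q[0] < now - max(WINDOWS_S):
        q.popleft()
    return sum(1 for t in q if t > now - window_s)

def features(card_id: str, device_id: str, now: float = None) -> dict:
    short = velocity("card", card_id, 60, now)
    long_ = velocity("card", card_id, 3600, now)
    return {
        "card_1m": short,
        "card_1h": long_,
        # acceleration: last-minute rate vs the hourly per-minute baseline
        "accel": short / max(long_ / 60.0, 1e-9),
        # cross-entity ratio: e.g. 50 distinct cards on one device in an hour
        "device_card_fanout": len(cards_per_device[device_id]),
    }
```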
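For cold start and regime shifts, a sketch of cohort-prior blending and lookback shortening follows. The 20-transaction ramp matches the takeaway below; the linear decay and the boolean change-point interface are simplifying assumptions.

```python
def backfilled_rate(entity_count: int, entity_rate: float,
                    cohort_median_rate: float, ramp: int = 20) -> float:
    """Start at the cohort prior (e.g. median rate for the merchant's category
    and region) and decay linearly toward the entity-specific value over
    `ramp` observed transactions."""
    w = min(entity_count / ramp, 1.0)
    return w * entity_rate + (1.0 - w) * cohort_median_rate

def lookback_days(shift_detected: bool) -> int:
    """During a detected regime shift (e.g. Prime Day), shorten the
    aggregation lookback from 7 days to 1 day, then revert."""
    return 1 if shift_detected else 7
```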
💡 Key Takeaways
Label leakage uses future data in training features: a 24-hour count that includes events after the label timestamp inflates offline accuracy, which drops from 95% offline to 78% in production
Training-serving skew from divergent logic: offline day-of-week in UTC versus online in local time causes a 5 to 10% accuracy drop when the model relies on weekend patterns
Shadow mode catches skew: run online features alongside production, log both, compare distributions daily; Stripe found an 8% count difference from a deduplication-logic mismatch
Adversarial evasion spreads attempts across entities or time: an attacker uses 100 cards with 2 attempts each to evade a per-card velocity threshold of 10 in 5 minutes
Multi-entity defense: track per-card, per-IP, per-device, and per-billing-address velocity; if a device has 50 distinct cards in 1 hour, block regardless of per-card count
Cold start backfills with cohort priors: a new merchant uses the median transaction rate for its category and region, decaying toward entity-specific values as data accumulates over 20 transactions or 1 day
📌 Examples
Uber ETA label leakage: batch features used arrival time instead of request time, leaking future traffic data; offline MAE was 12% but production MAE was 18%, a 50% increase
Stripe deduplication skew: offline used transaction ID, online used idempotency key; online counts were 8% lower, causing precision to drop 3 points from 92% to 89% after deployment
PayPal fraud ring evasion: an attacker uses 200 cards from 50 IPs, each card attempting 3 transactions in 10 minutes; per-card velocity is low but graph clustering reveals the coordinated attack
Amazon Prime Day regime shift: a model trained on a 7-day lookback fails during the surge; switching to a 1-day lookback for the event duration captures the 3x demand spike and improves forecast accuracy by 25%