
Trade-offs: Window Size, Exactness, and Feature Breadth

Designing temporal features requires navigating fundamental trade-offs between responsiveness and stability, cost and accuracy, and model complexity and latency.

Window size creates the core trade-off. Short windows like 1 minute respond quickly to changes but amplify noise: a single retry burst or a legitimate high-activity period can trigger false alarms. Long windows like 7 days smooth noise and provide stable baselines but take days to adapt to real behavior shifts; a compromised account might transact fraudulently for hours before a 7-day average moves enough to signal an anomaly.

The standard solution is multi-window features: include 1-minute, 5-minute, 1-hour, 24-hour, and 7-day counts and let the model learn the optimal weights. Gradient-boosted trees excel at this, automatically selecting the most informative window per split. The cost is increased feature volume and memory, since state scales linearly with the number of windows. At 10 million active cards, 20 features, and five windows per feature, this is 1 billion numeric values in memory, requiring sharding and careful eviction.

Exactness versus approximation affects high-cardinality aggregates like distinct counts. Exact distinct device IDs per merchant over 24 hours requires storing every device ID seen, which for large merchants means megabytes per key. Probabilistic sketches like HyperLogLog reduce this to 1 to 2 kilobytes with about 2% error. Use approximations for network-level signals where small errors are tolerable and entity counts are high; keep exact counts for critical entity features when the budget allows and cardinality is bounded, for example distinct cards per IP, where typical legitimate users have 1 to 3 cards.

Online computation versus precomputation trades latency variance for freshness. Computing all aggregates online increases tail latency when complex features like percentiles or cross-entity joins are needed. Precomputing long windows in batch reduces online pressure but introduces staleness: a 24-hour window updated hourly is up to 1 hour stale. A hybrid works best: compute windows under 1 hour online with stream processors, and refresh windows longer than 1 hour from batch. This balances freshness for fast signals with cost efficiency for slow signals.

Feature breadth versus inference latency is critical for real-time systems, since each additional feature increases fetch time and model computation. Fraud systems targeting 50 milliseconds total latency allocate 10 milliseconds for feature fetch, 30 milliseconds for inference, and 10 milliseconds for overhead. With 3-millisecond p50 cache reads, this allows fetching about 30 features with some headroom for tail latency; ads systems with 100-millisecond budgets tolerate 200 to 500 features. To stay within budget, precompute cross-entity features offline and cache them rather than computing them on the fly. For example, the percentile rank of a transaction amount within its merchant's distribution is computed daily and looked up at inference time.
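To make the multi-window strategy concrete, here is a minimal Python sketch of per-entity velocity counters over the five windows named above. It assumes events arrive in time order and per-entity state fits in memory; a production system would shard this state and bucket long windows rather than keep raw timestamps.

```python
import time
from collections import defaultdict, deque

# The window set from the text: 1m, 5m, 1h, 24h, 7d (in seconds).
WINDOWS = {"1m": 60, "5m": 300, "1h": 3600, "24h": 86400, "7d": 604800}

class VelocityCounter:
    """Sliding-window event counts per entity (e.g., per card)."""

    def __init__(self):
        # One timestamp deque per entity. Memory grows with event rate,
        # which is why long windows are usually bucketed in practice.
        self.events = defaultdict(deque)

    def record(self, entity: str, ts: float) -> None:
        self.events[entity].append(ts)  # assumes roughly time-ordered input

    def features(self, entity: str, now: float | None = None) -> dict:
        now = now or time.time()
        q = self.events[entity]
        # Evict anything older than the longest window.
        while q and q[0] < now - max(WINDOWS.values()):
            q.popleft()
        return {f"count_{name}": sum(1 for t in q if t >= now - span)
                for name, span in WINDOWS.items()}

counter = VelocityCounter()
now = time.time()
for dt in (90000, 3000, 40, 20, 5):  # older events, then a recent burst
    counter.record("card_123", now - dt)
print(counter.features("card_123", now))
# {'count_1m': 3, 'count_5m': 3, 'count_1h': 4, 'count_24h': 4, 'count_7d': 5}
```

The model then sees all five counts per entity and can learn, per split, which window separates fraud from legitimate bursts.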
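For the exactness trade-off, a from-scratch HyperLogLog illustrates why roughly 1.5 KB of registers is enough for about 2% error on distinct counts. This is an educational sketch, not a production library; in practice you would reach for an existing sketch implementation.

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog. p=11 gives 2048 registers: ~1.5 KB when
    packed at 6 bits each, with standard error 1.04/sqrt(2048) ≈ 2.3%."""

    def __init__(self, p: int = 11):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m  # packed storage in real systems

    def add(self, item: str) -> None:
        # 64-bit hash: first p bits pick a register, the rest are ranked
        # by position of the leftmost 1-bit.
        x = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = x >> (64 - self.p)
        rest = x & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:  # small-range correction
            estimate = self.m * math.log(self.m / zeros)
        return estimate

hll = HyperLogLog(p=11)
for i in range(50_000):                # e.g., distinct devices at a merchant
    hll.add(f"device_{i}")
print(round(hll.count()))              # ≈ 50,000, typically within ~2%
```

The exact alternative for this merchant would be a 50K-entry set of device IDs, which is why sketches win for high-cardinality, error-tolerant network features.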
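At serving time, the hybrid online/batch split reduces to merging two feature sources. The store names and keys below are illustrative stand-ins: think stream-processor state for the short windows and a key-value cache refreshed by batch for the long ones.

```python
# Stand-in dicts for two feature stores; in practice, think Flink/Kafka
# Streams state for short windows and Redis refreshed by batch for long ones.

online_store = {   # updated per event by the stream job (windows < 1 hour)
    "card_123": {"count_1m": 3, "count_5m": 3, "count_1h": 4},
}
batch_store = {    # refreshed hourly/daily by batch (windows >= 1 hour)
    "card_123": {"count_24h": 4, "count_7d": 5, "avg_amount_7d": 62.10},
}

def fetch_features(entity: str) -> dict:
    """Merge cheap-but-stale long windows with fresh short windows."""
    features = {}
    features.update(batch_store.get(entity, {}))   # up to ~1 hour stale
    features.update(online_store.get(entity, {}))  # fresh, wins on overlap
    return features

print(fetch_features("card_123"))
```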
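Finally, here is a sketch of the precomputed cross-entity feature from the last paragraph: a daily batch job compresses each merchant's amount distribution into a small quantile grid, and inference recovers an approximate percentile rank with a binary search instead of an online scan. The quantile grid and helper names are assumptions for illustration.

```python
import bisect

# Daily batch: compress each merchant's amounts into a 9-point quantile grid.
QUANTILES = [0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99]

def build_quantile_table(amounts: list[float]) -> list[float]:
    xs = sorted(amounts)
    return [xs[min(int(q * len(xs)), len(xs) - 1)] for q in QUANTILES]

# Online: approximate percentile rank in O(log k), k = 9 grid points,
# interpolating linearly between adjacent quantiles.
def percentile_rank(table: list[float], amount: float) -> float:
    i = bisect.bisect_right(table, amount)
    if i == 0:
        return QUANTILES[0]
    if i == len(table):
        return QUANTILES[-1]
    lo_q, hi_q = QUANTILES[i - 1], QUANTILES[i]
    lo_v, hi_v = table[i - 1], table[i]
    frac = (amount - lo_v) / (hi_v - lo_v) if hi_v > lo_v else 0.0
    return lo_q + frac * (hi_q - lo_q)

table = build_quantile_table([10, 12, 15, 20, 25, 40, 80, 150, 400, 900])
print(round(percentile_rank(table, 500.0), 2))  # ~0.82: an unusual amount
```

The expensive part (sorting a merchant's full transaction history) runs offline once a day; the online path is a sub-millisecond lookup plus a binary search, which is what keeps the 10-millisecond fetch budget intact.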
💡 Key Takeaways
Short windows like 1 minute respond quickly but amplify noise from retries; long windows like 7 days are stable but take days to detect real shifts
Multi-window strategy uses 1-minute, 5-minute, 1-hour, 24-hour, and 7-day counts together, letting gradient-boosted trees learn optimal weights per feature
Exact distinct counts require storing all IDs (megabytes per merchant); HyperLogLog sketches reduce this to 1 to 2 KB with ~2% error for network-level aggregates
Hybrid online and batch: compute windows under 1 hour in stream processors for freshness; refresh longer windows from batch hourly to save cost
An inference latency budget of 50ms allocates 10ms for feature fetch, 30ms for the model, and 10ms overhead; with 3ms p50 cache reads, fetch up to ~30 features
Precompute expensive cross-entity features daily, like the percentile rank of transaction amount within a merchant's distribution, then cache for sub-millisecond lookup
📌 Examples
Stripe fraud model: a 1-minute window catches velocity attacks (20 attempts in 60 seconds), a 7-day window establishes the baseline (5 per day on average), and the model weights the 1-minute signal 3x higher for high-risk merchants
PayPal distinct device count: a large merchant sees 50K devices in 24 hours; exact storage is 50K × 16 bytes = 800KB per merchant, while a HyperLogLog is 1.5KB with 2% error, so the sketch is chosen for network features
Uber real-time pricing: compute 5-minute demand per geohash online for immediate surge response, update 8-week seasonal profiles in a daily batch for the baseline, and merge the two at inference
Amazon fraud detection: a 50ms latency budget with 30 features at 3ms p50 per read gives ~10ms feature-fetch p50 and ~50ms p99; adding 20 more features would push p99 to ~80ms and violate the SLA