Fraud Detection & Anomaly Detection • Graph-based Fraud Detection (GNNs)Hard⏱️ ~3 min
Production Serving Architecture: Latency and Scale Trade-offs
A realistic online fraud detection flow for card-present or card-not-present payments has a 50 to 100 ms decision budget at the 95th percentile while handling 5,000 to 50,000 events per second per region during peak shopping periods. The graph service holds hundreds of millions of nodes and billions of edges, for example 200 million entities and 5 to 20 billion edges over a rolling 30 to 90 day window. The serving stack must return decisions with low tail latency and high availability, which forces careful architecture choices.
The end-to-end flow breaks the latency budget across stages. First, the ingest stage processes transactions and events arriving on a streaming bus. A feature pipeline computes rolling counters such as spend in the last 1, 24, and 168 hours, device changes per week, and neighbor degree percentiles. Entity resolution normalizes device and identity signals to limit graph spam from bots creating fake nodes. This stage runs asynchronously, before scoring.
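A minimal sketch of the rolling-counter computation, assuming a simple in-process store keyed by entity; the RollingCounters class, the window set, and the entity IDs are illustrative, and a production pipeline would typically compute these aggregates in a stream processor before scoring.

```python
import time
from collections import defaultdict, deque

WINDOWS_HOURS = (1, 24, 168)  # spend windows referenced in the text

class RollingCounters:
    """Keeps per-entity (timestamp, amount) events for windowed spend aggregates."""
    def __init__(self):
        self.events = defaultdict(deque)  # entity_id -> deque of (ts, amount)

    def add(self, entity_id, ts, amount):
        self.events[entity_id].append((ts, amount))

    def spend_in_window(self, entity_id, hours, now=None):
        now = time.time() if now is None else now
        q = self.events[entity_id]
        # Evict only events older than the largest window so shorter windows stay intact.
        max_cutoff = now - max(WINDOWS_HOURS) * 3600
        while q and q[0][0] < max_cutoff:
            q.popleft()
        cutoff = now - hours * 3600
        return sum(amount for ts, amount in q if ts >= cutoff)

counters = RollingCounters()
counters.add("card_123", time.time(), 49.90)
features = {f"spend_{h}h": counters.spend_in_window("card_123", h) for h in WINDOWS_HOURS}
```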
Second, graph access fetches a 2-hop neighborhood around the payer, payee, card, and device within the last 7 to 30 days. To cap latency, the system samples up to N neighbors per type per hop, for example 20 cards per device and 10 devices per card. Tight bounds prevent neighborhood explosion, where a popular merchant with millions of connections would blow up query time. With in-memory adjacency lists and good cache locality, this fetch runs in 2 to 5 ms p50 and 8 to 15 ms p95. Caching hot entities such as active devices and merchants in RAM reduces tail latency.
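A minimal sketch of the bounded 2-hop fetch, assuming an in-memory, type-keyed adjacency map; the cap values mirror the 20-cards-per-device and 10-devices-per-card example, while the data structure and function names are hypothetical.

```python
import random
from collections import defaultdict

# adjacency[node_type][node_id] -> {neighbor_type: [neighbor_ids]}
adjacency = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
adjacency["device"]["dev_42"]["card"] = ["card_1", "card_2", "card_3"]
adjacency["card"]["card_1"]["device"] = ["dev_42", "dev_99"]

PER_TYPE_CAPS = {("device", "card"): 20, ("card", "device"): 10}
DEFAULT_CAP = 15  # fallback cap for edge types not listed above

def sample_neighbors(node_type, node_id, rng):
    """Return capped neighbor samples for one node, grouped by neighbor type."""
    sampled = {}
    for nbr_type, nbrs in adjacency[node_type][node_id].items():
        cap = PER_TYPE_CAPS.get((node_type, nbr_type), DEFAULT_CAP)
        sampled[nbr_type] = nbrs if len(nbrs) <= cap else rng.sample(nbrs, cap)
    return sampled

def fetch_two_hop(seeds, seed=0):
    """seeds: list of (node_type, node_id) tuples for payer, payee, card, and device."""
    rng = random.Random(seed)
    frontier, visited = list(seeds), set(seeds)
    for _hop in range(2):  # strict 2-hop limit to avoid neighborhood explosion
        next_frontier = []
        for node_type, node_id in frontier:
            for nbr_type, nbr_ids in sample_neighbors(node_type, node_id, rng).items():
                for nbr_id in nbr_ids:
                    if (nbr_type, nbr_id) not in visited:
                        visited.add((nbr_type, nbr_id))
                        next_frontier.append((nbr_type, nbr_id))
        frontier = next_frontier
    return visited

subgraph_nodes = fetch_two_hop([("card", "card_1"), ("device", "dev_42")])
```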
Third, scoring applies the fraud model. Most systems precompute node embeddings offline every few hours using a full GNN on a graph snapshot, storing 128- to 256-dimensional vectors in a feature store or in-memory cache. At inference time, a small online GNN layer or multilayer perceptron (MLP) operates on the precomputed embeddings plus temporal edge features for freshness, targeting 5 to 15 ms p95 on CPUs with vectorization. When budgets are tighter, a cached embedding lookup plus a lightweight classifier runs in 2 to 5 ms. A rules engine executes explainable checks such as velocity limits and known-bad lists in parallel, adding 1 to 5 ms.
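A minimal sketch of the scoring split, assuming NumPy on CPU; the randomly initialized weights, the four temporal features, and the rule names are placeholders standing in for a trained online layer and a real rules engine.

```python
import numpy as np

EMB_DIM = 128
embedding_cache = {}  # entity_id -> np.ndarray of shape (EMB_DIM,), refreshed offline

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(EMB_DIM + 4, 64)), np.zeros(64)  # +4 temporal edge features
W2, b2 = rng.normal(size=(64, 1)), np.zeros(1)

def online_score(entity_id, temporal_feats):
    """Cached-embedding lookup plus a small MLP over embedding and temporal features."""
    emb = embedding_cache.get(entity_id, np.zeros(EMB_DIM))  # cold entities fall back to zeros
    x = np.concatenate([emb, temporal_feats])
    h = np.maximum(x @ W1 + b1, 0.0)                         # ReLU hidden layer
    z = (h @ W2 + b2).item()
    return 1.0 / (1.0 + np.exp(-z))                          # sigmoid fraud probability

def rules_score(txn):
    """Explainable velocity and deny-list checks that run in parallel with the model."""
    hits = []
    if txn.get("txn_count_5min", 0) > 3:
        hits.append("velocity_limit")
    if txn.get("device_id") in {"known_bad_device"}:
        hits.append("deny_list")
    return hits

gnn_score = online_score("card_123", np.array([3.0, 1.0, 0.0, 2.0]))
rule_hits = rules_score({"txn_count_5min": 4, "device_id": "dev_42"})
```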
Fourth, decision and action fuse the GNN score, a tabular model score, and rules into a final risk score. Thresholds drive outcomes: allow, step-up authentication with two-factor verification, or routing to manual review. The decision is logged, counters are updated, and the new transaction edge is appended to the graph for future queries.
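A minimal sketch of the fusion and thresholding step; the blend weights and the allow/review cutoffs are illustrative assumptions rather than values from the text.

```python
def decide(gnn_score, tabular_score, rule_hits, allow_below=0.3, review_above=0.8):
    """Fuse model scores and rule hits into a final risk score and an action."""
    if "deny_list" in rule_hits:
        return "manual_review", 1.0        # hard rules short-circuit the model scores
    risk = 0.5 * gnn_score + 0.4 * tabular_score + 0.1 * min(len(rule_hits), 1)
    if risk < allow_below:
        return "allow", risk
    if risk > review_above:
        return "manual_review", risk
    return "step_up_auth", risk            # step-up authentication for the grey zone

action, risk = decide(gnn_score=0.42, tabular_score=0.35, rule_hits=["velocity_limit"])
# After deciding: log the outcome, update counters, and append the new transaction edge.
```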
This hybrid offline-plus-online design balances accuracy and latency. Precomputing embeddings captures expensive multi-hop message passing offline, while online layers add temporal context from the last few minutes. Ant Group and Alibaba report billion-scale graphs for risk control that use neighborhood sampling for real-time inference within tens of milliseconds. The trade-off is embedding staleness. Refreshing embeddings every 1 to 6 hours reduces compute cost and smooths predictions, but risks missing fast-evolving fraud rings. Systems handle this with online temporal features that capture velocity spikes even when embeddings are hours old.
Another critical trade-off is neighborhood depth versus latency variance. Deeper GNNs with 3 or 4 hops and larger neighborhoods (50-plus neighbors per hop) capture long-range collusion better, but they increase tail latency and introduce high variance from sampling randomness. Most production stacks limit queries to 2 hops with strict per-type sampling caps (10 to 25 neighbors per hop), then rely on attention mechanisms and time decay rather than very deep layers to capture context.
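One way to realize attention plus time decay without deeper layers is a single aggregation step that down-weights stale edges. The sketch below assumes dot-product attention and an exponential decay with a hypothetical one-day half-life.

```python
import numpy as np

def aggregate_neighbors(center_emb, nbr_embs, nbr_ages_sec, half_life_sec=86400.0):
    """Combine sampled neighbor embeddings using attention weights scaled by time decay."""
    decay = np.exp(-np.log(2.0) * np.asarray(nbr_ages_sec, dtype=float) / half_life_sec)
    attn_logits = nbr_embs @ center_emb                  # dot-product attention scores
    weights = np.exp(attn_logits - attn_logits.max()) * decay
    weights /= weights.sum()
    return (weights[:, None] * nbr_embs).sum(axis=0)     # time-aware neighbor message

# Example: 3 sampled neighbors with 64-dim embeddings, aged 5 minutes to 2 days.
rng = np.random.default_rng(1)
msg = aggregate_neighbors(rng.normal(size=64), rng.normal(size=(3, 64)),
                          nbr_ages_sec=[300, 3600, 172800])
```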
💡 Key Takeaways
• Hybrid design precomputes expensive node embeddings offline every 1 to 6 hours using a full GNN on a graph snapshot, then applies a fast online layer with temporal features to meet the 50 to 100 ms p95 latency budget
• Graph fetch targets 2 to 5 ms p50 and 8 to 15 ms p95 by enforcing strict per-type neighbor sampling caps (20 cards per device, 10 devices per card) and caching hot entities in memory
• Scoring splits into a precomputed embedding lookup (2 to 5 ms), an online MLP or small GNN layer (5 to 15 ms p95 on CPU), and a parallel rules engine (1 to 5 ms)
• Embedding refresh cadence trades freshness against cost and stability: a 1 to 6 hour refresh reduces compute and smooths predictions but risks missing fast-evolving rings, mitigated by online temporal velocity features
• Limiting to 2 hops with 10 to 25 neighbors per hop prevents neighborhood explosion and latency variance, relying on attention and time decay rather than very deep layers for context
📌 Examples
Ant Group billion-scale graph: Samples neighborhoods in real time using in-memory adjacency and achieves inference latency in the tens of milliseconds for risk control decisions.
Embedding staleness scenario: Offline embeddings computed at 3am capture the graph state from that snapshot. At 8am during the shopping peak, a new fraud ring emerges. Online velocity features (3 transactions in 5 minutes from a new device) flag the activity despite stale embeddings.
Latency breakdown for a 100 ms p95 budget: Feature pipeline async (0 ms), graph fetch 15 ms, embedding lookup 3 ms, online GNN layer 12 ms, rules 4 ms, decision fusion 2 ms, logging and counters 4 ms; total 40 ms with 60 ms headroom.