Production Serving Architecture: Latency and Scale Trade-offs
The Latency Challenge
Real-time fraud detection requires decisions within 50-100ms. GNN inference must fetch the target node, retrieve its neighborhood (potentially thousands of edges), compute aggregations, and return a score. Each graph traversal adds latency. A 2-hop neighborhood on a dense graph might touch millions of nodes, which is impossible to compute in real time without optimization.
Design Trade-off: Larger neighborhoods capture more fraud patterns but increase latency. Production systems typically limit to 1-2 hops with sampled neighbors (10-50 per node) to keep inference under 50ms while retaining most detection power.
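The fan-out arithmetic behind this limit can be sketched directly. With k sampled neighbors per node, a 2-hop neighborhood touches on the order of k + k^2 nodes instead of the full graph:

```python
def sampled_fanout(neighbors_per_hop, hops):
    """Upper bound on nodes touched by hop-limited neighbor sampling."""
    total = 1  # the target node itself
    frontier = 1
    for _ in range(hops):
        frontier *= neighbors_per_hop
        total += frontier
    return total

# 2 hops with 25 sampled neighbors per node:
print(sampled_fanout(25, 2))  # 1 + 25 + 625 = 651 nodes
```

At 651 nodes per inference, neighbor fetches and aggregations fit comfortably in a 50ms budget; without sampling, the same 2 hops on a dense graph could require millions of fetches.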
Neighborhood Sampling
Rather than fetching all neighbors, sample a fixed number per hop. Uniform sampling selects neighbors randomly. Importance sampling prioritizes suspicious or active neighbors. Stratified sampling ensures representation of different relationship types (device links vs transaction links). The sampling strategy significantly affects which fraud patterns the model catches.
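A minimal sketch of the three strategies, using Python's standard library (function names and the score-based weighting are illustrative, not a specific framework's API):

```python
import random

def uniform_sample(neighbors, k):
    """Uniform sampling: every neighbor equally likely, without replacement."""
    if len(neighbors) <= k:
        return list(neighbors)
    return random.sample(neighbors, k)

def importance_sample(neighbors, scores, k):
    """Importance sampling: weight neighbors by a suspicion/activity score.
    Note random.choices samples WITH replacement; dedupe if that matters."""
    if len(neighbors) <= k:
        return list(neighbors)
    return random.choices(neighbors, weights=scores, k=k)

def stratified_sample(neighbors_by_type, k_per_type):
    """Stratified sampling: a fixed quota per relationship type
    (e.g. device links vs transaction links)."""
    sampled = []
    for edge_type, nbrs in neighbors_by_type.items():
        sampled.extend(uniform_sample(nbrs, k_per_type.get(edge_type, 0)))
    return sampled
```

Stratification matters because uniform sampling over a node dominated by transaction edges can starve the rarer but more discriminative device links.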
Pre-computed Embeddings
Instead of computing GNN embeddings at inference time, pre-compute node embeddings periodically (hourly or daily) and store them. At inference time, fetch the pre-computed embedding and combine with real-time transaction features. This reduces latency to a simple lookup plus a small neural network forward pass.
Warning: Pre-computed embeddings become stale. A user flagged 1 hour ago still has a clean embedding until the next refresh. Balance freshness (more frequent updates) against computational cost.
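The inference path then reduces to a lookup plus a small forward pass. A sketch with NumPy, where the store, the embedding size, and the scoring head are all illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-computed store: node_id -> embedding, refreshed hourly/daily.
embedding_store = {"user_42": rng.standard_normal(16)}

# Small scoring head over [embedding | real-time features]; weights are
# random here, where a real system would load trained parameters.
W = rng.standard_normal(16 + 4)

def score(node_id, txn_features):
    """O(1) embedding lookup + tiny forward pass; no graph traversal."""
    emb = embedding_store[node_id]
    x = np.concatenate([emb, txn_features])
    return float(1.0 / (1.0 + np.exp(-(x @ W))))  # sigmoid fraud score
```

Total inference cost is one key-value fetch and a few hundred multiply-adds, comfortably inside the 50ms budget, at the price of the staleness noted above.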
Graph Database Selection
The graph store must support fast neighbor lookups. Options: native graph databases (Neo4j, TigerGraph) optimized for multi-hop traversals, key-value stores (Redis) holding adjacency lists for cheap single-hop fetches, or distributed stores (DynamoDB) for horizontal scale. Choose based on the dominant query pattern: low-latency random access at inference time versus bulk traversals for batch embedding refreshes.
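The adjacency-list layout can be sketched as follows. A plain dict stands in for the key-value store; in Redis the same layout would be one set or list per node (e.g. fetched with SMEMBERS or LRANGE on a key like "adj:user_42"):

```python
# Dict emulating a key-value store: node_id -> adjacency list.
adjacency = {
    "user_42": ["device_7", "txn_901", "txn_902"],
    "device_7": ["user_42", "user_99"],
}

def neighbors(node_id, hops=1):
    """Breadth-first fetch of the k-hop neighborhood from adjacency lists.
    Each hop costs one batch of key lookups, so latency grows with hops,
    not with total graph size."""
    frontier, seen = {node_id}, {node_id}
    for _ in range(hops):
        frontier = {n for f in frontier
                      for n in adjacency.get(f, [])} - seen
        seen |= frontier
    return seen - {node_id}
```

This is why hop count dominates the latency budget: every additional hop is another round of store lookups, regardless of which backend serves them.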