Online Feature Store Architecture for Sub-10ms Reads
The feature store is often the bottleneck in real-time scoring. Your model needs dozens to hundreds of features, and fetching them dominates end-to-end latency if the read path is not designed carefully. An effective online feature store uses a layered caching strategy with in-process memory, regional caches, and a backing database, each layer trading off latency against freshness and cost.
The fastest layer is an in-process cache inside the scoring service itself, typically implemented with a least-recently-used (LRU) eviction policy and holding the hottest features on the heap. Reads complete in microseconds to 1 millisecond since there is no serialization or network hop. This works well for features that rarely change, like user account creation date or merchant category, and for read-heavy workloads where a small set of features is accessed repeatedly. The downside is memory overhead and cache invalidation complexity: if feature values update, you need a mechanism to invalidate or refresh entries across all service instances.
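As a concrete sketch, the in-process layer can be as small as an LRU map with a TTL, so that updated feature values eventually refresh on their own instead of requiring a cross-instance invalidation protocol. The class and parameters below are illustrative, not any specific library's API.

```python
import time
from collections import OrderedDict

class InProcessFeatureCache:
    """Tiny LRU cache with TTL-based expiry for the in-process layer.

    Entries older than ttl_seconds are treated as misses, forcing a
    refresh from the next layer (regional cache or online store).
    """

    def __init__(self, max_entries=10_000, ttl_seconds=300.0):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self._store = OrderedDict()  # key -> (value, inserted_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, inserted_at = entry
        if time.monotonic() - inserted_at > self.ttl_seconds:
            del self._store[key]  # stale: fall through to the next layer
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._store[key] = (value, time.monotonic())
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used
```

The TTL trades a bounded window of staleness for not having to broadcast invalidations, which is usually acceptable for slowly changing features like account age or merchant category.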
The second layer is a regional cache, often Redis or Memcached, sitting close to the scoring service within the same availability zone. It serves 95 to 99 percent of requests with 1 to 3 millisecond reads. Features are keyed by entity ID and versioned to avoid training-serving skew. When the scoring service needs features, it batches the keys and issues a single multi-get request to minimize round trips. Cache misses fall through to the online store, which might be DynamoDB, Cassandra, or a custom key-value store optimized for low-latency reads, responding in 5 to 10 milliseconds.
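The batched read path can be sketched roughly as follows. The key scheme, the `fetch_from_online_store` stand-in, and the cache interface (`mget`/`set`, as in redis-py) are assumptions for illustration:

```python
def get_features(cache, fetch_from_online_store, entity_id, feature_names,
                 schema_version="v3"):
    """Layered read: one multi-get to the regional cache, then fall
    through to the online store only for the misses.

    `cache` needs mget/set (redis-py-style); `fetch_from_online_store`
    stands in for a DynamoDB/Cassandra batch read (5-10 ms).
    """
    # Keys are versioned so training and serving read the same definition.
    keys = [f"{schema_version}:{entity_id}:{name}" for name in feature_names]
    values = cache.mget(keys)  # one round trip for dozens of features
    missing = [k for k, v in zip(keys, values) if v is None]
    if missing:
        fresh = fetch_from_online_store(missing)
        for k in missing:
            if k in fresh:
                cache.set(k, fresh[k], ex=300)  # backfill cache with a TTL
        values = [fresh.get(k, v) for k, v in zip(keys, values)]
    return dict(zip(feature_names, values))
```

Backfilling misses on the read path keeps the hit rate high without a separate cache-warming job, at the cost of occasionally serving a value one TTL window stale.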
Feature freshness becomes a critical tradeoff. Precomputing all features and writing them to the store gives low read latency but increases storage and write costs, especially for high-cardinality features like user session aggregates over sliding windows. An alternative is to compute lightweight features on the read path, for example counting recent events from a small in-memory buffer. This reduces storage but costs CPU time and adds milliseconds to the scoring path. Hybrid strategies precompute expensive aggregations offline and compute simple transformations like ratios or differences on read.
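A hybrid read path might look like the sketch below, where the expensive window sums arrive precomputed from the store and only cheap ratios are derived at read time. All feature names are hypothetical:

```python
def derive_read_time_features(precomputed):
    """Add cheap derived features on the read path.

    The window sums and counts are assumed to come precomputed from the
    online store; only ratios and averages are computed here, costing
    microseconds rather than a windowed aggregation.
    """
    amount_5m = precomputed["txn_amount_sum_5m"]
    amount_1h = precomputed["txn_amount_sum_1h"]
    count_5m = precomputed["txn_count_5m"]
    return {
        **precomputed,
        # Burstiness: share of the last hour's spend in the last 5 minutes.
        "amount_ratio_5m_1h": amount_5m / amount_1h if amount_1h else 0.0,
        # Average ticket size over the recent window.
        "avg_amount_5m": amount_5m / count_5m if count_5m else 0.0,
    }
```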
Streaming features add another dimension. For real-time fraud detection, you might need the count of transactions in the last 5 minutes. This requires a streaming pipeline that materializes incremental aggregates with minute-level freshness: the pipeline consumes events from Kafka, updates rolling windows in memory using something like Flink or a custom service, and writes updated features to the online store every few seconds. Freshness service level agreements (SLAs) are often under 1 minute, meaning a feature value is at most 60 seconds stale. This level of freshness can improve model accuracy significantly, for example increasing fraud recall by 10 to 15 percent compared to hourly batch updates, but it requires maintaining complex streaming infrastructure.
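The per-entity rolling state such a pipeline maintains can be sketched with a simple deque-based counter. In practice a Flink job would keep this state with checkpointing and watermarks, so treat this as an illustration of the aggregation logic only; the event shape is an assumption:

```python
import time
from collections import defaultdict, deque

class SlidingWindowCounter:
    """In-memory rolling count per entity, e.g. transactions in the last
    5 minutes, as a streaming consumer might maintain before flushing
    updated values to the online store every few seconds."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # entity_id -> event timestamps

    def record(self, entity_id, ts=None):
        self._events[entity_id].append(ts if ts is not None else time.time())

    def count(self, entity_id, now=None):
        now = now if now is not None else time.time()
        q = self._events[entity_id]
        while q and q[0] <= now - self.window_seconds:
            q.popleft()  # drop events that aged out of the window
        return len(q)
```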
Schema validation and versioning prevent silent training-serving skew. Features are stored with a schema version, and the scoring service validates that it is reading the same feature definition used during training. Mismatches, like a feature that was log-transformed in training but served raw, cause model accuracy to collapse. Monitoring feature distributions in production and comparing them to training distributions catches drift early.
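A minimal sketch of that serving-time check, assuming a registry that records each feature's transformation at training time (the names, versions, and registry shape are hypothetical):

```python
# Recorded at training time; in practice this would live in a feature
# registry rather than a module-level constant.
TRAINING_SCHEMA = {
    "version": "v3",
    "features": {"txn_amount": "log1p", "txn_count_5m": "raw"},
}

def validate_schema(serving_schema, training_schema=TRAINING_SCHEMA):
    """Fail fast on schema mismatch instead of silently serving
    mismatched transformations (e.g. log-transformed in training,
    raw at serving time)."""
    if serving_schema["version"] != training_schema["version"]:
        raise ValueError(
            f"schema version mismatch: serving {serving_schema['version']} "
            f"vs training {training_schema['version']}")
    for name, transform in training_schema["features"].items():
        if serving_schema["features"].get(name) != transform:
            raise ValueError(f"transform mismatch for feature {name!r}")
```

Running this check at service startup (and on every schema deploy) turns a silent accuracy collapse into a loud, immediate failure.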
💡 Key Takeaways
•Three-layer caching: in-process at microseconds to 1ms, regional Redis at 1 to 3ms with a 95 to 99 percent hit rate, and the online store at 5 to 10ms for misses
•Precomputed features reduce read latency but increase storage and write costs, while on-read computation saves storage but adds milliseconds and CPU overhead to scoring
•Streaming feature pipelines materialize incremental aggregates with an under-1-minute freshness SLA, improving fraud recall by 10 to 15 percent compared to hourly batch updates
•Schema versioning and validation prevent training-serving skew, where a feature log-transformed in training gets served raw and model accuracy collapses
•Batched multi-get requests to the regional cache minimize round trips, fetching dozens of features in a single 2ms call instead of sequential 2ms-per-feature reads
•Hybrid strategies precompute expensive window aggregates offline and compute simple transformations like ratios or differences on the read path within the latency budget
📌 Examples
Stripe's online feature store uses DynamoDB with DAX (DynamoDB Accelerator) as an in-memory cache, achieving p99 reads under 5ms for user lifetime features and sub-1-minute freshness for transaction counts
Uber computes driver location and recent trip features in a Flink streaming job that writes to a regional Redis cluster every 10 seconds, supporting tens of thousands of dispatch decisions per second
Amazon's recommendation service batches 50 to 100 feature keys in a single Redis multi-get call, reducing network overhead and completing the feature fetch in 3ms p99 within a single availability zone