
What is Real-Time Scoring and Why is Latency Critical?

Real-time scoring means running a machine learning model directly on the critical request path, where the end user is actively waiting for a response. Unlike batch predictions that run offline over millions of records, real-time inference must complete within strict latency budgets measured in milliseconds. The user experience depends on this speed: a payment authorization, a product recommendation, or a fraud check must happen fast enough that the application feels responsive.

The controlling metric is p99 latency, not average latency. The p99 is the 99th percentile: 99 out of 100 requests complete faster than this threshold. Why focus on p99? Because users experience the worst cases, and a single slow request can time out an entire transaction. If your average is 20ms but your p99 is 200ms, one in every hundred users sees a delay that might fail their payment or cause them to abandon a page. Upstream services typically have their own timeouts in the hundreds of milliseconds, so your scoring service must stay well below that.

Typical latency budgets vary by use case. Fraud detection for payments targets 60 to 100ms p99, because authorization flows often take 300 to 2000ms in total, dominated by bank issuer time. Product ranking and recommendations aim for 50 to 100ms to keep page render times under 2 seconds. Ride-hailing dispatch needs 50 to 150ms to keep matching loops stable while processing tens of thousands of requests per second during peak hours. Large Language Model (LLM) chat has different dynamics: users tolerate seconds of total time but expect 100 to 300ms time to first token so the system feels responsive.

The total latency budget must cover everything: network transit between services, authentication checks, fetching features from storage, transforming those features, running the model computation, applying business rules, and logging the result. Each stage consumes part of your budget, so you must decompose the flow carefully and allocate milliseconds to each component.
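Both ideas are easy to make concrete. The sketch below (Python; the stage budgets and the simulated latency distribution are illustrative assumptions, not figures from this article) computes a nearest-rank p99 over simulated request latencies and checks that a per-stage allocation sums to a 100ms target.

```python
import math
import random

# Hypothetical decomposition of a 100ms p99 budget across the stages
# named above; real allocations depend on your topology and model size.
STAGE_BUDGET_MS = {
    "network_transit": 10,
    "auth_check": 5,
    "feature_fetch": 10,      # e.g. online feature store lookup
    "feature_transform": 10,
    "model_compute": 50,
    "rules_and_logging": 15,
}
assert sum(STAGE_BUDGET_MS.values()) == 100  # the budget must cover everything

def p99(latencies_ms):
    """Nearest-rank 99th percentile: 99% of requests finish at or below it."""
    ordered = sorted(latencies_ms)
    return ordered[math.ceil(0.99 * len(ordered)) - 1]

# Simulate 10,000 request latencies with a long tail (log-normal,
# roughly a 20ms median): the mean looks healthy, the tail does not.
samples = [random.lognormvariate(3.0, 0.6) for _ in range(10_000)]
print(f"avg: {sum(samples) / len(samples):5.1f} ms")  # ~24 ms
print(f"p99: {p99(samples):5.1f} ms")                 # ~80 ms: what users feel
```

The gap between the two printed numbers is the whole argument for tracking p99: an average near 24ms can hide a tail that would already blow a 60ms fraud-scoring budget.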
💡 Key Takeaways
Real-time scoring runs on the critical path where users wait, requiring p99 latencies of tens to hundreds of milliseconds depending on the use case
Fraud detection targets 60 to 100ms p99 for payments, ranking aims for 50 to 100ms, and LLM chat needs 100 to 300ms time to first token
p99 latency matters more than average because users experience the worst cases and upstream services time out on slow requests
Total latency includes network transit, authentication, feature retrieval (2 to 10ms), model compute (2 to 50ms), and post-processing
Throughput varies dramatically: ads ranking handles over 100k requests per second per region while payment fraud scores every authorization
Systems design for 2 to 3 times peak-traffic headroom, with autoscaling and circuit breakers to prevent cascading failures during spikes (see the circuit-breaker sketch after this list)
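As a sketch of that last takeaway, here is a minimal circuit breaker (Python; the thresholds, the `call_scorer` stub, and the rules-only fallback are illustrative assumptions, not part of the original text). When the downstream scorer fails repeatedly, the breaker opens and the caller fails fast to a fallback decision instead of stacking timed-out requests on a struggling service.

```python
import random
import time

class CircuitBreaker:
    """Minimal circuit breaker: after enough consecutive failures, stop
    calling the downstream scorer and fail fast until a cool-down passes."""

    def __init__(self, failure_threshold=5, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # None => circuit closed, traffic flows normally

    def allow_request(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_timeout_s:
            # Half-open: let a probe through to test whether the scorer recovered.
            self.opened_at = None
            self.failures = 0
            return True
        return False  # open: fail fast instead of piling onto a sick service

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

def call_scorer(request):
    """Stand-in for the real model service; fails 30% of the time here."""
    if random.random() < 0.3:
        raise TimeoutError("scorer exceeded its latency budget")
    return {"fraud_score": 0.02}

breaker = CircuitBreaker(failure_threshold=3, reset_timeout_s=5.0)

def score(request):
    if not breaker.allow_request():
        return {"fraud_score": None, "fallback": "rules_only"}  # degrade gracefully
    try:
        result = call_scorer(request)
        breaker.record_success()
        return result
    except TimeoutError:
        breaker.record_failure()
        return {"fraud_score": None, "fallback": "rules_only"}
```

The design choice worth noting is that the fallback is a cheap, always-available decision (here, a rules-only path); failing fast to it keeps the request within its latency budget even when the model service is down.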
📌 Examples
Stripe and PayPal run fraud scoring inline with payment authorizations, targeting tens of milliseconds at p99 across multiple regions to minimize network hops
Amazon's recommendation service has a 50 to 100ms budget within the 2 second page render target, fetching user profiles in 5ms and running two-stage ranking in 10 to 40ms
Uber dispatch processes tens of thousands of requests per second during surge, with each decision completing in under 50 to 150ms to keep the matching control loop stable