
Accuracy vs Latency Trade-offs: Model Cascades and Dynamic Batching

One of the hardest decisions in real-time scoring is balancing model accuracy against inference latency and cost. Larger models, with more parameters, deeper trees, or longer transformer sequences, improve metrics like Area Under the Curve (AUC) or Normalized Discounted Cumulative Gain (NDCG), but they increase p99 latency and burn more CPU or GPU cycles. A transformer-based ranker might achieve 0.85 precision@10 compared to 0.82 for a simpler model, but inference takes 50ms instead of 10ms. Is that 3-point precision gain worth 5 times the latency?

Many companies solve this with model cascades. A tiny gating model runs first and handles 80 to 95 percent of requests. This model is optimized for speed, often a small gradient-boosted tree or logistic regression running in 2 to 5 milliseconds. It classifies requests as clearly safe, clearly risky, or ambiguous. The safe and risky cases get immediate decisions. Only the ambiguous cases, the 5 to 20 percent in the middle, go to a heavier second-stage model that might be a deep neural network taking 20 to 50 milliseconds. This keeps overall p99 latency low while preserving accuracy where it matters.

Batching is another critical trade-off. Processing requests in batches dramatically improves throughput and GPU efficiency. A transformer on GPU might handle 10 requests per second individually but 5000 requests per second with batch size 32. The catch is queueing delay: if you wait to fill a batch, early requests sit idle. Dynamic batching solves this by enforcing a maximum wait time, typically 1 to 5 milliseconds, and flushing whenever the batch is full or the timer expires, whichever comes first. For very tight latency budgets under 20 milliseconds, you might skip batching entirely and run batch size 1 on CPU, accepting lower throughput to eliminate queueing variance.

You also choose between CPUs and accelerators. CPUs offer predictable latency and lower cost for small models like gradient-boosted trees and compact multi-layer perceptrons. GPUs excel at transformers and convolutional networks but need batching to amortize data-transfer and kernel-launch overhead. Under low traffic, GPUs have poor utilization, raising per-request cost. Some teams dynamically switch between CPU and GPU serving profiles based on traffic level, or route simple requests to CPU and complex ones to GPU.
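To make the cascade concrete, here is a minimal sketch in Python. The fast_model and heavy_model objects, the scikit-learn-style predict_proba interface, and the score thresholds are all illustrative assumptions; real thresholds would be tuned per use case.

```python
import numpy as np

# Hypothetical thresholds: scores below LOW are clearly safe, above HIGH clearly
# risky; the band in between is "ambiguous" and escalates to the heavy model.
LOW, HIGH = 0.05, 0.90

def score_with_cascade(features: np.ndarray, fast_model, heavy_model) -> tuple[float, str]:
    """Two-stage cascade: a fast gate decides most requests, a heavy model
    scores only the ambiguous middle band."""
    fast_score = fast_model.predict_proba(features.reshape(1, -1))[0, 1]  # ~2-5 ms

    if fast_score <= LOW:
        return fast_score, "approve"   # clearly safe, decided by the gate
    if fast_score >= HIGH:
        return fast_score, "decline"   # clearly risky, decided by the gate

    # Ambiguous 5-20 percent of traffic: pay the 20-50 ms for the heavy model.
    heavy_score = heavy_model.predict_proba(features.reshape(1, -1))[0, 1]
    return heavy_score, "decline" if heavy_score >= 0.5 else "approve"
```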
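Dynamic batching can likewise be sketched with a plain queue and a timer. The 32-item batch size, 5ms max wait, and run_batch_inference placeholder below are illustrative assumptions, not any particular serving framework's API.

```python
import queue
import threading
import time

MAX_BATCH_SIZE = 32       # illustrative: flush as soon as 32 requests are waiting
MAX_WAIT_SECONDS = 0.005  # illustrative: never hold a request longer than 5 ms

request_queue: "queue.Queue[tuple[object, queue.Queue]]" = queue.Queue()

def run_batch_inference(batch):
    """Placeholder for the real model call; scores a whole batch at once."""
    return [0.0 for _ in batch]

def batching_loop():
    """Flush whenever the batch is full or the oldest request has waited too long."""
    while True:
        first_item = request_queue.get()  # block until at least one request arrives
        batch, deadline = [first_item], time.monotonic() + MAX_WAIT_SECONDS
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break  # timer expired: flush a partial batch
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        scores = run_batch_inference([features for features, _ in batch])
        for (_, reply_queue), score in zip(batch, scores):
            reply_queue.put(score)  # hand each caller its own result

threading.Thread(target=batching_loop, daemon=True).start()

def score(features) -> float:
    """Caller-side helper: enqueue one request and wait for its score."""
    reply: queue.Queue = queue.Queue(maxsize=1)
    request_queue.put((features, reply))
    return reply.get()
```

The key design choice is that the deadline is set by the oldest request in the batch, so no caller ever waits more than MAX_WAIT_SECONDS in the queue regardless of traffic level.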
💡 Key Takeaways
Model cascades use a fast gating model in 2 to 5ms to handle 80 to 95 percent of traffic, routing only ambiguous cases to a heavier model taking 20 to 50ms
Dynamic batching caps queueing delay with a 1 to 5ms max wait; batching itself lifts GPU throughput from 10 to 5000 requests per second, but even a few milliseconds of queueing can break tight latency budgets
Batch size 1 on CPU eliminates queueing variance for sub-20ms targets, trading throughput for predictable latency with gradient-boosted trees or small neural nets
CPUs give predictable latency and lower cost per request for small models, while GPUs need batching to amortize overhead and can have poor utilization under low traffic
A 3-point precision gain, from 0.82 to 0.85 precision@10, might cost 5 times the latency, requiring careful cost-benefit analysis per use case
Teams sometimes route by request complexity or switch hardware profiles dynamically based on traffic, running simple requests on CPU and complex ones on GPU
📌 Examples
Amazon uses two-stage ranking: a lightweight candidate generator filters thousands of items in under 10ms, then a heavier ranker scores the top 100 in 20 to 40ms
Uber dispatch employs a cheap filtering model to remove obviously bad driver matches in under 10ms, then refines the remaining candidates with a precise model in the remaining latency budget
A fraud detection system might run a logistic regression on all transactions in 3ms, escalating only the middle 10 percent risk band to a deep neural network that takes 30ms
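Taking that last example's numbers at face value, a quick back-of-the-envelope calculation shows why the cascade pays off on average:

```python
# Illustrative arithmetic using the fraud example's numbers above.
gate_ms, heavy_ms, escalation_rate = 3.0, 30.0, 0.10

average_ms = gate_ms + escalation_rate * heavy_ms   # 3 + 0.1 * 30 = 6.0 ms on average
worst_case_ms = gate_ms + heavy_ms                  # escalated requests still see ~33 ms
print(average_ms, worst_case_ms)
```

The average drops to about 6ms, although the escalated 10 percent still pay roughly 33ms end to end, so the tail latency budget must still accommodate the heavy model.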