
Online Serving Architecture: Dynamic Batching and Caching

Online serving must balance latency, throughput, and cost to meet strict Service Level Objectives (SLOs) for user-facing features. The core challenge is that per-request inference is inefficient on GPUs, while naive batching destroys latency. Dynamic batching resolves this by grouping requests within a maximum wait budget, typically 2 to 8 ms, trading a small latency increase for a substantial throughput gain. A stateless inference tier sits behind load balancers with request-level batching: when a request arrives, it joins a batch queue, and the system waits up to the configured timeout for additional requests before dispatching the batch to the GPU. This transforms 20 individual inferences taking 60 ms total (3 ms each, serialized) into one batch completing in 10 ms, improving throughput by 6x while adding at most 8 ms of queueing delay. At batch sizes of 16 to 32, a single A100 GPU achieves 250 queries per second for ResNet50-class models while maintaining p99 latency under 100 ms including network overhead.

Caching is critical for cost efficiency. Content-hash caches store predictions keyed by image hash and model version, achieving 80 to 95 percent hit rates for repeated images; Pinterest reports that product catalog images hit cache 92% of the time, reducing inference cost by 12x. Cache misses route to the GPU tier, so cache hit rate directly determines fleet size. At 20,000 queries per second with a 90% cache hit rate, only 2,000 QPS need GPU inference; at 250 QPS per GPU at target latency, this requires 8 to 12 GPUs including headroom for variance.

Admission control and circuit breakers prevent cascade failures. Systems enforce deadline-aware scheduling in which each request carries a client timeout: if queueing time plus expected batch time exceeds the deadline, the request is rejected immediately rather than consuming resources on a response the client will discard. Circuit breakers trip when error rates exceed a threshold, for example 5% errors in a 10-second window, routing traffic to a fast fallback model or returning coarse labels until the primary recovers. The three sketches below illustrate each of these mechanisms in turn.
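
The dynamic-batching loop can be sketched in a few lines. This is a minimal, framework-agnostic illustration rather than a production server; `run_batch_inference`, the request dictionaries, and the reply callback are hypothetical placeholders, while the batch size and wait budget use the values quoted above.

```python
import queue
import threading
import time

MAX_BATCH_SIZE = 32   # dispatch immediately once the batch is full
MAX_WAIT_MS = 8       # maximum time a request waits for batch-mates

request_queue: "queue.Queue[dict]" = queue.Queue()

def run_batch_inference(batch):
    # Placeholder for the real batched GPU forward pass.
    return [{"request_id": r["id"], "label": "unknown"} for r in batch]

def batching_loop():
    while True:
        first = request_queue.get()    # block until at least one request arrives
        batch = [first]
        deadline = time.monotonic() + MAX_WAIT_MS / 1000.0
        # Collect more requests until the batch fills or the wait budget expires.
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        results = run_batch_inference(batch)
        for request, result in zip(batch, results):
            request["reply"](result)   # hand the result back to the caller

threading.Thread(target=batching_loop, daemon=True).start()

# Example: submit one request and wait for its reply.
done = threading.Event()
request_queue.put({"id": 1, "reply": lambda result: (print(result), done.set())})
done.wait(timeout=1.0)
```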
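
A similarly compact sketch covers the content-hash cache and the fleet-sizing arithmetic. The in-process dict stands in for a shared cache such as Redis, and the model-version tag and 30% headroom factor are assumptions for illustration; the worked example reproduces the 20,000 QPS, 90% hit rate, and 250 QPS-per-GPU figures from the text.

```python
import hashlib
import math

MODEL_VERSION = "resnet50-v3"   # hypothetical version tag

prediction_cache: dict[str, dict] = {}

def cache_key(image_bytes: bytes, model_version: str = MODEL_VERSION) -> str:
    # Key on the content hash so identical images hit regardless of URL or filename.
    return f"{model_version}:{hashlib.sha256(image_bytes).hexdigest()}"

def classify(image_bytes: bytes, run_gpu_inference) -> dict:
    key = cache_key(image_bytes)
    if key in prediction_cache:
        return prediction_cache[key]          # cache hit: no GPU work
    result = run_gpu_inference(image_bytes)   # cache miss: pay for inference
    prediction_cache[key] = result
    return result

def gpus_needed(total_qps: float, hit_rate: float, qps_per_gpu: float,
                headroom: float = 1.3) -> int:
    # Only cache misses reach the GPU tier; size the fleet for them plus headroom.
    miss_qps = total_qps * (1.0 - hit_rate)
    return math.ceil(miss_qps / qps_per_gpu * headroom)

# Worked example from the text: 20,000 QPS, 90% hit rate, 250 QPS per GPU.
print(gpus_needed(20_000, 0.90, 250))   # 11 GPUs with ~30% headroom
```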
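
Finally, the admission-control and circuit-breaker rules map directly to code. The thresholds mirror the text (reject when queueing plus expected batch time exceeds the client deadline; trip at 5% errors over a 10-second window); the minimum-request guard and the fallback-routing comment are illustrative assumptions.

```python
import collections
import time

EXPECTED_BATCH_MS = 10.0   # expected time for one batched forward pass

def admit(queue_wait_ms: float, client_timeout_ms: float) -> bool:
    # Reject immediately if we cannot answer before the client gives up.
    return queue_wait_ms + EXPECTED_BATCH_MS <= client_timeout_ms

class CircuitBreaker:
    def __init__(self, error_threshold: float = 0.05, window_s: float = 10.0,
                 min_requests: int = 100):
        self.error_threshold = error_threshold
        self.window_s = window_s
        self.min_requests = min_requests    # avoid tripping on tiny samples
        self.events = collections.deque()   # (timestamp, was_error) pairs

    def record(self, was_error: bool) -> None:
        now = time.monotonic()
        self.events.append((now, was_error))
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()

    def is_open(self) -> bool:
        if len(self.events) < self.min_requests:
            return False
        errors = sum(1 for _, was_error in self.events if was_error)
        return errors / len(self.events) > self.error_threshold

breaker = CircuitBreaker()
# Per request: if breaker.is_open(), route to the fast fallback model;
# otherwise serve from the primary and call breaker.record(error) afterwards.
```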
💡 Key Takeaways
Dynamic batching waits 2 to 8 ms to group requests, improving throughput 6x while adding minimal latency compared to serialized individual inference
A single A100 GPU achieves 250 queries per second for ResNet50-sized models at batch sizes 16 to 32 with p99 latency under 100 ms including network and queueing
Content-hash caching achieves 80 to 95 percent hit rates, reducing inference cost by 10 to 12x and directly determining GPU fleet size requirements
At 20,000 QPS with a 90% cache hit rate, only 2,000 QPS need GPU inference, requiring 8 to 12 GPUs with headroom at 250 QPS per GPU
Deadline-aware scheduling rejects requests immediately if queueing plus batch time exceeds the client timeout, preventing wasted compute on discarded responses
Circuit breakers trip at error thresholds such as 5% errors in a 10-second window, routing traffic to fast fallback models or coarse labels until the primary service recovers
📌 Examples
Pinterest product classification: 92% cache hit rate on catalog images reduces GPU fleet from 100 to 8 instances, cutting inference cost from $80K to $7K per month
Dynamic batching example: 20 individual requests at 3 ms each, serialized, take 60 ms total; one batch of 20 completes in 10 ms with at most 8 ms queue wait, a 6x throughput improvement
Meta content moderation: circuit breaker trips when the error rate exceeds 5% in a 10-second window, routing to a CPU fallback model that returns a binary safe/unsafe verdict in 50 ms versus full classification in 80 ms