Online Serving Architecture: Dynamic Batching and Caching
Serving a trained model at scale requires careful system design. The model itself is only part of the solution. Infrastructure for batching, caching, and load balancing determines whether you achieve target latency and throughput.
Dynamic Batching
GPUs excel at parallel computation. Processing one image takes 5ms; processing 32 images takes 8ms, i.e. 0.25ms per image amortized, a roughly 20x throughput gain. Without batching, you waste 90%+ of GPU capacity on memory transfers and kernel launches.
How it works: Requests queue until either a batch fills (e.g., 32 images) or a timeout expires (e.g., 10ms). Larger batches raise throughput but add queueing latency: under a 50ms batching window, a lone request waits the full 50ms before inference even starts, so the timeout sets a latency floor whenever traffic is too light to fill batches.
Adaptive batching: During traffic spikes, batches fill quickly and latency stays low. During low traffic, timeout triggers before batches fill, preventing requests from waiting forever.
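A minimal sketch of this queue-and-timeout loop using Python's asyncio; the MAX_BATCH and MAX_WAIT_S values and the run_inference callable are illustrative placeholders, not part of any specific serving framework:

```python
import asyncio
import time

MAX_BATCH = 32      # flush when this many requests are queued (assumed value)
MAX_WAIT_S = 0.010  # ...or when the oldest request has waited 10ms (assumed)

queue: asyncio.Queue = asyncio.Queue()

async def batching_loop(run_inference):
    """Collect requests until the batch fills or the timeout expires."""
    while True:
        item = await queue.get()            # block until the first request
        batch = [item]
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break                       # timeout: flush a partial batch
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        inputs = [inp for inp, _ in batch]
        results = run_inference(inputs)     # one GPU call for the whole batch
        for (_, fut), res in zip(batch, results):
            fut.set_result(res)             # wake each waiting caller

async def predict(x):
    """Called per request: enqueue the input and await its result."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut
```

Each caller parks on a future, so the loop can serve all queued requests with a single GPU call and wake them together; under load the inner loop exits on batch size rather than timeout, which is the adaptive behavior described above.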
Caching Strategies
Result caching: Store (image_hash, class_prediction) pairs. When the same image reappears, return the cached result without inference. Hit rates of 20-40% are typical for user upload scenarios.
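A sketch of the result-cache lookup, assuming raw image bytes as input; the SHA-256 keying, the LRU eviction policy, and CACHE_SIZE are illustrative choices rather than requirements:

```python
import hashlib
from collections import OrderedDict

CACHE_SIZE = 100_000  # assumed capacity

class ResultCache:
    """Bounded LRU map from image-content hash to predicted class."""
    def __init__(self, capacity=CACHE_SIZE):
        self.capacity = capacity
        self._store: OrderedDict[str, str] = OrderedDict()

    @staticmethod
    def key(image_bytes: bytes) -> str:
        # Hash the raw bytes so identical uploads share one entry.
        return hashlib.sha256(image_bytes).hexdigest()

    def get(self, image_bytes: bytes):
        k = self.key(image_bytes)
        if k in self._store:
            self._store.move_to_end(k)       # refresh LRU position
            return self._store[k]
        return None                          # miss: run inference instead

    def put(self, image_bytes: bytes, prediction: str):
        self._store[self.key(image_bytes)] = prediction
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least-recently-used
```

A serving handler would call get() first and fall through to the model only on a miss, then put() the fresh prediction.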
Embedding caching: Store intermediate embeddings from the model backbone. For similar images, retrieve nearby embeddings and compare distances. Useful for near-duplicate detection.
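A brute-force sketch of embedding lookup using a cosine-similarity threshold; SIM_THRESHOLD is an assumed cutoff, and a production system would typically replace the linear scan with an approximate-nearest-neighbor index:

```python
import numpy as np

SIM_THRESHOLD = 0.95  # assumed near-duplicate cutoff

class EmbeddingCache:
    """Stores unit-normalized backbone embeddings with their predictions."""
    def __init__(self, dim: int):
        self._vecs = np.empty((0, dim), dtype=np.float32)
        self._preds: list[str] = []

    def add(self, emb: np.ndarray, prediction: str):
        emb = emb / np.linalg.norm(emb)      # store unit vectors
        self._vecs = np.vstack([self._vecs, emb])
        self._preds.append(prediction)

    def lookup(self, emb: np.ndarray):
        """Return a cached prediction if a near-duplicate embedding exists."""
        if not self._preds:
            return None
        emb = emb / np.linalg.norm(emb)
        sims = self._vecs @ emb              # cosine similarity via dot product
        best = int(np.argmax(sims))
        return self._preds[best] if sims[best] >= SIM_THRESHOLD else None
```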
Load Balancing
Model servers sit behind a load balancer that distributes requests. Round-robin works for homogeneous servers. For mixed GPU types, use weighted distribution based on server throughput capacity. Health checks remove failing servers from rotation within seconds.
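A sketch of weighted selection over healthy backends; the server names, weights, and health-flag mechanism are hypothetical stand-ins for whatever your load balancer exposes:

```python
import random

# Weight ≈ relative throughput capacity of each server (assumed values).
SERVERS = {
    "gpu-a100-1": 4.0,
    "gpu-a100-2": 4.0,
    "gpu-t4-1":   1.0,
}

healthy = dict.fromkeys(SERVERS, True)

def mark_health(name: str, ok: bool):
    """Called by the health checker; failing servers leave rotation."""
    healthy[name] = ok

def pick_server() -> str:
    """Weighted random choice over healthy servers, which converges to
    the same traffic split as weighted round-robin in expectation."""
    candidates = [s for s in SERVERS if healthy[s]]
    if not candidates:
        raise RuntimeError("no healthy backends")
    weights = [SERVERS[s] for s in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]
```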