
GPU Inference Scheduling and Batching Strategies

GPU scheduling balances throughput against latency predictability. Dynamic batching increases GPU utilization from 30% to 85% by grouping multiple frames into a single inference pass, but introduces queuing delay that can violate tight latency Service Level Objectives (SLOs). A ResNet-50 detector processes a single 640p frame in 25 milliseconds on an NVIDIA T4 GPU, but can process a batch of 8 frames in 80 milliseconds, improving throughput from 40 to 100 frames per second (FPS). The cost is queuing: the first frame to arrive waits for 7 more frames to fill the batch, and at 25 millisecond arrival spacing that adds up to 175 milliseconds of queuing delay on top of the 80 millisecond inference time. For tight SLOs under 200 milliseconds, keep batch sizes small or dedicate one stream per GPU execution context.

A hybrid policy does both. High-priority video streams get dedicated contexts with batch size 1, guaranteeing consistent 25 millisecond inference with no queuing. Lower-priority streams share a batching context with a maximum batch size of 8 and a 50 millisecond timeout: a batch is dispatched once 8 frames have accumulated or 50 milliseconds have elapsed, whichever comes first. This hybrid approach serves the 10% of traffic that is high priority with p99 latency under 50 milliseconds while processing the remaining 90% at 3x higher throughput.

Admission control prevents GPU overload. Define a memory budget per GPU based on model size and maximum batch size. A T4 GPU with 16 gigabytes (GB) of memory running a 500 megabyte (MB) model at batch size 8 requires roughly 8 GB for activations and intermediate tensors, leaving an 8 GB margin. Track allocated streams per GPU and reject new admissions when the memory budget is exhausted. Some systems pre-warm models by loading weights and allocating buffers during startup, avoiding the 2 to 5 second cold-start latency when the first inference request arrives.

Quantization and pruning reduce per-frame inference time at the cost of potential accuracy degradation. INT8 quantization reduces ResNet-50 inference from 25 to 8 milliseconds on a T4 with TensorRT, a 3x speedup, while typically degrading mean Average Precision (mAP) by 1 to 3 percentage points. Production systems run fast quantized detectors for real-time feedback at 30 FPS, then periodically run full-precision verifiers on sampled frames to catch false negatives, balancing latency and accuracy.
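The timeout-based batching policy above (flush at 8 frames or 50 milliseconds, whichever comes first) plus the dedicated high-priority path can be sketched in a few dozen lines. This is a minimal illustration rather than a production scheduler: `run_inference`, `handle_high_priority`, and `batching_worker` are hypothetical names, and the GPU call is simulated with a sleep.

```python
import queue
import threading
import time

MAX_BATCH = 8            # flush once 8 frames have accumulated...
BATCH_TIMEOUT_S = 0.050  # ...or once 50 ms have elapsed, whichever comes first

def run_inference(frames):
    # Placeholder for the real GPU call (e.g. a TensorRT execution context).
    time.sleep(0.080 if len(frames) > 1 else 0.025)  # ~80 ms batched, ~25 ms single
    return [f"detections for {frame}" for frame in frames]

def handle_high_priority(frame):
    # Dedicated context, batch size 1: consistent ~25 ms with no queuing delay.
    return run_inference([frame])[0]

def batching_worker(frame_queue, stop):
    # Shared context for lower-priority streams.
    while not stop.is_set():
        batch = []
        deadline = 0.0
        while len(batch) < MAX_BATCH:
            if not batch:
                wait = 0.1  # idle: poll briefly so the stop flag is still honoured
            else:
                wait = max(0.0, deadline - time.monotonic())
                if wait == 0.0:
                    break  # 50 ms deadline reached -> flush the partial batch
            try:
                frame = frame_queue.get(timeout=wait)
            except queue.Empty:
                if batch:
                    break  # deadline reached while waiting -> flush
                if stop.is_set():
                    return
                continue   # still idle, keep waiting for the first frame
            batch.append(frame)
            if len(batch) == 1:
                deadline = time.monotonic() + BATCH_TIMEOUT_S  # start the 50 ms clock
        if batch:
            run_inference(batch)

if __name__ == "__main__":
    stop = threading.Event()
    low_priority_frames = queue.Queue()
    threading.Thread(target=batching_worker,
                     args=(low_priority_frames, stop), daemon=True).start()
    for i in range(20):                  # low-priority frames arriving at ~100 FPS
        low_priority_frames.put(f"frame-{i}")
        time.sleep(0.010)
    print(handle_high_priority("urgent-frame"))  # bypasses the batching queue
    time.sleep(0.5)
    stop.set()
```

In a real system the batching worker would run per GPU execution context and hand results back to the owning streams through futures or callbacks rather than discarding them as this sketch does.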
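Admission control reduces to bookkeeping against the per-GPU memory budget: each admitted stream reserves a worst-case activation footprint for its batch size, and new streams are rejected once the budget runs out. A sketch using the T4 figures from the section; `GpuBudget`, `AdmissionController`, and the 1 GB-per-frame activation estimate are illustrative assumptions, not measured values.

```python
from dataclasses import dataclass, field

GIB = 1024 ** 3

@dataclass
class GpuBudget:
    """Memory bookkeeping for one GPU."""
    total_bytes: int = 16 * GIB          # T4: 16 GB
    model_bytes: int = 500 * 1024 ** 2   # ~500 MB of weights, loaded once at startup (pre-warming)
    reserved_bytes: int = 0              # activations reserved by admitted streams

    def available(self) -> int:
        return self.total_bytes - self.model_bytes - self.reserved_bytes

@dataclass
class AdmissionController:
    gpu: GpuBudget = field(default_factory=GpuBudget)
    # Rough worst-case activation footprint per frame slot (assumption:
    # ~8 GB for a batch-8 context => ~1 GB per frame).
    per_frame_activation_bytes: int = 1 * GIB

    def try_admit(self, batch_size: int) -> bool:
        """Admit a stream only if its worst-case activations fit in the budget."""
        needed = batch_size * self.per_frame_activation_bytes
        if needed > self.gpu.available():
            return False  # budget exhausted: reject rather than risk an OOM mid-inference
        self.gpu.reserved_bytes += needed
        return True

    def release(self, batch_size: int) -> None:
        self.gpu.reserved_bytes -= batch_size * self.per_frame_activation_bytes

if __name__ == "__main__":
    ctl = AdmissionController()
    admitted = sum(ctl.try_admit(batch_size=8) for _ in range(3))
    print(f"admitted {admitted} batch-8 streams, "
          f"{ctl.gpu.available() / GIB:.1f} GiB left")
```

Because weights and buffers are assumed to be pre-warmed at startup, `model_bytes` is treated as a fixed cost instead of being paid on the first request.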
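The fast-detector/verifier pattern described above (quantized model on every frame, full-precision model on sampled frames) can be expressed as a small sampling loop. A sketch under the assumption of two already-loaded callables; `detect_int8` and `detect_fp32` are placeholder names and the comparison logic is deliberately simplified.

```python
VERIFY_EVERY_N = 10  # run the full-precision model on roughly 1 in 10 frames

def detect_int8(frame):
    # Placeholder for the ~8 ms INT8 TensorRT engine (fast, slightly lower mAP).
    return {"boxes": []}

def detect_fp32(frame):
    # Placeholder for the ~25 ms full-precision model (slower, reference quality).
    return {"boxes": []}

def process_stream(frames):
    # Real-time path: quantized detections on every frame at ~30 FPS.
    for i, frame in enumerate(frames):
        detections = detect_int8(frame)
        yield detections
        if i % VERIFY_EVERY_N == 0:
            # Sampled verification; in practice this runs asynchronously so it
            # never blocks the real-time path.
            reference = detect_fp32(frame)
            missed = len(reference["boxes"]) - len(detections["boxes"])
            if missed > 0:
                print(f"quantized detector missed {missed} objects on frame {i}")

if __name__ == "__main__":
    list(process_stream(f"frame-{i}" for i in range(30)))
```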
💡 Key Takeaways
Dynamic batching improves GPU utilization from 30% to 85% and throughput from 40 to 100 FPS, but adds up to 175ms of queuing delay for the earliest frame in a batch of 8
Hybrid scheduling dedicates batch-size-1 GPU contexts to the 10% of streams that are high priority, achieving p99 latency under 50ms, while batching the remaining 90% at 3x higher throughput
Admission control tracks a per-GPU memory budget and rejects new streams when it is exhausted; a T4 with 16 GB running a 500 MB model at batch 8 needs roughly 8 GB for activations
INT8 quantization reduces ResNet-50 inference from 25ms to 8ms on T4 with TensorRT, a 3x speedup, while degrading mAP by 1 to 3 percentage points
Timeout-based batching waits up to 50ms or until 8 frames arrive, whichever comes first, bounding worst-case latency while improving average throughput
Pre-warming loads model weights and allocates buffers during startup, avoiding the 2 to 5 second cold-start latency on the first inference request
📌 Examples
Netflix video analysis pipeline uses dedicated contexts for the 5% of streams requiring sub-100ms face detection latency, and batches the remaining 95% with batch size 16 and a 100ms timeout for 4x higher throughput
Autonomous vehicle perception runs an INT8 quantized detector at 8ms per frame for 30 FPS real-time output, and samples 1 in 10 frames for full-precision verification to maintain recall above 98%