
GPU Inference Scheduling and Batching Strategies

GPU scheduling balances throughput against latency predictability. Dynamic batching increases GPU utilization from 30% to 85% by grouping multiple frames into a single inference pass, but introduces queuing delay that can violate tight latency Service Level Objectives (SLOs). A ResNet-50 detector processes a single 640p frame in 25 milliseconds on an NVIDIA T4 GPU, but can process a batch of 8 frames in 80 milliseconds, improving throughput from 40 to 100 frames per second (FPS). The cost is queuing: the first frame to arrive waits for 7 more frames to fill the batch, and at 25 millisecond arrival spacing that adds up to 175 milliseconds of queuing delay on top of the 80 millisecond inference time. For tight SLOs under 200 milliseconds, keep batch sizes small or dedicate one stream per GPU execution context.

A hybrid policy does both. High-priority video streams get dedicated contexts with batch size 1, guaranteeing consistent 25 millisecond inference with no queuing. Lower-priority streams share a batching context with a maximum batch size of 8 and a 50 millisecond timeout: a batch is dispatched once 8 frames have accumulated or 50 milliseconds have elapsed, whichever comes first. This hybrid approach serves the 10% of traffic that is high priority with p99 latency under 50 milliseconds while processing the remaining 90% at 3x higher throughput.

Admission control prevents GPU overload. Define a memory budget per GPU based on model size and maximum batch size. A T4 GPU with 16 gigabytes (GB) of memory running a 500 megabyte (MB) model at batch size 8 requires roughly 8 GB for activations and intermediate tensors, leaving an 8 GB margin. Track allocated streams per GPU and reject new admissions when the memory budget is exhausted. Some systems pre-warm models by loading weights and allocating buffers during startup, avoiding the 2 to 5 second cold-start latency when the first inference request arrives.

Quantization and pruning reduce per-frame inference time at the cost of potential accuracy degradation. INT8 quantization reduces ResNet-50 inference from 25 to 8 milliseconds on a T4 with TensorRT, a 3x speedup, while typically degrading mean Average Precision (mAP) by 1 to 3 percentage points. Production systems run fast quantized detectors for real-time feedback at 30 FPS, then periodically run full-precision verifiers on sampled frames to catch false negatives, balancing latency and accuracy.
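The timeout-based batching policy above (flush at 8 frames or 50 milliseconds, whichever comes first) plus the dedicated high-priority path can be sketched in a few dozen lines. This is a minimal illustration rather than a production scheduler: `run_inference`, `handle_high_priority`, and `batching_worker` are hypothetical names, and the GPU call is simulated with a sleep.

```python
import queue
import threading
import time

MAX_BATCH = 8            # flush once 8 frames have accumulated...
BATCH_TIMEOUT_S = 0.050  # ...or once 50 ms have elapsed, whichever comes first

def run_inference(frames):
    # Placeholder for the real GPU call (e.g. a TensorRT execution context).
    time.sleep(0.080 if len(frames) > 1 else 0.025)  # ~80 ms batched, ~25 ms single
    return [f"detections for {frame}" for frame in frames]

def handle_high_priority(frame):
    # Dedicated context, batch size 1: consistent ~25 ms with no queuing delay.
    return run_inference([frame])[0]

def batching_worker(frame_queue, stop):
    # Shared context for lower-priority streams.
    while not stop.is_set():
        batch = []
        deadline = 0.0
        while len(batch) < MAX_BATCH:
            if not batch:
                wait = 0.1  # idle: poll briefly so the stop flag is still honoured
            else:
                wait = max(0.0, deadline - time.monotonic())
                if wait == 0.0:
                    break  # 50 ms deadline reached -> flush the partial batch
            try:
                frame = frame_queue.get(timeout=wait)
            except queue.Empty:
                if batch:
                    break  # deadline reached while waiting -> flush
                if stop.is_set():
                    return
                continue   # still idle, keep waiting for the first frame
            batch.append(frame)
            if len(batch) == 1:
                deadline = time.monotonic() + BATCH_TIMEOUT_S  # start the 50 ms clock
        if batch:
            run_inference(batch)

if __name__ == "__main__":
    stop = threading.Event()
    low_priority_frames = queue.Queue()
    threading.Thread(target=batching_worker,
                     args=(low_priority_frames, stop), daemon=True).start()
    for i in range(20):                  # low-priority frames arriving at ~100 FPS
        low_priority_frames.put(f"frame-{i}")
        time.sleep(0.010)
    print(handle_high_priority("urgent-frame"))  # bypasses the batching queue
    time.sleep(0.5)
    stop.set()
```

In a real system the batching worker would run per GPU execution context and hand results back to the owning streams through futures or callbacks rather than discarding them as this sketch does.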
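Admission control reduces to bookkeeping against the per-GPU memory budget: each admitted stream reserves a worst-case activation footprint for its batch size, and new streams are rejected once the budget runs out. A sketch using the T4 figures from the section; `GpuBudget`, `AdmissionController`, and the 1 GB-per-frame activation estimate are illustrative assumptions, not measured values.

```python
from dataclasses import dataclass, field

GIB = 1024 ** 3

@dataclass
class GpuBudget:
    """Memory bookkeeping for one GPU."""
    total_bytes: int = 16 * GIB          # T4: 16 GB
    model_bytes: int = 500 * 1024 ** 2   # ~500 MB of weights, loaded once at startup (pre-warming)
    reserved_bytes: int = 0              # activations reserved by admitted streams

    def available(self) -> int:
        return self.total_bytes - self.model_bytes - self.reserved_bytes

@dataclass
class AdmissionController:
    gpu: GpuBudget = field(default_factory=GpuBudget)
    # Rough worst-case activation footprint per frame slot (assumption:
    # ~8 GB for a batch-8 context => ~1 GB per frame).
    per_frame_activation_bytes: int = 1 * GIB

    def try_admit(self, batch_size: int) -> bool:
        """Admit a stream only if its worst-case activations fit in the budget."""
        needed = batch_size * self.per_frame_activation_bytes
        if needed > self.gpu.available():
            return False  # budget exhausted: reject rather than risk an OOM mid-inference
        self.gpu.reserved_bytes += needed
        return True

    def release(self, batch_size: int) -> None:
        self.gpu.reserved_bytes -= batch_size * self.per_frame_activation_bytes

if __name__ == "__main__":
    ctl = AdmissionController()
    admitted = sum(ctl.try_admit(batch_size=8) for _ in range(3))
    print(f"admitted {admitted} batch-8 streams, "
          f"{ctl.gpu.available() / GIB:.1f} GiB left")
```

Because weights and buffers are assumed to be pre-warmed at startup, `model_bytes` is treated as a fixed cost instead of being paid on the first request.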
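The fast-detector/verifier pattern described above (quantized model on every frame, full-precision model on sampled frames) can be expressed as a small sampling loop. A sketch under the assumption of two already-loaded callables; `detect_int8` and `detect_fp32` are placeholder names and the comparison logic is deliberately simplified.

```python
VERIFY_EVERY_N = 10  # run the full-precision model on roughly 1 in 10 frames

def detect_int8(frame):
    # Placeholder for the ~8 ms INT8 TensorRT engine (fast, slightly lower mAP).
    return {"boxes": []}

def detect_fp32(frame):
    # Placeholder for the ~25 ms full-precision model (slower, reference quality).
    return {"boxes": []}

def process_stream(frames):
    # Real-time path: quantized detections on every frame at ~30 FPS.
    for i, frame in enumerate(frames):
        detections = detect_int8(frame)
        yield detections
        if i % VERIFY_EVERY_N == 0:
            # Sampled verification; in practice this runs asynchronously so it
            # never blocks the real-time path.
            reference = detect_fp32(frame)
            missed = len(reference["boxes"]) - len(detections["boxes"])
            if missed > 0:
                print(f"quantized detector missed {missed} objects on frame {i}")

if __name__ == "__main__":
    list(process_stream(f"frame-{i}" for i in range(30)))
```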
💡 Key Takeaways
Dynamic batching improves GPU utilization from 30% to 85% and throughput from 40 to 100 FPS, but adds up to 175ms of queuing delay for the earliest frame in a batch of 8
Hybrid scheduling dedicates batch-size-1 GPU contexts to the 10% of streams that are high priority, achieving p99 latency under 50ms, while batching the remaining 90% at 3x higher throughput
Admission control tracks a per-GPU memory budget and rejects new streams when it is exhausted; a T4 with 16 GB running a 500 MB model at batch 8 needs roughly 8 GB for activations
INT8 quantization reduces ResNet-50 inference from 25ms to 8ms on T4 with TensorRT, a 3x speedup, while degrading mAP by 1 to 3 percentage points
Timeout-based batching waits up to 50ms or until 8 frames arrive, whichever comes first, bounding worst-case latency while improving average throughput
Pre-warming loads model weights and allocates buffers during startup, avoiding the 2 to 5 second cold-start latency on the first inference request
📌 Examples
Netflix video analysis pipeline uses dedicated contexts for the 5% of streams requiring sub-100ms face detection latency, and batches the remaining 95% with batch size 16 and a 100ms timeout for 4x higher throughput
Autonomous vehicle perception runs an INT8 quantized detector at 8ms per frame for 30 FPS real-time output, and samples 1 in 10 frames for full-precision verification to maintain recall above 98%