GPU Inference Scheduling and Batching Strategies
Dynamic Batching for Video
GPUs process batches more efficiently than individual frames: a batch of 8 frames might take 20 ms where 8 individual inferences take 80 ms. But batching adds latency, because early frames wait for the batch to fill before any of them are processed.
Batch formation strategies: wait for N frames (fixed batch size) or wait T milliseconds (timeout). Fixed size maximizes throughput but adds variable latency; timeout caps latency but produces variable batch sizes. In practice, schedulers combine both: dispatch when the batch reaches N frames or when T milliseconds have elapsed, whichever comes first.
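The combined N-or-T policy can be sketched as follows. This is an illustrative implementation, not a specific framework's API; the class and method names are assumptions.

```python
import time
from collections import deque

class FrameBatcher:
    """Dispatch a batch when it reaches max_size frames, or when
    timeout_ms has elapsed since the first queued frame arrived,
    whichever comes first. (Illustrative sketch; names are hypothetical.)"""

    def __init__(self, max_size=8, timeout_ms=20):
        self.max_size = max_size
        self.timeout_s = timeout_ms / 1000.0
        self.pending = deque()
        self.first_arrival = None

    def add(self, frame, now=None):
        """Queue a frame; return a full batch if one is ready, else None."""
        now = time.monotonic() if now is None else now
        if not self.pending:
            self.first_arrival = now  # timeout clock starts at first frame
        self.pending.append(frame)
        if len(self.pending) >= self.max_size:
            return self._flush()
        return None

    def poll(self, now=None):
        """Call periodically: flush a partial batch once the timeout expires."""
        now = time.monotonic() if now is None else now
        if self.pending and now - self.first_arrival >= self.timeout_s:
            return self._flush()
        return None

    def _flush(self):
        batch = list(self.pending)
        self.pending.clear()
        self.first_arrival = None
        return batch
```

The timeout clock starts when the first frame of a batch arrives, so the dispatch delay for any frame is bounded by `timeout_ms` regardless of arrival rate.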
Multi-Stream Batching
When processing multiple cameras, batch frames from different streams together. Camera A and Camera B each contribute 4 frames to an 8-frame batch. Both streams benefit from GPU efficiency without either waiting too long.
Stream prioritization: Some cameras matter more than others. Entrance cameras get priority over parking lot cameras. Priority affects batch ordering and timeout handling.
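One way to realize priority-aware multi-stream batching is a priority queue that drains higher-priority streams first when assembling a batch. This is a minimal sketch under assumed names; a production scheduler would also enforce per-stream fairness and timeouts.

```python
import heapq
import itertools

class MultiStreamBatcher:
    """Merge frames from several camera streams into shared batches.
    Higher-priority streams are drained first when a batch is formed.
    (Sketch; the priority scheme and all names are assumptions.)"""

    def __init__(self, batch_size=8):
        self.batch_size = batch_size
        self.heap = []                # (neg_priority, seq, stream_id, frame)
        self.seq = itertools.count()  # FIFO tiebreak within a priority level

    def add(self, stream_id, frame, priority=0):
        # Negate priority so the highest priority pops first from the min-heap.
        heapq.heappush(self.heap, (-priority, next(self.seq), stream_id, frame))

    def try_form_batch(self):
        """Return up to batch_size (stream_id, frame) pairs, highest
        priority first, or None if nothing is queued."""
        if not self.heap:
            return None
        batch = []
        while self.heap and len(batch) < self.batch_size:
            _, _, sid, frame = heapq.heappop(self.heap)
            batch.append((sid, frame))
        return batch
```

With this ordering, entrance-camera frames (higher priority) fill the batch before parking-lot frames, but lower-priority frames still ride along whenever spare batch slots remain.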
GPU Memory Management
Pre-allocation: Allocate GPU memory at startup. Avoid runtime allocation that causes fragmentation and unpredictable latency.
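The pre-allocation pattern can be sketched as a fixed pool of reusable buffers created once at startup. This uses plain-Python `bytearray` buffers as a stand-in for device memory; real code would allocate GPU buffers here with whatever runtime it uses (e.g. a CUDA allocator).

```python
from collections import deque

class BufferPool:
    """Fixed pool of buffers allocated once at startup, so the steady-state
    path never allocates. (Plain-Python stand-in for GPU memory; the names
    and backpressure policy are illustrative assumptions.)"""

    def __init__(self, count, size_bytes):
        # All allocation happens here, before any frames are processed.
        self.free = deque(bytearray(size_bytes) for _ in range(count))

    def acquire(self):
        if not self.free:
            # A fixed pool makes exhaustion explicit; the caller must
            # apply backpressure (e.g. drop or delay frames).
            raise RuntimeError("buffer pool exhausted")
        return self.free.popleft()

    def release(self, buf):
        self.free.append(buf)
```

Because the pool size is fixed, memory usage is bounded and predictable, and exhaustion surfaces as explicit backpressure rather than as allocator stalls or fragmentation mid-stream.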
Double buffering: While GPU processes batch N, CPU prepares batch N+1. Hides preprocessing latency by overlapping CPU and GPU work.
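The overlap can be sketched with a single worker thread: while "inference" runs on batch N on the main thread, the worker preprocesses batch N+1. Here `preprocess` and `infer` are hypothetical stand-ins for the real CPU and GPU work.

```python
from concurrent.futures import ThreadPoolExecutor

def pipeline(batches, preprocess, infer):
    """Double-buffered pipeline sketch: while infer(batch N) runs here,
    a worker thread runs preprocess(batch N+1).
    (Illustrative; preprocess/infer stand in for real CPU/GPU stages.)"""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        it = iter(batches)
        first = next(it, None)
        if first is None:
            return results
        pending = pool.submit(preprocess, first)    # prepare batch 0
        for nxt in it:
            ready = pending.result()                # wait for prepared batch N
            pending = pool.submit(preprocess, nxt)  # overlap: prepare N+1
            results.append(infer(ready))            # "GPU" processes batch N
        results.append(infer(pending.result()))     # drain the last batch
    return results
```

If preprocessing takes P ms and inference takes G ms per batch, the overlapped steady-state cost per batch is roughly max(P, G) instead of P + G.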
Throughput vs Latency Tuning
Maximum throughput: Large batches (16-32 frames), long timeouts (50-100ms). Use for offline analysis or low-priority streams.
Minimum latency: Small batches (4-8 frames), short timeouts (10-20ms). Use for real-time alerts or safety-critical applications.
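The two tuning points above can be captured as named scheduler profiles, with a simple bound on per-frame latency: a frame may wait the full timeout for its batch to form, then the whole batch compute time. The names and the compute-time figures are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BatchingProfile:
    """Scheduler knobs for one point on the throughput/latency curve.
    (Illustrative names; not a specific framework's config.)"""
    max_batch: int
    timeout_ms: int

# The two ends of the trade-off described above.
THROUGHPUT = BatchingProfile(max_batch=32, timeout_ms=100)  # offline analysis
LOW_LATENCY = BatchingProfile(max_batch=4, timeout_ms=10)   # real-time alerts

def latency_bound_ms(profile, batch_compute_ms):
    """Worst-case end-to-end latency for one frame: full queue wait
    (the timeout) plus the batch's compute time."""
    return profile.timeout_ms + batch_compute_ms
```

For example, with an assumed 20 ms compute time per batch, the low-latency profile bounds per-frame latency at 10 + 20 = 30 ms, while the throughput profile allows up to 100 ms of queueing alone.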