TensorRT: NVIDIA GPU-Specific Optimization
Key Optimizations
- Layer fusion: combines Conv, BatchNorm, and ReLU into a single kernel.
- Precision calibration: automatically converts FP32 to FP16 or INT8 with minimal accuracy loss.
- Kernel autotuning: profiles multiple implementations per layer and selects the fastest for your specific GPU.
- Memory optimization: reuses tensor buffers, reducing peak memory by 40-60%.

Typical speedups over PyTorch: 3-8x; combined with INT8: 10-20x.
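The layer-fusion arithmetic can be demonstrated without TensorRT: folding BatchNorm's scale and shift into the preceding layer's weights yields a single operation that is numerically identical to the unfused three. A minimal NumPy sketch, treating the conv as a per-channel linear map (all names and shapes here are illustrative, not TensorRT internals):

```python
import numpy as np

# Conv weights for a layer with 4 output channels, modeled as a per-channel
# linear map, plus BatchNorm statistics learned during training.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8))      # conv weight: (out_ch, in_ch)
b = np.zeros(4)                      # conv bias
gamma, beta = rng.standard_normal(4), rng.standard_normal(4)
mean, var, eps = rng.standard_normal(4), rng.random(4) + 0.1, 1e-5

def unfused(x):
    y = x @ w.T + b                                      # conv
    y = gamma * (y - mean) / np.sqrt(var + eps) + beta   # batchnorm
    return np.maximum(y, 0)                              # relu

# Fold BN's affine transform into the conv weights and bias:
# one matmul + one max now replaces three separate passes over memory.
scale = gamma / np.sqrt(var + eps)
w_fused = w * scale[:, None]
b_fused = (b - mean) * scale + beta

def fused(x):
    return np.maximum(x @ w_fused.T + b_fused, 0)

x = rng.standard_normal((2, 8))
assert np.allclose(unfused(x), fused(x))   # identical results
```

The speedup comes from memory traffic, not arithmetic: the fused form reads the activations once instead of three times.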
The Build Process
TensorRT compiles an ONNX model (or a model imported through framework integrations such as TF-TRT and Torch-TensorRT) into a GPU-specific "engine." The engine is not portable; it's optimized for the exact GPU and driver version used during the build. Change GPUs and you must rebuild the engine. Build time ranges from seconds for small models to minutes or hours for large ones. Because the engine contains preselected kernels and memory layouts, inference has near-zero startup overhead.
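The build described above can be sketched with the TensorRT Python API. This is a sketch assuming TensorRT 8.x is installed and an NVIDIA GPU is available; "model.onnx" and "model.engine" are placeholder paths:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# Parse the ONNX model into a TensorRT network definition.
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)   # allow FP16 kernels where accurate

# The slow step: fusion, calibration, and per-layer kernel autotuning
# all happen here, specialized to the GPU this runs on.
engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)
```

At inference time the serialized engine is deserialized and run directly, which is why startup overhead is near zero.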
Limitations
NVIDIA GPUs only. Engine files are GPU-specific and sensitive to the driver and TensorRT version. Not all operations are supported: an unsupported op needs a custom plugin, falls back to the framework when using the TF-TRT or Torch-TensorRT integrations, or fails the build outright. Dynamic shapes require explicit dimension ranges at build time. Debugging is difficult, since the engine is an opaque binary blob.
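The dynamic-shape requirement means each dynamic input must declare a min/opt/max range through an optimization profile before the engine is built. A sketch assuming TensorRT 8.x; the input name "input" and the shapes are placeholders for your model's actual input:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Declare the allowed batch range; kernels are tuned for the "opt" shape.
profile = builder.create_optimization_profile()
profile.set_shape("input",
                  min=(1, 3, 224, 224),    # smallest shape the engine accepts
                  opt=(8, 3, 224, 224),    # shape autotuning optimizes for
                  max=(32, 3, 224, 224))   # largest shape the engine accepts
config.add_optimization_profile(profile)
# ... parse the network and build as usual; inputs outside the
# [min, max] range are rejected at runtime.
```

Shapes far from the "opt" point still run, but with kernels tuned for a different size, so they may be noticeably slower.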