Four Core Patterns of Hardware-Aware Optimization
Pattern 1: Memory-Efficient Architectures
Depthwise separable convolutions replace standard convolutions: instead of C_out filters of size 3×3×C_in, apply one 3×3 filter per input channel (depthwise), then 1×1 convolutions to mix channels (pointwise). For 3×3 kernels this cuts parameters roughly 8-9x, with a proportional reduction in compute. Grouped convolutions split channels into groups processed independently; depthwise convolution is the extreme case with one group per channel. Inverted residuals (expand → depthwise → project) reduce peak activation memory by keeping the skip connections on narrow bottleneck tensors. The MobileNet family combines these ideas; MobileNetV3-Large reaches ImageNet accuracy comparable to much larger models with about 5.4M parameters, versus ResNet-50's roughly 25.6M.
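The 8-9x figure above follows directly from the parameter counts. A minimal sketch (function names are illustrative, biases omitted):

```python
def conv_params(k, c_in, c_out):
    # Standard conv: c_out filters, each of size k x k x c_in.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depthwise: one k x k filter per input channel.
    # Pointwise: a 1x1xc_in filter per output channel.
    return k * k * c_in + c_in * c_out

std = conv_params(3, 128, 128)                  # 147456
sep = depthwise_separable_params(3, 128, 128)   # 1152 + 16384 = 17536
print(f"reduction: {std / sep:.1f}x")           # ~8.4x for this layer
```

The reduction approaches k² (here 9x) as the channel count grows, because the pointwise term comes to dominate the depthwise term.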
Pattern 2: Latency-Optimized Operations
Some operations are cheap in theory but utilize hardware poorly. Batch normalization needs per-batch statistics that serialize computation during training; at inference, fuse BN into the preceding convolution by absorbing the running mean/variance and the learned scale/shift into the conv weights and bias. Avoid operations with irregular memory access: gather, scatter, dynamic indexing. Prefer operations that map to hardware primitives such as matmul and convolution, which run on dedicated tensor cores.
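BN fusion is a closed-form weight rewrite. A minimal NumPy sketch (function name and shapes are illustrative), verified on a 1×1 convolution, which is just a matmul over channels:

```python
import numpy as np

def fuse_conv_bn(W, b, gamma, beta, mean, var, eps=1e-5):
    # Fold BatchNorm into the preceding conv (inference only).
    # W: (c_out, c_in, kh, kw); b and all BN parameters: (c_out,).
    scale = gamma / np.sqrt(var + eps)          # per-output-channel scale
    W_fused = W * scale[:, None, None, None]
    b_fused = (b - mean) * scale + beta
    return W_fused, b_fused

# Sanity check: conv -> BN must equal the single fused conv.
rng = np.random.default_rng(0)
c_out, c_in, n = 4, 3, 10
W = rng.normal(size=(c_out, c_in, 1, 1)); b = rng.normal(size=c_out)
gamma, beta = rng.normal(size=c_out), rng.normal(size=c_out)
mean, var = rng.normal(size=c_out), rng.uniform(0.5, 2.0, size=c_out)
x = rng.normal(size=(c_in, n))

y = W[:, :, 0, 0] @ x + b[:, None]              # conv output
y_bn = (gamma[:, None] * (y - mean[:, None])
        / np.sqrt(var[:, None] + 1e-5) + beta[:, None])

Wf, bf = fuse_conv_bn(W, b, gamma, beta, mean, var)
y_fused = Wf[:, :, 0, 0] @ x + bf[:, None]
```

Frameworks apply this rewrite automatically during inference graph optimization, but the algebra is worth knowing: the fused layer costs exactly one conv, with BN's elementwise work eliminated entirely.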
Pattern 3: Compute-Precision Matching
Modern GPUs have tensor cores optimized for specific precisions. A100 tensor cores support FP16, BF16, TF32, and INT8; GPUs before the Volta generation have no tensor cores at all and run everything on standard CUDA cores. Design models knowing the target precision: channel counts divisible by 8 (FP16) or 16 (INT8) satisfy tensor core alignment requirements; mixed-precision training from the start avoids accuracy loss during quantization; some operations (softmax, layer norm) need FP32 even in otherwise INT8 models.
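Channel alignment can be handled once at architecture-definition time. A minimal helper sketch (the function name is illustrative):

```python
def align_channels(c, multiple=8):
    # Round a channel count up to the nearest multiple so matmul/conv
    # dimensions meet tensor core alignment (e.g. 8 for FP16, 16 for INT8).
    return -(-c // multiple) * multiple  # ceiling division, then scale

print(align_channels(100))      # 104 (FP16-friendly)
print(align_channels(100, 16))  # 112 (INT8-friendly)
```

Applying this to every layer width when building the model is cheaper than discovering at deployment time that odd-sized layers fall off the tensor core fast path.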
Pattern 4: Parallelism-Friendly Design
Sequential dependencies block parallelism. Recurrent layers (LSTM, GRU) process time steps serially. Transformers process all positions in parallel but pay an attention cost quadratic in sequence length. Design for your parallelism budget: replace RNNs with 1D convolutions or Transformers when the hardware supports parallel compute; use local (window-based) attention for long sequences on memory-limited devices.
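Local attention trades the full n×n score matrix for one of size n×(2w+1), reducing memory from O(n²) to O(n·w). A minimal NumPy sketch (function names are illustrative; a production kernel would batch the windows rather than loop):

```python
import numpy as np

def full_attention(q, k, v):
    # Standard scaled dot-product attention; q, k, v: (n, d).
    n, d = q.shape
    s = q @ k.T / np.sqrt(d)                      # (n, n) score matrix
    w = np.exp(s - s.max(axis=1, keepdims=True))  # stable softmax
    w /= w.sum(axis=1, keepdims=True)
    return w @ v

def local_attention(q, k, v, window):
    # Each position i attends only to positions [i-window, i+window],
    # so peak score-matrix memory is O(n * window) instead of O(n^2).
    n, d = q.shape
    out = np.empty_like(v)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        s = q[i] @ k[lo:hi].T / np.sqrt(d)
        w = np.exp(s - s.max())
        w /= w.sum()
        out[i] = w @ v[lo:hi]
    return out
```

When the window covers the whole sequence the two are identical, which makes local attention easy to validate before shrinking the window to fit the device's memory budget.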