ML Model Optimization • Hardware-Aware Optimization
Four Core Patterns of Hardware-Aware Optimization
Hardware-aware optimization follows four recurring patterns that address different stages of the model lifecycle. First, hardware-aware architecture search and compression identify structures that map efficiently to the target device. Practitioners use Neural Architecture Search (NAS) guided by latency lookup tables measured on the actual target hardware, not simulated estimates, and apply pruning, filter decomposition, and distillation to fit within memory and power budgets. This approach replaces trial-and-error tuning of depth and width with a systematic search that optimizes for real device latency rather than parameter count or FLOPs alone.
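To make the latency-lookup-table idea concrete, here is a minimal Python sketch of how a search loop might score candidates against a table of per-op latencies measured on the target device. The op names, table values, and scoring rule are illustrative assumptions, not real measurements.

```python
# Minimal sketch (illustrative values): scoring NAS candidates with a latency
# lookup table measured on the target device instead of FLOP counts.

# Hypothetical per-op latencies in milliseconds, measured once on the device.
LATENCY_LUT_MS = {
    ("conv3x3", 32): 0.18,
    ("conv3x3", 64): 0.35,
    ("conv1x1", 64): 0.09,
    ("mbconv", 96): 0.42,
}

def predicted_latency_ms(architecture):
    """Sum per-layer latencies from device measurements rather than estimating FLOPs."""
    return sum(LATENCY_LUT_MS[(op, channels)] for op, channels in architecture)

def search_score(architecture, accuracy, latency_budget_ms=0.8, penalty=1.0):
    """Reward accuracy and penalize any overshoot of the measured latency budget."""
    overshoot = max(0.0, predicted_latency_ms(architecture) - latency_budget_ms)
    return accuracy - penalty * overshoot

# Two candidates with similar parameter counts can have very different device latency,
# so the slightly more accurate but slower candidate can still lose the search.
candidate_a = [("conv3x3", 64), ("conv1x1", 64), ("mbconv", 96)]
candidate_b = [("conv3x3", 32), ("conv1x1", 64), ("mbconv", 96)]
print(search_score(candidate_a, accuracy=0.78), search_score(candidate_b, accuracy=0.77))
```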
Second, hardware-aware quantization calibrates or trains with the real rounding, saturation, and dataflow of the accelerator. Generic quantization often underestimates accuracy loss because it relies on simplified noise models. Training with the accelerator in the loop, for example running a parallel INT8 path on the device during training and measuring the true quantization noise as the difference between the INT8 and FP32 tensors, significantly improves alignment (see the sketch below).
Third, compilation and runtime alignment reduce redundant memory traffic through layer or kernel fusion, and scheduling picks tile sizes that match cache and tensor core shapes. Google's TPU stack co-designs quantization schemes, per-channel scales, and data layout with the accelerator to keep model quality within 1 percent of FP32 while doubling or tripling throughput.
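A minimal software sketch of the hardware-in-the-loop noise measurement from the second pattern: it simulates a symmetric per-tensor INT8 round-and-saturate path and reports the difference against the FP32 tensors. In a real setup the INT8 path would run on the accelerator itself, and the scaling scheme shown here is an illustrative assumption.

```python
import torch

def int8_round_saturate(x: torch.Tensor) -> torch.Tensor:
    """Simulate a symmetric per-tensor INT8 path: scale, round, saturate, dequantize."""
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127)
    return q * scale

def measured_quant_noise(fp32_tensor: torch.Tensor) -> torch.Tensor:
    """True quantization noise: the INT8-path output minus the FP32-path output."""
    return int8_round_saturate(fp32_tensor) - fp32_tensor

# During training, log this measured noise per layer (or feed it back as a
# regularizer) instead of assuming a generic uniform-noise model.
activations = torch.randn(4, 256)
noise = measured_quant_noise(activations)
print(f"mean |INT8 - FP32| noise: {noise.abs().mean().item():.6f}")
```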
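One simple, concrete instance of the fusion described in the third pattern is folding BatchNorm into the preceding convolution so inference executes one kernel instead of two. This hand-rolled PyTorch sketch only shows the algebra; production compilers such as AITemplate or XLA perform this and far more aggressive kernel fusion automatically.

```python
import torch
import torch.nn as nn

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BatchNorm statistics into the preceding Conv2d for inference, so the
    two layers run as a single kernel and skip one round trip through memory."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    inv_std = torch.rsqrt(bn.running_var + bn.eps)          # 1 / sqrt(var + eps)
    conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    with torch.no_grad():
        fused.weight.copy_(conv.weight * (bn.weight * inv_std).reshape(-1, 1, 1, 1))
        fused.bias.copy_((conv_bias - bn.running_mean) * inv_std * bn.weight + bn.bias)
    return fused

# Quick check: the fused conv matches conv -> bn in eval mode.
conv, bn = nn.Conv2d(3, 8, 3, padding=1, bias=False), nn.BatchNorm2d(8)
bn.running_mean.uniform_(-1, 1); bn.running_var.uniform_(0.5, 2.0)
bn.eval()
x = torch.randn(1, 3, 16, 16)
print(torch.allclose(bn(conv(x)), fold_bn_into_conv(conv, bn)(x), atol=1e-5))
```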
Fourth, adaptive compute activates only a fraction of the network per input. Structured sparsity patterns such as 2:4 (two non-zero values in every four) and token-adaptive routing save compute and energy while preserving quality. Transformers with expert routing activate only the experts needed for a given input, reducing average compute by 40 to 70 percent. However, adaptivity must be capped to protect p99 latency in latency-sensitive services, because inputs that activate more experts can violate tail Service Level Objectives (SLOs).
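To illustrate the 2:4 pattern, here is a minimal PyTorch sketch that derives a two-out-of-four magnitude mask. In practice, vendor tooling generates the mask and the model is fine-tuned afterward to recover accuracy; this sketch only shows how the structural constraint is enforced.

```python
import torch

def two_four_mask(weight: torch.Tensor) -> torch.Tensor:
    """Zero out all but the two largest-magnitude values in every group of four,
    producing the 2:4 structured pattern that sparse tensor cores can accelerate."""
    groups = weight.reshape(-1, 4)                            # blocks of four consecutive weights
    keep = groups.abs().topk(2, dim=1).indices                # two largest magnitudes per block
    mask = torch.zeros_like(groups).scatter_(1, keep, 1.0)    # 1 = keep, 0 = prune
    return (groups * mask).reshape(weight.shape)

w = torch.randn(8, 16)                 # element count must be divisible by 4 for this sketch
w_24 = two_four_mask(w)
nonzeros_per_group = (w_24.reshape(-1, 4) != 0).sum(dim=1)
print(nonzeros_per_group.max().item() <= 2)   # True: at most two non-zeros in every four
```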
💡 Key Takeaways
•Hardware-aware NAS uses latency lookup tables from real device measurements, not FLOPs or parameter count, to guide architecture search and eliminate trial-and-error tuning.
•Hardware-in-the-loop quantization training runs a parallel INT8 path on the actual accelerator and measures the true INT8-minus-FP32 noise during training, recovering accuracy that generic quantization loses.
•Kernel fusion and tile-size scheduling reduce memory traffic by stitching layers into larger kernels; Meta's AITemplate achieves 2 to 12x speedups by matching tensor core shapes.
•Structured sparsity such as 2:4 patterns and token-adaptive expert routing reduces average compute by 40 to 70 percent but requires caps to protect p99 latency from data-dependent variability.
•Google's TPU stack co-designs per-channel quantization scales and data layout to stay within 1 percent of FP32 accuracy while achieving 2 to 3x throughput gains.
📌 Examples
NVIDIA GPUs support 2:4 structured sparsity, where two values in every four are non-zero, accelerating sparse operations at the hardware level
Google TPUs use per-channel quantization scales co-designed with the hardware to maintain model quality within 1% of FP32 at 2x to 3x throughput
A transformer with expert routing activates only the experts needed for each input, saving 40 to 70% of compute but requiring admission control to protect tail latency