What is Hardware Aware Optimization in ML?
Why Hardware Constraints Matter
A model optimized for V100 GPUs may run 3x slower on T4s due to differences in memory bandwidth and tensor core generation. Mobile chips have 10-100x less compute than cloud GPUs. Edge devices often lack FP16 support or have unusual memory hierarchies. Building without hardware awareness leads to models too slow to meet latency SLAs, models too large for device memory, and expensive rewrites when constraints are discovered late.
The Traditional Approach Fails
Traditional workflow: design model → train → optimize for deployment. This fails because architectural choices made during design may be fundamentally incompatible with the target hardware; post-hoc optimizations typically recover only 20-30% of the potential speedup; and constraints surface too late, when changing the architecture is expensive. The hardware-aware approach inverts the order: define hardware constraints first → search for architectures that fit → train within those constraints.
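The constraints-first loop above can be sketched as a simple feasibility filter over candidate architectures. This is an illustrative sketch, not a specific library's API: the names (`HardwareConstraints`, `Candidate`, `feasible`) and the example latency/memory numbers are assumptions for demonstration.

```python
from dataclasses import dataclass

@dataclass
class HardwareConstraints:
    """Constraints defined up front, before any architecture search."""
    max_latency_ms: float
    max_memory_mb: float

@dataclass
class Candidate:
    name: str
    latency_ms: float   # measured or estimated on the *target* device
    memory_mb: float    # weights + peak activation memory

def feasible(c: Candidate, hw: HardwareConstraints) -> bool:
    """Keep only architectures that satisfy every hardware constraint."""
    return c.latency_ms <= hw.max_latency_ms and c.memory_mb <= hw.max_memory_mb

# Hypothetical candidates; in practice these come from an architecture search.
candidates = [
    Candidate("small", latency_ms=6.0, memory_mb=40.0),
    Candidate("medium", latency_ms=12.0, memory_mb=90.0),
    Candidate("large", latency_ms=35.0, memory_mb=400.0),
]
hw = HardwareConstraints(max_latency_ms=10.0, max_memory_mb=100.0)
viable = [c.name for c in candidates if feasible(c, hw)]  # only "small" fits
```

Only candidates that pass the filter are trained, so incompatible designs are rejected before any training cost is spent.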
Key Constraint Types
Latency: maximum inference time (e.g., 10 ms for real-time applications). Throughput: minimum requests served per second. Memory: model weights plus activation memory must fit in device RAM. Power: watts consumed, critical for mobile and edge devices. Cost: cloud inference cost per 1000 requests. Each constraint shapes architectural choices differently.
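Two of these constraints reduce to quick back-of-envelope arithmetic. A minimal sketch, assuming FP32 weights and full instance utilization; the function names and the example figures (25M parameters, $1/hour instance) are illustrative, not from the source.

```python
def model_memory_mb(n_params: int, bytes_per_param: int = 4) -> float:
    """Weight memory in MB: FP32 = 4 bytes/param, FP16 = 2, INT8 = 1."""
    return n_params * bytes_per_param / 1e6

def cost_per_1000_requests(latency_s: float,
                           instance_cost_per_hour: float) -> float:
    """Approximate cloud cost per 1000 requests, assuming the instance
    serves requests back-to-back at full utilization (an idealization)."""
    requests_per_hour = 3600.0 / latency_s
    return instance_cost_per_hour / requests_per_hour * 1000.0

# A hypothetical 25M-parameter model:
fp32_mb = model_memory_mb(25_000_000)      # 100.0 MB of weights
int8_mb = model_memory_mb(25_000_000, 1)   # 25.0 MB after INT8 quantization
```

Activation memory must be added on top of the weight figure, and real serving cost is higher than the idealized estimate because utilization is never 100%.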