Device-Aware Latency Modeling for NAS
Accuracy alone does not determine the best architecture. Production systems have latency budgets, memory limits, and power constraints. Device-aware NAS incorporates these constraints directly into the search.
Why Latency Modeling Matters
Theoretical complexity (FLOPs) poorly predicts real latency. A model with 2x more FLOPs might run only 1.2x slower due to better memory access patterns. Conversely, memory-bound operations like depthwise convolutions have few FLOPs but high latency on some hardware.
Different hardware has different bottlenecks. A GPU is typically compute-bound; a mobile CPU is typically memory-bound. An architecture optimal for a GPU may be terrible on mobile. Device-aware NAS searches for architectures optimized for your specific deployment target.
Latency Modeling Approaches
Lookup tables: Measure the latency of each operation type once on the target hardware, then sum per-operation entries to estimate total latency. Fast but ignores operation interactions and memory effects.
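A lookup-table estimator can be sketched in a few lines. The table keys and latency values below are illustrative placeholders, not real measurements; on a real device each entry would come from profiling that operation at that configuration.

```python
# Hypothetical per-operation latency table (milliseconds), keyed by
# (op type, channels, spatial size). Values are illustrative only --
# a real table is built by profiling each op on the target device.
LATENCY_TABLE_MS = {
    ("conv3x3", 32, 112): 1.8,
    ("conv1x1", 64, 56): 0.4,
    ("dwconv3x3", 64, 56): 0.9,
    ("pool", 64, 56): 0.2,
}

def estimate_latency(architecture):
    """Sum per-op lookup latencies; ignores op interactions and memory effects."""
    return sum(LATENCY_TABLE_MS[op] for op in architecture)

arch = [("conv3x3", 32, 112), ("dwconv3x3", 64, 56), ("conv1x1", 64, 56)]
print(round(estimate_latency(arch), 2))  # 1.8 + 0.9 + 0.4 = 3.1
```

Because the estimate is a plain sum, it is cheap enough to evaluate inside the NAS search loop for every candidate, which is exactly why this approach is popular despite its accuracy limits.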
Learned predictors: Train a neural network to predict latency from architecture description. More accurate (captures interactions) but requires thousands of real measurements to train.
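As a minimal sketch of the learned-predictor idea, the snippet below fits a model from architecture features (here, hypothetical per-op counts) to measured latency. A linear least-squares fit stands in for the neural network, and the "measurements" are synthetic; a production predictor would be trained on thousands of real device measurements.

```python
import numpy as np

# Synthetic setup: features are counts of each op type per architecture,
# e.g. [num_conv3x3, num_dwconv3x3, num_skip]. All values are illustrative.
rng = np.random.default_rng(0)
X = rng.integers(1, 8, size=(200, 3)).astype(float)
true_cost_ms = np.array([1.8, 0.9, 0.1])          # hidden per-op cost
y = X @ true_cost_ms + rng.normal(0, 0.05, 200)   # "measured" latency + noise

# Train the predictor (linear stand-in for the neural latency model).
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Evaluate: mean absolute percentage error of predictions vs. measurements.
pred = X @ w
mape = float(np.mean(np.abs(pred - y) / y) * 100)
print(f"mean abs pct error: {mape:.1f}%")
```

A neural predictor replaces the linear fit when operation interactions matter, but the training and evaluation loop has the same shape: collect (architecture, measured latency) pairs, fit, then check the percentage error.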
Direct measurement: Run each candidate on target device. Most accurate but slowest. Use only for final candidates.
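Direct measurement needs care: the first runs of a model are slowed by caching and initialization, and individual timings are noisy. A minimal benchmarking helper (a generic sketch, not tied to any particular framework) discards warmup runs and reports the median:

```python
import time

def measure_latency_ms(fn, warmup=10, runs=50):
    """Directly time a candidate on the current device.

    Warmup iterations absorb cache/initialization effects; the median over
    many runs is more robust to scheduler noise than the mean.
    """
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times.append((time.perf_counter() - start) * 1000.0)
    times.sort()
    return times[len(times) // 2]  # median, in milliseconds

# Usage with a stand-in workload; a real run would call model inference.
lat = measure_latency_ms(lambda: sum(i * i for i in range(10_000)))
print(f"median latency: {lat:.3f} ms")
```

Even with this care, a full measurement takes orders of magnitude longer than a table lookup, which is why it is reserved for final candidates.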
Key metric: Latency predictor error should be under 10%. Higher error means NAS wastes compute exploring architectures that will not meet constraints.
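The error check itself is simple: compare the predictor's estimate against a real device measurement as a relative error. The numbers below are hypothetical.

```python
def predictor_error_pct(predicted_ms, measured_ms):
    """Relative latency-prediction error, in percent."""
    return abs(predicted_ms - measured_ms) / measured_ms * 100.0

# Hypothetical example: predictor says 11.2 ms, device measures 12.5 ms.
err = predictor_error_pct(11.2, 12.5)
print(f"{err:.1f}%")  # 10.4% -- just above the 10% target
```

In practice this is averaged over a held-out set of architectures; a predictor above the 10% threshold should be retrained or replaced with direct measurement for the affected candidates.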