
Device-Aware Latency Modeling for NAS

Accuracy alone does not determine the best architecture. Production systems have latency budgets, memory limits, and power constraints. Device-aware NAS incorporates these constraints directly into the search.

Why Latency Modeling Matters

Theoretical complexity (FLOPs) poorly predicts real latency. A model with 2x more FLOPs might run only 1.2x slower due to better memory access patterns. Conversely, memory-bound operations like depthwise convolutions have few FLOPs but high latency on some hardware.
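To see this gap on your own hardware, here is a minimal timing sketch, assuming PyTorch on CPU; the layer shapes, iteration counts, and the measure_ms helper are illustrative choices, not a standard benchmark:

```python
import time
import torch
import torch.nn as nn

def measure_ms(module, x, warmup=10, iters=50):
    """Median wall-clock latency of a forward pass, in milliseconds."""
    with torch.no_grad():
        for _ in range(warmup):            # warm up caches and the allocator
            module(x)
        times = []
        for _ in range(iters):
            start = time.perf_counter()
            module(x)
            times.append((time.perf_counter() - start) * 1e3)
    return sorted(times)[len(times) // 2]

x = torch.randn(1, 64, 56, 56)
standard = nn.Conv2d(64, 64, 3, padding=1)              # ~231 MFLOPs per pass
depthwise = nn.Conv2d(64, 64, 3, padding=1, groups=64)  # 64x fewer FLOPs
print(f"standard : {measure_ms(standard, x):.2f} ms")
print(f"depthwise: {measure_ms(depthwise, x):.2f} ms")
# The measured gap is usually far smaller than 64x: the depthwise op is
# memory-bound, so FLOPs alone mispredict its cost.
```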

Different hardware has different bottlenecks. A GPU is compute-bound; a mobile CPU is memory-bound. An architecture optimal for GPU may be terrible on mobile. Device-aware NAS searches for architectures optimized for your specific deployment target.
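One common way to fold a latency constraint directly into the search objective is a soft penalty, as in MnasNet's reward. A minimal sketch (the w = -0.07 exponent follows the MnasNet paper; the candidate numbers are hypothetical):

```python
def latency_aware_reward(accuracy, latency_ms, target_ms, w=-0.07):
    """Soft-constrained objective in the style of MnasNet:
    accuracy * (latency / target) ** w. With w < 0, architectures
    slower than the target are penalized smoothly, not rejected."""
    return accuracy * (latency_ms / target_ms) ** w

# Hypothetical (accuracy, latency-in-ms) candidates and an 80 ms budget:
candidates = [(0.752, 95.0), (0.748, 72.0), (0.760, 140.0)]
best = max(candidates, key=lambda c: latency_aware_reward(c[0], c[1], 80.0))
print(best)  # picks (0.748, 72.0): trades a little accuracy to meet the budget
```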

Latency Modeling Approaches

Lookup tables: Measure the latency of each operation type on the target hardware once, then sum the per-operation latencies for a total estimate. Fast, but ignores interactions between operations and memory effects.
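A lookup-table estimator fits in a few lines. The table below is hypothetical; in practice its values come from on-device benchmarks, keyed by whatever the search space varies (op type, resolution, channels):

```python
# Hypothetical per-op latency table, measured once on the target device (ms).
LATENCY_LUT = {
    ("conv3x3", 56, 64): 1.92,
    ("dwconv3x3", 56, 64): 0.41,
    ("conv1x1", 56, 64): 0.65,
}

def estimate_latency_ms(architecture):
    """Sum per-layer lookups. Ignores fusion, caching, and other
    interactions between adjacent ops, so treat it as a rough estimate."""
    return sum(LATENCY_LUT[layer] for layer in architecture)

arch = [("conv3x3", 56, 64), ("dwconv3x3", 56, 64), ("conv1x1", 56, 64)]
print(f"{estimate_latency_ms(arch):.2f} ms")  # 2.98 ms
```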

Learned predictors: Train a neural network to predict latency from an architecture description. More accurate because it captures operation interactions, but it requires thousands of real on-device measurements to train.
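A sketch of the predictor side, using scikit-learn's MLPRegressor as the network. The encoding scheme and all data below are placeholders; real targets come from on-device measurements:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Assumed setup: each architecture is encoded as one fixed-length feature
# vector (e.g. one-hot op choices plus channel widths per layer).
rng = np.random.default_rng(0)
X = rng.random((2000, 32))          # 2000 architectures, 32-dim encodings
y = 10.0 + 50.0 * rng.random(2000)  # placeholder latencies in ms

predictor = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=1000)
predictor.fit(X[:1800], y[:1800])          # train on measured architectures
predicted = predictor.predict(X[1800:])    # cheap estimates during search
```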

Direct measurement: Run each candidate on the target device. Most accurate but slowest; reserve it for final candidates.
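In practice the approaches are staged into a funnel. A sketch of that workflow, where predict_ms and measure_on_device_ms are assumed callables wrapping a cheap estimator and an on-device benchmark:

```python
def final_ranking(candidates, predict_ms, measure_on_device_ms, top_k=10):
    """Screen the whole pool with the cheap predictor, then spend the
    expensive on-device measurements only on the top-k survivors."""
    screened = sorted(candidates, key=predict_ms)[:top_k]
    return sorted(
        ((arch, measure_on_device_ms(arch)) for arch in screened),
        key=lambda pair: pair[1],
    )
```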

Key metric: Latency predictor error should be under 10%. Higher error means NAS wastes compute exploring architectures that will not meet constraints.
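One standard way to express that error is mean absolute percentage error against held-out on-device measurements. A minimal sketch; the 10% threshold is the rule of thumb above:

```python
import numpy as np

def mape(predicted_ms, measured_ms):
    """Mean absolute percentage error of a latency predictor."""
    predicted_ms = np.asarray(predicted_ms, dtype=float)
    measured_ms = np.asarray(measured_ms, dtype=float)
    return float(np.mean(np.abs(predicted_ms - measured_ms) / measured_ms))

# Usable for search when mape(...) < 0.10 on held-out architectures.
```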

💡 Key Takeaways
FLOPs poorly predict latency: 2x FLOPs may only be 1.2x slower
GPU is compute-bound, mobile CPU is memory-bound: same architecture performs differently
Lookup tables are fast but inaccurate; learned predictors need training data
Latency predictor error should be under 10% for effective search
📌 Interview Tips
1. Explain why FLOPs are a poor proxy for real latency.
2. Describe how to build a latency lookup table for a target device.
3. Discuss the data collection process for training a learned latency predictor.