
Device-Aware Latency Modeling for NAS

Device-aware NAS requires accurate latency prediction for candidate architectures without deploying every model to physical hardware during the search. The standard approach builds a latency lookup table (LUT) by measuring per-operator costs on target devices. For a mobile phone target, teams measure roughly 10,000 operator/shape pairs: convolution kernels (3x3, 5x5, 7x7), depthwise separable convolutions, pooling operations, activations, and batch normalization, across input shapes spanning 14x14 to 224x224 spatial dimensions and 16 to 512 channels. Each configuration runs 30 times with 10 warmup iterations discarded, recording median and 95th-percentile latency to account for thermal throttling, background processes, and operating system scheduling noise.

During search, the system estimates network latency by summing per-operator costs from the LUT, then applying corrections for the operator fusion patterns that modern compilers perform. For example, convolution followed by batch normalization and ReLU typically fuses into a single kernel, reducing latency by 15 to 30 percent compared to naive summation. More advanced approaches use meta-learned regressors such as those in FBNet-style NAS, which train a neural network or gradient boosted tree to predict total latency from architecture features (depth, width, operation mix, memory access patterns), achieving mean absolute percentage error around 8 to 12 percent.

The critical failure mode is measurement noise and device drift. A Google Pixel phone under active thermal management can show latency variance of plus or minus 20 percent for the same model. Production systems mitigate this by pinning inference threads to specific CPU cores, disabling dynamic frequency scaling during measurement, using dedicated measurement devices isolated from user workloads, and maintaining temperature-controlled environments. Device drift across markets creates another challenge: chip bins from different manufacturing batches can vary by 10 to 15 percent in performance. Teams refresh LUTs quarterly and validate on representative device samples from each target market. For distributed measurement, a farm of 32 phones per device type provides statistical confidence, with outlier rejection removing the top and bottom 10 percent of measurements before computing final statistics.
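The LUT-plus-fusion-correction estimator described above can be sketched as follows. This is a minimal illustration, not a production API: the table keys (operator type, spatial size, channels), the FUSION_PATTERNS dictionary, and the 30 percent discount are assumed placeholders, calibrated here to the Conv+BN+ReLU numbers in the Examples section.

```python
from typing import Dict, List, Tuple

# Hypothetical LUT entry: (op_type, spatial_size, channels) -> median latency (ms),
# populated offline from on-device measurements.
LatencyLUT = Dict[Tuple[str, int, int], float]

# Assumed fusion patterns and discounts (the text cites 15 to 30 percent);
# real factors would be calibrated against the target compiler/runtime.
FUSION_PATTERNS = {("conv3x3", "batchnorm", "relu"): 0.30}

def estimate_latency(ops: List[Tuple[str, int, int]], lut: LatencyLUT) -> float:
    """Sum per-operator LUT costs, then subtract fusion savings."""
    total = sum(lut[op] for op in ops)
    op_types = tuple(op[0] for op in ops)
    for pattern, discount in FUSION_PATTERNS.items():
        k = len(pattern)
        for i in range(len(op_types) - k + 1):
            if op_types[i:i + k] == pattern:
                fused_cost = sum(lut[ops[i + j]] for j in range(k))
                total -= discount * fused_cost  # fused kernel replaces the naive sum
    return total

# Using the Conv+BN+ReLU numbers from the Examples section (the shape key is illustrative):
lut = {("conv3x3", 56, 64): 12.3, ("batchnorm", 56, 64): 3.2, ("relu", 56, 64): 1.8}
ops = [("conv3x3", 56, 64), ("batchnorm", 56, 64), ("relu", 56, 64)]
print(f"{estimate_latency(ops, lut):.1f} ms")  # 17.3 ms naive sum -> ~12.1 ms after fusion
```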
💡 Key Takeaways
A latency lookup table (LUT) covers roughly 10,000 operator/shape pairs measured on the target device: 30 runs each with 10 warmup iterations discarded, recording median and 95th percentile (see the measurement sketch after this list)
Operator fusion corrections reduce naive summation by 15 to 30 percent: convolution plus batch normalization plus ReLU typically fuse into a single kernel
Meta-learned regressors, as used in FBNet-style NAS, achieve 8 to 12 percent mean absolute percentage error by training neural networks or gradient boosted trees on architecture features
Measurement noise from thermal throttling and background processes causes plus or minus 20 percent variance; mitigation includes CPU core pinning, disabled dynamic frequency scaling, and temperature-controlled environments
Device drift across chip manufacturing batches varies by 10 to 15 percent; production systems use 32-phone farms per device type with quarterly LUT refreshes and outlier rejection (top and bottom 10%)
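The per-configuration measurement protocol from the takeaways can be sketched as below. The run_inference callable is a placeholder for whatever on-device benchmarking hook a team actually uses; thread pinning, frequency locking, and thermal control happen outside this function. For simplicity, the 10 percent trim is applied per configuration here, whereas the text describes applying it across a 32-phone farm.

```python
import statistics
import time
from typing import Callable, Dict

def measure_operator(run_inference: Callable[[], None],
                     warmup: int = 10,
                     runs: int = 30,
                     trim_frac: float = 0.10) -> Dict[str, float]:
    """Benchmark one operator/shape configuration for a latency LUT entry."""
    # Warmup iterations are discarded so caches, JIT, and clocks settle first.
    for _ in range(warmup):
        run_inference()

    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    # Outlier rejection: drop the top and bottom trim_frac of measurements.
    latencies_ms.sort()
    k = int(len(latencies_ms) * trim_frac)
    trimmed = latencies_ms[k:len(latencies_ms) - k]

    # Median captures typical latency; the 95th percentile captures tail
    # behavior under thermal throttling and scheduler noise.
    p95_index = min(len(trimmed) - 1, round(0.95 * (len(trimmed) - 1)))
    return {"median_ms": statistics.median(trimmed), "p95_ms": trimmed[p95_index]}
```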
📌 Examples
Google Pixel LUT: 10,000 entries covering Conv 3x3 to 7x7, depthwise separable, pooling, activations across 14x14 to 224x224 spatial dimensions and 16 to 512 channels
Fusion optimization: Conv 3x3 (12.3ms) + BN (3.2ms) + ReLU (1.8ms) = 17.3ms naive sum, but fused kernel runs in 12.1ms (30% reduction)
FBNet-style meta-learned latency predictor: trained on 5000 measured architectures, achieves 8 to 12% MAPE, and predicts latency in 50 microseconds versus 30 seconds for on-device measurement (see the predictor sketch below)
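A gradient-boosted-tree latency predictor of the kind cited in this example can be sketched with scikit-learn. The six-dimensional feature vector (standing in for depth, width, operation-mix fractions, and a memory-access proxy) and the synthetic training data below are placeholders, not the actual FBNet feature set or real measurements; with measured architectures, the cited 8 to 12 percent MAPE is the target, and microsecond-scale prediction is what makes such a model usable inside the search loop.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Placeholder dataset standing in for ~5000 architectures with measured latency:
# columns might encode depth, mean width, op-type fractions, and a memory proxy.
rng = np.random.default_rng(0)
X = rng.uniform(size=(5000, 6))
y = 20.0 + 80.0 * X[:, 0] + 10.0 * X[:, 3] + rng.normal(0.0, 2.0, size=5000)

model = GradientBoostingRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
model.fit(X[:4000], y[:4000])

pred = model.predict(X[4000:])
mape = float(np.mean(np.abs(pred - y[4000:]) / y[4000:])) * 100.0
# On real device measurements, 8 to 12% MAPE is the figure cited above;
# the synthetic data here only demonstrates the training/prediction flow.
print(f"Held-out MAPE: {mape:.1f}%")
```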