Implementing Hardware-Aware Optimization: A Systematic Pipeline
The Search Pipeline
1. Define the search space: layer types (convolution, pooling, attention), widths (channel counts), and depths (layer counts).
2. Define the objectives: accuracy on a validation set and latency on the target hardware.
3. Run the search: sample architectures, train each briefly on a proxy task, measure both objectives, and update the search algorithm.

Search methods include reinforcement learning (sample architectures based on predicted reward), evolutionary search (mutate the top performers), and differentiable search (gradient descent on continuous architecture parameters). Expect 100-1,000 GPU-hours for a full search.
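The loop above can be sketched as a minimal random search. Everything here is illustrative: the search space, `proxy_accuracy`, and `measured_latency_ms` are hypothetical stand-ins for brief proxy-task training and on-device measurement, not a real NAS implementation.

```python
import random

# Hypothetical search space: layer types, widths, and depths.
LAYER_TYPES = ["conv", "pool", "attention"]
WIDTHS = [32, 64, 128, 256]
DEPTHS = [4, 8, 12, 16]

def sample_architecture(rng):
    """Sample one candidate from the search space."""
    return {
        "layers": [rng.choice(LAYER_TYPES) for _ in range(rng.choice(DEPTHS))],
        "width": rng.choice(WIDTHS),
    }

def proxy_accuracy(arch, rng):
    """Stand-in for brief proxy-task training; returns a noisy score."""
    return 0.5 + 0.1 * len(arch["layers"]) / 16 + rng.random() * 0.05

def measured_latency_ms(arch):
    """Stand-in for latency measured on the target device."""
    return len(arch["layers"]) * arch["width"] * 0.01

def search(n_samples=100, latency_budget_ms=20.0, seed=0):
    """Random search: keep the most accurate architecture under the budget."""
    rng = random.Random(seed)
    best, best_acc = None, -1.0
    for _ in range(n_samples):
        arch = sample_architecture(rng)
        if measured_latency_ms(arch) > latency_budget_ms:
            continue  # reject candidates that miss the latency target
        acc = proxy_accuracy(arch, rng)
        if acc > best_acc:
            best, best_acc = arch, acc
    return best, best_acc

best, acc = search()
```

A real pipeline replaces random sampling with the RL, evolutionary, or differentiable update step, but the sample-evaluate-update skeleton is the same.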
Hardware-in-the-Loop
The key innovation: measure latency on the actual target hardware during the search, rather than predicting it from FLOPs. A lookup table precomputes the latency of each operation type and size on the target device; during the search, a candidate architecture's latency is the sum of its operations' table entries. This catches hardware-specific effects that FLOP counts miss: memory-bandwidth bottlenecks, kernel-launch overhead, cache behavior. Without hardware-in-the-loop, searched architectures are theoretically efficient but practically slow.
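A minimal sketch of the lookup-table approach, with hypothetical op names and latency values standing in for numbers benchmarked once on the target device:

```python
# Hypothetical lookup table: per-operation latency (ms), precomputed by
# benchmarking each (op type, channel count) pair on the target device.
LATENCY_TABLE_MS = {
    ("conv3x3", 32): 0.8,
    ("conv3x3", 64): 1.9,   # measured, not modeled: not linear in channels
    ("pool", 32): 0.1,
    ("pool", 64): 0.2,
    ("attention", 64): 3.5,
}

def predicted_latency_ms(architecture):
    """Sum precomputed per-op latencies for a candidate architecture."""
    return sum(LATENCY_TABLE_MS[(op, channels)] for op, channels in architecture)

candidate = [("conv3x3", 32), ("pool", 32), ("conv3x3", 64), ("attention", 64)]
total = predicted_latency_ms(candidate)  # 0.8 + 0.1 + 1.9 + 3.5 = 6.3 ms
```

Because the table is built from measurements rather than a FLOP model, effects like the non-linear scaling between the 32- and 64-channel conv entries are captured for free.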
Production Workflow
Start with off-the-shelf efficient architectures (EfficientNet, MobileNet, RegNet) as baselines. If baselines meet requirements, stop. If not, run hardware-aware NAS with those architectures in the search space. Fine-tune the discovered architecture on full training data. Validate on target hardware under production conditions. Budget 2-4 weeks for the full process including NAS, training, and validation.
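The baseline-first gate can be sketched as a simple requirement check. The target numbers and per-model measurements below are hypothetical placeholders for your own validation and on-device profiling results:

```python
# Hypothetical requirements: minimum acceptable accuracy and latency budget.
targets = {"min_accuracy": 0.75, "max_latency_ms": 15.0}

def meets_requirements(measured, targets):
    """True if a model hits the accuracy target within the latency budget."""
    return (measured["accuracy"] >= targets["min_accuracy"]
            and measured["latency_ms"] <= targets["max_latency_ms"])

# Hypothetical measurements for off-the-shelf baselines on the target device.
baselines = {
    "MobileNetV3": {"accuracy": 0.752, "latency_ms": 12.0},
    "EfficientNet-B0": {"accuracy": 0.771, "latency_ms": 21.0},
}

passing = [name for name, m in baselines.items()
           if meets_requirements(m, targets)]
# Here MobileNetV3 passes while EfficientNet-B0 misses the latency budget;
# NAS is only warranted when no baseline passes.
```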