NAS Failure Modes and Production Mitigations
Neural Architecture Search faces several critical failure modes that can invalidate results or cause production incidents. Ranking mismatch from supernet weight sharing is the most common: a candidate that ranks highly under shared weights can underperform when trained from scratch, because weight sharing introduces co-adaptation and biases toward certain operation types. Research shows Kendall tau correlation between supernet and standalone rankings of roughly 0.4 to 0.6, meaning about 20 to 30 percent of pairwise orderings can be flipped. Mitigation requires fair path sampling so that each operation type appears equally often during supernet training, path dropout rates around 0.2 to prevent overfitting to specific paths, and re-ranking the top 10 to 20 candidates with partial independent training of 10 to 30 epochs before final selection.
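As a rough sketch of the re-ranking step, the snippet below assumes a hypothetical `evaluate_standalone` callable that stands in for the 10 to 30 epoch independent training run; the Kendall tau check quantifies how far the supernet ranking can be trusted on the re-evaluated set.

```python
from scipy.stats import kendalltau

def rerank_top_candidates(supernet_scores, evaluate_standalone, top_k=20):
    """Re-rank the top-k supernet candidates with short standalone training.

    supernet_scores: dict mapping architecture id -> accuracy estimated under
    shared weights. evaluate_standalone: callable that trains one architecture
    independently for roughly 10-30 epochs and returns its validation accuracy.
    """
    # Take the top-k candidates by supernet-estimated accuracy.
    top = sorted(supernet_scores, key=supernet_scores.get, reverse=True)[:top_k]

    # Re-evaluate each candidate with partial independent training (the expensive step).
    standalone = {arch: evaluate_standalone(arch) for arch in top}

    # Measure how well the supernet ranking agrees with standalone results.
    tau, _ = kendalltau([supernet_scores[a] for a in top],
                        [standalone[a] for a in top])

    # Final selection uses the standalone numbers, not the supernet estimates.
    best = max(standalone, key=standalone.get)
    return best, standalone, tau
```

In practice `evaluate_standalone` would wrap the existing training pipeline with a reduced epoch budget; a tau well below the 0.4 to 0.6 range on the re-ranked set is a signal that the supernet needs fairer path sampling or more path dropout.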
Proxy overfitting occurs when searching on downsampled images or data subsets produces architectures that fail to scale. A model optimized for 160-pixel images can suffer accuracy drops of 2 to 5 percentage points when trained at 320 pixels because spatial reasoning patterns differ. Progressive evaluation mitigates this: include at least one search stage at the target resolution, maintain consistent augmentation and regularization between search and final training, and validate the top candidates on the full dataset. DARTS-style differentiable search can collapse to trivial operations like skip connections if the search space lacks regularization, with some runs producing networks that are 90 percent skip connections. Add operation-level L2 penalties, limit the skip connection ratio to 30 percent, or stop early when a single operation claims more than 60 percent of the probability mass.
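One way to implement the collapse guard is to monitor the architecture parameters directly. The sketch below assumes a hypothetical OPS list and an `alphas` array of shape (num_edges, num_ops); the 30 percent skip ratio and 60 percent dominance thresholds come from the discussion above, and averaging probability mass across edges is one reasonable reading of "dominates".

```python
import numpy as np

# Hypothetical operation names for a DARTS-style mixed edge.
OPS = ["skip_connect", "sep_conv_3x3", "sep_conv_5x5", "dil_conv_3x3",
       "max_pool_3x3", "avg_pool_3x3", "none"]

def check_collapse(alphas, skip_ratio_limit=0.30, dominance_limit=0.60):
    """Flag DARTS-style collapse from the raw architecture logits.

    alphas: array of shape (num_edges, num_ops). Returns True when the
    search should be stopped early or regularized more aggressively.
    """
    # Softmax over operations on each edge -> probability mass per op.
    exp = np.exp(alphas - alphas.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)

    # Fraction of edges whose argmax operation is a skip connection.
    skip_idx = OPS.index("skip_connect")
    skip_ratio = np.mean(probs.argmax(axis=1) == skip_idx)

    # Does any single operation exceed the dominance threshold on average across edges?
    dominant = probs.mean(axis=0).max() > dominance_limit

    return skip_ratio > skip_ratio_limit or dominant
```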
Hardware measurement creates its own instability. On-device latency varies by plus or minus 20 percent due to thermal throttling, background processes, and OS scheduling. Use dedicated measurement devices with CPU core pinning, disable dynamic frequency scaling, repeat measurements 30 times, discard the top and bottom 10 percent as outliers, and track both the median and the 95th percentile. Refresh latency lookup tables quarterly, as operating system updates and driver changes affect performance. Reward hacking appears when multi-objective weights misalign with business goals: you can meet median latency targets while the 95th percentile or battery drain explodes. Google's Pixel deployment tracks not just median latency but also the 95th percentile, peak memory consumption during inference, battery milliamp hours per 1000 inferences, and crash rates. Finally, reproducibility suffers from nondeterminism in distributed training. Archive data splits with content hashes, use fixed random seeds, enable deterministic CUDA kernels when possible, and retrain the top 3 to 5 candidates multiple times to estimate variance, rejecting candidates whose accuracy standard deviation exceeds 0.5 percentage points.
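A minimal measurement harness along these lines might look like the sketch below, assuming a hypothetical `run_inference` callable that executes a single forward pass on the pinned device; CPU pinning and frequency scaling are OS-level settings outside the snippet, and the warm-up phase is an added assumption that reflects common benchmarking practice.

```python
import time
import numpy as np

def measure_latency(run_inference, warmup=10, runs=30, trim_pct=0.10):
    """Benchmark one architecture with outlier-trimmed repeated runs.

    run_inference: zero-argument callable that executes one forward pass
    on the dedicated measurement device. Returns (median_ms, p95_ms).
    """
    # Warm up caches, compilation, and power state before timing.
    for _ in range(warmup):
        run_inference()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - start) * 1000.0)  # milliseconds

    # Discard the top and bottom 10% as outliers (thermal spikes, scheduling noise).
    samples.sort()
    k = int(len(samples) * trim_pct)
    trimmed = samples[k:len(samples) - k] if k else samples

    # Track both median and tail latency, never just the average.
    return float(np.median(trimmed)), float(np.percentile(trimmed, 95))
```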
💡 Key Takeaways
• Ranking mismatch: Supernet weight sharing produces Kendall tau correlation of 0.4 to 0.6 with standalone training; mitigate with fair path sampling, 0.2 path dropout, and re-ranking the top 10 to 20 candidates with independent 10 to 30 epoch training
• Proxy overfitting: Models optimized at 160 pixels drop 2 to 5 percentage points of accuracy at 320 pixels; require at least one search stage at target resolution with consistent augmentation
• DARTS collapse: Differentiable search can produce 90% skip connections without regularization; add operation-level L2 penalties and stop early when any operation exceeds 60% probability
• Hardware measurement instability: Device latency varies plus or minus 20%; use 30 measurement runs with top and bottom 10% outlier rejection, CPU pinning, disabled frequency scaling, and quarterly LUT refreshes
• Reward hacking: Median latency targets can be met while the 95th percentile explodes; track the full distribution including p95 latency, peak memory, battery milliamp hours per 1000 inferences, and crash rates
📌 Examples
Supernet ranking error: A candidate ranks 3rd under shared weights with a 76.2% accuracy estimate, but achieves only 73.8% when trained standalone (ranking 15th overall)
Resolution scaling failure: Architecture optimized at 160 pixels achieves 74.5% accuracy, but at 320 pixels drops to 69.8% due to different spatial reasoning requirements
Google Pixel deployment metrics: Track median latency 78 ms and p95 115 ms, peak memory 180 MB, battery 45 mAh per 1000 inferences, and crash rate under 0.01%