HPO Failure Modes and Production Mitigations
Validation Overfitting
Validation overfitting occurs when repeated evaluation against the same validation set artificially inflates metrics. A hyperparameter search might try 200 configurations, all evaluated on the same 10,000 examples, effectively performing multiple statistical tests without correction. The symptom is a model that looks 2 to 5 percent better on validation metrics yet drops sharply on a held-out test set. Mitigations include nested validation where each trial uses a different validation fold, maintaining a final unseen test set, and repeated K-fold cross-validation.
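The holdout-plus-rotating-fold idea can be sketched in plain Python. This is a minimal illustration, not a production splitter; the function names (`split_holdout`, `fold_for_trial`) and the 10 percent test fraction are assumptions for the example:

```python
import random

def split_holdout(n_examples, test_frac=0.1, seed=0):
    """Carve out a final test set that no HPO trial ever sees."""
    rng = random.Random(seed)
    idx = list(range(n_examples))
    rng.shuffle(idx)
    n_test = int(n_examples * test_frac)
    # (development indices for HPO, untouched test indices)
    return idx[n_test:], idx[:n_test]

def fold_for_trial(dev_idx, trial_id, k=5):
    """Rotate the validation fold across trials so no single fold is
    reused by every configuration, which limits validation overfitting."""
    fold = trial_id % k
    fold_size = len(dev_idx) // k
    lo, hi = fold * fold_size, (fold + 1) * fold_size
    val = dev_idx[lo:hi]
    train = dev_idx[:lo] + dev_idx[hi:]
    return train, val
```

Each trial still trains on most of the development data; only the fold it reports its metric on rotates, so 200 trials spread their multiple testing across k folds instead of hammering one.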
Noisy Objectives
Noisy or nonstationary objectives break the assumptions behind both Bayesian optimization (BO) and early stopping. Stochastic gradient descent, data shuffling, and hardware nondeterminism together produce 2 to 5 percent metric variance between identical configurations. Early stopping can prune late bloomers whose validation loss oscillates early but converges better after 50 percent of the budget, and BO surrogates overfit the noise, predicting improvements that never materialize. Mitigations include quantile-based promotion in ASHA, replicating promising configurations 2 to 3 times and averaging the results, and smoothing metrics over a window.
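These three mitigations are small enough to sketch directly. The sketch below is illustrative, not any particular library's API: `train_fn` is a hypothetical training callable, lower scores are assumed to be better (a loss), and the window and replicate counts are example values:

```python
import statistics

def smoothed_metric(history, window=5):
    """Median over the last `window` evaluations, damping run-to-run noise
    before any pruning decision looks at the value."""
    return statistics.median(history[-window:])

def replicated_score(train_fn, config, n_replicates=3):
    """Re-run a promising config under different seeds and average, so the
    surrogate model fits the signal rather than a single noisy draw."""
    scores = [train_fn(config, seed=s) for s in range(n_replicates)]
    return sum(scores) / len(scores)

def promote_quantile(rung_results, quantile=0.25):
    """ASHA-style rung promotion: keep the best `quantile` fraction of
    trials (lowest score = best) instead of a fixed top-1 comparison."""
    ranked = sorted(rung_results, key=lambda r: r["score"])
    keep = max(1, int(len(ranked) * quantile))
    return ranked[:keep]
```

Feeding `promote_quantile` smoothed or replicated scores, rather than single raw readings, is what makes the promotion decision robust to the 2 to 5 percent variance described above.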
Runtime Skew
Runtime skew from heterogeneous resources causes unfair pruning. Larger models or slower GPU types run behind schedule and get pruned in time-based rungs despite having better loss trajectories. A trial on a V100 GPU might reach step 5000 in 2 hours while the same configuration on an A100 reaches it in 1 hour; time-based early stopping would prune the V100 trial unfairly.
Mitigations
Define fidelity by fixed steps or epochs rather than wall-clock time, normalize progress by tokens seen or gradient steps, record hardware metadata so comparisons stay like-with-like, and, when hardware cost matters, explicitly optimize the objective per unit time.
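A step-based promotion gate can be sketched as follows. The `TrialState` record, the field names, and the grouping-by-GPU-type helper are assumptions made for this example, not a specific scheduler's interface:

```python
from dataclasses import dataclass

@dataclass
class TrialState:
    config_id: str
    gpu_type: str    # hardware metadata, recorded so comparisons stay like-with-like
    steps_done: int  # fidelity measured in gradient steps, not wall-clock hours
    val_loss: float

def ready_for_rung(trial: TrialState, rung_steps: int) -> bool:
    """Gate promotion on steps completed: a V100 trial that needs 2 hours to
    reach step 5000 competes at the same fidelity as an A100 that needs 1 hour."""
    return trial.steps_done >= rung_steps

def rank_within_hardware(trials, rung_steps):
    """Rank only trials that reached the rung's fidelity, grouped by GPU type,
    so pruning never compares trials across unequal progress or hardware."""
    groups = {}
    for t in trials:
        if ready_for_rung(t, rung_steps):
            groups.setdefault(t.gpu_type, []).append(t)
    return {gpu: sorted(ts, key=lambda t: t.val_loss) for gpu, ts in groups.items()}
```

The key design choice is that wall-clock time never appears in the promotion predicate; it only enters later, if at all, through an explicit objective-per-unit-time term.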