Bayesian Optimization vs ASHA: When to Use Each
When to Use Bayesian Optimization
Bayesian Optimization excels when each evaluation is expensive (minutes to hours), the search space is modest (50 or fewer effective dimensions), and you can run parallel batches of 8 to 64 trials. BO builds a surrogate model (commonly a Gaussian process or a random forest) that learns which regions of hyperparameter space are promising, then uses an acquisition function to balance exploration and exploitation. Meta's Ax commonly seeds with 20 to 50 Sobol quasi-random points, then iterates in batches of 8 to 64. The limitation is that BO struggles above 64 parallel workers, because batch acquisition quality degrades without sophisticated penalization.
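The surrogate-plus-acquisition loop can be sketched in a few dozen lines. The code below is an illustrative toy, not Ax's implementation: a one-dimensional Gaussian-process surrogate with an RBF kernel and an upper-confidence-bound acquisition function. The objective, kernel length scale, candidate grid, and beta value are all arbitrary choices for the demo.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    # Squared-exponential kernel between two 1-D arrays of points.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale ** 2)

def gp_posterior(X, y, X_star, noise=1e-4):
    # Posterior mean and standard deviation of a zero-mean GP at X_star.
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    K_star = rbf_kernel(X, X_star)
    alpha = np.linalg.solve(K, y)
    mu = K_star.T @ alpha
    v = np.linalg.solve(K, K_star)
    # diag of K(X*,X*) - K*^T K^-1 K*; k(x,x) = 1 for this RBF kernel.
    var = 1.0 - np.sum(K_star * v, axis=0)
    return mu, np.sqrt(np.clip(var, 0.0, None))

def bayes_opt(objective, n_init=5, n_iter=15, beta=2.0, seed=0):
    # Minimal BO loop: fit surrogate, maximize UCB acquisition, evaluate, repeat.
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, n_init)             # random initial design
    y = np.array([objective(x) for x in X])
    grid = np.linspace(0.0, 1.0, 201)             # dense candidate pool
    for _ in range(n_iter):
        mu, sd = gp_posterior(X, y, grid)
        x_next = grid[np.argmax(mu + beta * sd)]  # explore/exploit trade-off
        X = np.append(X, x_next)
        y = np.append(y, objective(x_next))
    best = np.argmax(y)
    return X[best], y[best]
```

A production system would replace the fixed grid with a proper acquisition optimizer and add batch penalization for parallel suggestions, which is exactly where the worker-count limit above comes from.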
When to Use ASHA
ASHA works best when you can define a meaningful fidelity axis like epochs, gradient steps, or data fraction, and you need to scale to hundreds or thousands of workers. ASHA allocates small budgets to many configs and promotes only the top 20 to 30 percent at each rung based on intermediate metrics. Production deployments commonly see 70 to 95 percent of trials pruned after consuming 10 to 30 percent of their full budget, cutting costs by 60 to 70 percent. ASHA achieves near-linear wall-clock speedup with concurrency and maintains over 80 percent GPU utilization.
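The promotion rule can be sketched as a synchronous successive-halving loop. Real ASHA promotes asynchronously so workers never idle; the `train_eval` callback, rung budgets, and keep fraction below are hypothetical placeholders.

```python
def asha_sketch(configs, train_eval, rungs=(1, 3, 9), keep=0.25):
    # Synchronous successive-halving core of ASHA: evaluate every surviving
    # config at the current rung's budget, keep only the top fraction,
    # and promote the survivors to the next (larger) budget.
    survivors = list(configs)
    for budget in rungs:
        scored = [(train_eval(cfg, budget), cfg) for cfg in survivors]
        scored.sort(key=lambda t: t[0], reverse=True)  # higher metric is better
        n_keep = max(1, int(len(scored) * keep))
        survivors = [cfg for _, cfg in scored[:n_keep]]
    return survivors[0]
```

With `keep=0.25`, each rung promotes the top quarter, matching the 20 to 30 percent promotion rate described above; most configs are discarded after only the cheapest rung.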
Population Based Training
For nonstationary settings where optimal hyperparameters change during training (learning rate schedules, data augmentation intensity), Population Based Training (PBT) offers continuous adaptation. PBT co-trains a population of 20 to 80 models, periodically copying weights from top performers and perturbing hyperparameters. DeepMind reported 1.5 to 3 times wall-clock speedup on reinforcement learning and language modeling.
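One exploit/explore step of PBT can be sketched as follows. This is an illustrative toy, not DeepMind's implementation: the worker dict layout, quantile cutoff, and perturbation factors are all assumptions.

```python
import random

def pbt_step(population, exploit_frac=0.2, perturb=(0.8, 1.2), rng=random):
    # One PBT exploit/explore step: workers in the bottom quantile copy the
    # weights of a random top-quantile worker (exploit), then multiplicatively
    # perturb the copied hyperparameters (explore). Mutates in place.
    ranked = sorted(population, key=lambda w: w["score"], reverse=True)
    cut = max(1, int(len(ranked) * exploit_frac))
    top, bottom = ranked[:cut], ranked[-cut:]
    for worker in bottom:
        donor = rng.choice(top)
        worker["weights"] = dict(donor["weights"])        # exploit: copy weights
        worker["lr"] = donor["lr"] * rng.choice(perturb)  # explore: perturb hyperparam
    return population
```

Because the population shares one training run, hyperparameters like the learning rate effectively acquire a schedule that adapts online, which is what makes PBT suit the nonstationary settings above.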
ASHA Trade-off
The trade-off with ASHA is that it can prune late bloomers: when the fidelity proxy (performance at 10 percent of the full budget) correlates weakly with final performance, pruning becomes unreliable, with a correlation under 0.6 as the typical warning threshold.
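A cheap sanity check before enabling aggressive pruning is to train a small pilot batch of configs to full budget and measure the rank correlation between their early and final scores. A minimal Spearman computation (assuming no tied scores; real pipelines would use a statistics library that handles ties) might look like:

```python
def spearman(xs, ys):
    # Spearman rank correlation for lists without ties: rank both
    # sequences, then compute the Pearson correlation of the ranks.
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Feeding in the 10-percent-budget scores as `xs` and the full-budget scores as `ys` gives a direct read on whether the 0.6 threshold above is met; if not, raise the minimum rung budget before trusting ASHA's pruning.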