
Bayesian Optimization vs ASHA: When to Use Each

When to Use Bayesian Optimization

Bayesian Optimization excels when each evaluation is expensive (minutes to hours), search spaces are modest (roughly 50 or fewer effective dimensions), and you can run parallel batches of 8 to 64 trials. BO builds a surrogate model (commonly a Gaussian Process or random forest) that learns which regions of hyperparameter space are promising, then uses acquisition functions to balance exploration and exploitation. Meta's Ax commonly seeds with 20 to 50 Sobol quasi-random points, then iterates with batches of 8 to 64. The limitation is that BO struggles above 64 parallel workers because batch acquisition quality degrades without sophisticated penalization.
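The surrogate-plus-acquisition loop can be sketched with a tiny Gaussian Process and expected improvement in pure NumPy. Everything here is an illustrative stand-in, not Ax's internals: the 1-D objective, length scale, random (rather than Sobol) seeding, and grid-based candidate pool are all assumptions for the sketch.

```python
import math
import numpy as np

def rbf(a, b, length_scale=0.3):
    # Squared-exponential kernel on 1-D inputs.
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(X, y, Xq, noise=1e-6):
    # Standard GP regression posterior mean/variance at query points Xq.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xq)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = 1.0 - np.sum(v * v, axis=0)  # prior diagonal is 1 for this kernel
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, best):
    # EI for maximization: E[max(f - best, 0)] under the GP posterior.
    sigma = np.sqrt(var)
    z = (mu - best) / sigma
    cdf = np.vectorize(lambda t: 0.5 * (1 + math.erf(t / math.sqrt(2))))(z)
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
    return (mu - best) * cdf + sigma * pdf

def objective(x):
    # Hypothetical validation accuracy as a function of one scaled hyperparameter.
    return math.exp(-(x - 0.65) ** 2 / 0.02)

rng = np.random.default_rng(0)
X = rng.random(5)                         # stand-in for Sobol seeding
y = np.array([objective(x) for x in X])
grid = np.linspace(0.0, 1.0, 201)         # candidate pool
for _ in range(15):                       # sequential BO iterations
    mu, var = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(expected_improvement(mu, var, y.max()))]
    X, y = np.append(X, x_next), np.append(y, objective(x_next))
print(X[np.argmax(y)])                    # lands near the true optimum at 0.65
```

Batch variants (as in Ax) would propose several points per iteration, penalizing the acquisition function around pending points, which is exactly where quality degrades at high worker counts.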

When to Use ASHA

ASHA works best when you can define a meaningful fidelity axis like epochs, gradient steps, or data fraction, and you need to scale to hundreds or thousands of workers. ASHA allocates small budgets to many configs and promotes only the top 20 to 30 percent at each rung based on intermediate metrics. Production deployments commonly see 70 to 95 percent of trials pruned after consuming 10 to 30 percent of their full budget, cutting costs by 60 to 70 percent. ASHA achieves near-linear wall-clock speedup with concurrency and maintains over 80 percent GPU utilization.
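The promotion arithmetic can be seen in a small simulation. This sketch uses synchronous rungs for clarity (real ASHA promotes asynchronously as results arrive), and the learning-curve model, reduction factor, and rung budgets are illustrative assumptions.

```python
import random

def run_asha(n_configs=64, eta=4, rungs=(1, 4, 16), seed=0):
    # Toy synchronous-rung successive halving: evaluate every surviving config
    # at each fidelity (epochs), keep the top 1/eta, and track total epochs spent.
    rng = random.Random(seed)
    quality = {c: rng.random() for c in range(n_configs)}  # latent "true" quality

    def score(c, budget):
        # Hypothetical learning curve: approaches `quality` as budget grows.
        return quality[c] * (1 - 2.718 ** (-budget / 5)) + rng.gauss(0, 0.02)

    survivors, cost = list(range(n_configs)), 0
    for budget in rungs:
        results = sorted(((score(c, budget), c) for c in survivors), reverse=True)
        cost += budget * len(survivors)
        survivors = [c for _, c in results[: max(1, len(results) // eta)]]
    return survivors, cost

survivors, cost = run_asha()
print(len(survivors), cost, 64 * 16)  # 1 finalist; 192 epochs vs 1024 for full runs
```

With these numbers, 63 of 64 configs are pruned after at most a quarter of the full budget, which is the mechanism behind the 60 to 70 percent cost reductions cited above.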

Population Based Training

For nonstationary settings where optimal hyperparameters change during training (learning rate schedules, data augmentation intensity), Population Based Training (PBT) offers continuous adaptation. PBT co-trains a population of 20 to 80 models, periodically copying weights from top performers and perturbing hyperparameters. DeepMind reported 1.5 to 3 times wall-clock speedup on reinforcement learning and language modeling tasks.
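The exploit/explore cycle can be sketched as follows. The truncation quantile, the 0.8/1.2 perturbation factors, and the population layout are common illustrative choices, not DeepMind's exact recipe.

```python
import random

def pbt_step(population, rng, truncation=0.25, perturb=(0.8, 1.2)):
    # Exploit: bottom-quantile members copy the checkpoint and hyperparameters
    # of a random top-quantile member. Explore: perturb the copied hyperparams.
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    k = max(1, int(len(ranked) * truncation))
    top, bottom = ranked[:k], ranked[-k:]
    for member in bottom:
        donor = rng.choice(top)
        member["weights"] = dict(donor["weights"])        # copy checkpoint
        member["lr"] = donor["lr"] * rng.choice(perturb)  # perturb learning rate
    return population

rng = random.Random(0)
population = [
    {"weights": {"step": i}, "lr": 10 ** rng.uniform(-4, -1), "score": rng.random()}
    for i in range(8)
]
pbt_step(population, rng)
```

In a real system, each member trains for a fixed interval between `pbt_step` calls, so schedules such as learning rate decay emerge from repeated perturbations rather than being fixed up front.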

ASHA Trade-off

The trade-off with ASHA is that it can prune late bloomers: if the fidelity proxy (performance at 10 percent of budget) correlates weakly with final performance (rank correlation under 0.6), pruning decisions become unreliable.

💡 Key Takeaways
Use Bayesian Optimization when trials cost 30 minutes to 8 hours each and you run 8 to 64 concurrent workers; expect 3 to 10x fewer full-budget trials than random search for similar quality
Choose ASHA when you can define a fidelity axis (epochs, steps, tokens) and need to scale to 100+ workers; ASHA prunes 70 to 95% of configs after 10 to 30% of budget and achieves 80%+ GPU utilization
Bayesian Optimization batch quality degrades above 64 parallel workers unless you use trust regions or strong penalization around pending points; large batches revert toward random exploration
ASHA requires the fidelity proxy (performance at 10% of budget) to correlate above 0.6 with final performance; weak correlation causes pruning of late bloomers that start slow but converge best
Population Based Training suits nonstationary problems where optimal hyperparameters shift during training (learning rate decay, augmentation schedules); it achieves 1.5 to 3x wall-clock speedup but requires the full population to run simultaneously
Multi-objective constraints (maximize accuracy subject to latency under 100 milliseconds or memory under 4 gigabytes) require constrained Bayesian Optimization or Pareto methods; expect to need 50 to 100 initial seeds to model the feasible region
📌 Interview Tips
1. Meta's Ax uses batch Bayesian Optimization with 8 to 64 concurrent trials for ranking models where a single trial takes 2 to 8 GPU-hours; a 100-trial campaign completes in under 24 hours on a 256-GPU pool with early stopping
2. Google Vizier combines Bayesian Optimization with median stopping to prune 60 to 90% of trials; for expensive deep models, this reduces full-fidelity evaluations by 3 to 10x versus a random baseline
3. Capital One parallelized GAN training across hundreds of trials using a bandit-based scheduler, reducing tuning from weeks to under 1 day and achieving a 30% higher success rate than manual tuning