Training Infrastructure & Pipelines • Hyperparameter Optimization at Scale
Bayesian Optimization vs ASHA: When to Use Each
Bayesian Optimization (BO) excels when each evaluation is expensive (minutes to hours), the search space is modestly sized (roughly 50 or fewer effective dimensions), and you can run parallel batches of 8 to 64 trials. BO builds a surrogate model (commonly a Gaussian process or random forest) that learns which regions of hyperparameter space are promising, then uses an acquisition function to balance exploration and exploitation. In practice, BO requires 3 to 10 times fewer full-budget trials than random search to reach similar quality. Meta's Ax commonly seeds with 20 to 50 Sobol quasi-random points, then iterates in batches of 8 to 64 using q-Expected Improvement (qEI). The limitation is that BO struggles above 64 parallel workers: batch acquisition quality degrades without sophisticated penalization or trust-region methods.
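The core BO loop can be sketched end to end. Below is a minimal, numpy-only 1-D example with a Gaussian-process surrogate and Expected Improvement; it is sequential rather than batched qEI, and the toy objective standing in for a validation loss is invented for illustration:

```python
# Minimal, numpy-only Bayesian optimization sketch: a Gaussian-process
# surrogate plus Expected Improvement on a hypothetical 1-D "validation
# loss" (sequential, not batched qEI; illustrative, not a production tuner).
import numpy as np
from math import erf

_erf = np.vectorize(erf)

def objective(x):
    # Made-up objective we want to minimize.
    return np.sin(3 * x) + 0.5 * x ** 2

def rbf_kernel(a, b, length=0.5):
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-6):
    # Standard GP regression equations with a zero prior mean.
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf_kernel(x_train, x_test)
    K_inv = np.linalg.inv(K)
    mu = Ks.T @ K_inv @ y_train
    var = np.diag(rbf_kernel(x_test, x_test) - Ks.T @ K_inv @ Ks)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    # EI for minimization: expected amount by which we beat the incumbent.
    z = (best - mu) / sigma
    cdf = 0.5 * (1 + _erf(z / np.sqrt(2)))
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return (best - mu) * cdf + sigma * pdf

rng = np.random.default_rng(0)
x_obs = rng.uniform(-2, 2, size=5)   # random seeding (Sobol in real tools)
y_obs = objective(x_obs)
grid = np.linspace(-2, 2, 200)

for _ in range(15):                  # sequential BO iterations
    mu, sigma = gp_posterior(x_obs, y_obs, grid)
    x_next = grid[np.argmax(expected_improvement(mu, sigma, y_obs.min()))]
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))

print(f"best x = {x_obs[np.argmin(y_obs)]:.3f}, loss = {y_obs.min():.3f}")
```

Each iteration refits the surrogate and evaluates wherever EI is highest, which is exactly the explore/exploit trade the paragraph describes; production systems like Ax replace the grid argmax with gradient-based acquisition optimization and batched qEI.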
ASHA (Asynchronous Successive Halving Algorithm) works best when you can define a meaningful fidelity axis such as epochs, gradient steps, or data fraction, and you need to scale to hundreds or thousands of workers. ASHA allocates small budgets to many configs and promotes only the top 20 to 30% at each rung based on intermediate metrics. Production deployments commonly see 70 to 95% of trials pruned after consuming 10 to 30% of their full budget, which cuts costs by 60 to 70%. ASHA achieves near-linear wall-clock speedup with concurrency and maintains over 80% GPU utilization even with stragglers and spot preemptions, because workers never wait synchronously. The tradeoff is that ASHA can prune late bloomers if the fidelity proxy (performance at 10% budget) correlates weakly with final performance; correlation under 0.6 makes pruning unreliable.
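ASHA's defining trick, asynchronous rung promotion, can be sketched as follows. Trial scores are simulated, and the reduction factor eta = 3 (promote roughly the top third) and four rungs are assumed settings, not values from the text:

```python
# Sketch of ASHA's asynchronous promotion rule: a config at rung r is
# promoted the moment it ranks in the top 1/eta of results *completed so
# far* at that rung -- no synchronization barrier. Scores are simulated;
# eta = 3 and 4 rungs are assumed settings.
import random

ETA = 3                      # promote roughly the top third at each rung
MIN_BUDGET, MAX_RUNG = 1, 3  # rung budgets: 1, 3, 9, 27 epochs

def budget(rung):
    return MIN_BUDGET * ETA ** rung

rung_results = {r: [] for r in range(MAX_RUNG + 1)}  # rung -> [(score, id)]

def report(config_id, rung, score):
    """Record an intermediate result; True means promote to the next rung."""
    rung_results[rung].append((score, config_id))
    if rung == MAX_RUNG:
        return False         # final rung: nothing to promote to
    ranked = sorted(rung_results[rung], reverse=True)   # higher = better
    top_k = ranked[: max(1, len(ranked) // ETA)]
    return any(cid == config_id for _, cid in top_k)

random.seed(0)
for cid in range(81):        # 81 configs enter rung 0
    rung, skill = 0, random.random()
    while True:
        # Simulated metric: true "skill" plus noise that shrinks with budget,
        # mimicking a low-fidelity proxy that sharpens as training continues.
        score = skill + random.gauss(0, 0.3 / budget(rung))
        if not report(cid, rung, score):
            break            # pruned here (or finished the final rung)
        rung += 1

print([len(rung_results[r]) for r in range(MAX_RUNG + 1)])
```

Note how early arrivals promote optimistically (the first config is always top-1 of an empty rung); real implementations accept a few such mistaken promotions in exchange for never blocking workers.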
For nonstationary settings where optimal hyperparameters change during training (learning rate schedules, data augmentation intensity), Population-Based Training (PBT) offers continuous adaptation. PBT co-trains a population of 20 to 80 models, periodically copying weights from top performers and perturbing hyperparameters every 1 to 5 epochs. DeepMind reported 1.5 to 3 times wall-clock speedup on reinforcement learning and language modeling. PBT trades statistical sample efficiency for fast wall-clock progress and requires more compute headroom, since the full population trains simultaneously.
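A toy sketch of PBT's exploit/explore cycle: a scalar score stands in for model weights, and the reward model peaking near lr = 1e-2 is invented for illustration:

```python
# Toy sketch of PBT's exploit/explore cycle: bottom performers copy a top
# performer's state (a scalar score stands in for model weights) and perturb
# its hyperparameters. The reward model peaking near lr = 1e-2 is invented.
import math
import random

random.seed(1)
POP, EPOCHS = 8, 20

population = [{"lr": 10 ** random.uniform(-4, -1), "perf": 0.0}
              for _ in range(POP)]

def train_step(member):
    # Hypothetical reward: progress is fastest when log10(lr) is near -2.
    member["perf"] += max(0.0, 1 - abs(math.log10(member["lr"]) + 2))

for epoch in range(EPOCHS):
    for member in population:
        train_step(member)
    ranked = sorted(population, key=lambda m: m["perf"], reverse=True)
    top, bottom = ranked[:POP // 4], ranked[-(POP // 4):]
    for loser in bottom:
        winner = random.choice(top)
        loser["perf"] = winner["perf"]                  # exploit: copy weights
        loser["lr"] = winner["lr"] * random.choice([0.8, 1.2])  # explore

best = max(population, key=lambda m: m["perf"])
print(f"best lr ≈ {best['lr']:.4f}")
```

The population drifts toward the well-performing learning-rate region without any outer search loop, which is the continuous adaptation the paragraph describes; real PBT perturbs a full schedule of hyperparameters and checkpoints actual weights.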
💡 Key Takeaways
• Use Bayesian Optimization when trials cost 30 minutes to 8 hours each and you run 8 to 64 concurrent workers; expect 3 to 10x fewer full-budget trials than random search for similar quality
• Choose ASHA when you can define a fidelity axis (epochs, steps, tokens) and need to scale to 100+ workers; ASHA prunes 70 to 95% of configs after 10 to 30% of budget and achieves 80%+ GPU utilization
• Bayesian Optimization batch quality degrades above 64 parallel workers unless you use trust regions or strong penalization around pending points; large batches revert toward random exploration
• ASHA requires the fidelity proxy (performance at 10% budget) to correlate above 0.6 with final performance; weak correlation causes pruning of late bloomers that start slow but converge best
• Population-Based Training suits nonstationary problems where optimal hyperparameters shift during training (learning rate decay, augmentation schedules); it achieves 1.5 to 3x wall-clock speedup but requires the full population to run simultaneously
• Multi-objective constraints (maximize accuracy subject to latency under 100 milliseconds or memory under 4 gigabytes) require constrained Bayesian Optimization or Pareto methods; expect to need 50 to 100 initial seeds to model feasible regions
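The multi-objective takeaway above can be illustrated with a small sketch: apply a hard latency constraint, then extract the Pareto front over (accuracy, latency). The trial results are invented for illustration:

```python
# Sketch of the constrained multi-objective step: apply a hard latency
# constraint, then keep the Pareto-optimal trade-offs between accuracy
# (maximize) and latency (minimize). Trial results are invented.

trials = [  # (accuracy, latency_ms) -- hypothetical completed trials
    (0.91, 80), (0.93, 120), (0.89, 60), (0.92, 95), (0.90, 70), (0.94, 150),
]

feasible = [t for t in trials if t[1] < 100]   # latency under 100 ms

def pareto_front(points):
    """Keep points no other point dominates (>= accuracy AND <= latency)."""
    front = []
    for acc, lat in points:
        dominated = any(a >= acc and l <= lat and (a, l) != (acc, lat)
                        for a, l in points)
        if not dominated:
            front.append((acc, lat))
    return sorted(front)

print(pareto_front(feasible))
# → [(0.89, 60), (0.9, 70), (0.91, 80), (0.92, 95)]
```

Constrained BO wraps this same filter into the acquisition step, modeling the probability that a candidate is feasible before spending a trial on it; hence the 50 to 100 seeds needed just to learn the feasible region.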
📌 Examples
Meta's Ax uses batch Bayesian Optimization with 8 to 64 concurrent trials for ranking models where a single trial takes 2 to 8 GPU-hours; a 100-trial campaign completes in under 24 hours on a 256-GPU pool with early stopping
Google Vizier combines Bayesian Optimization with median stopping to prune 60 to 90% of trials; for expensive deep models, this reduces full-fidelity evaluations by 3 to 10x versus a random baseline
Capital One parallelized GAN training with hundreds of trials using a bandit-based scheduler, cutting tuning from weeks to under 1 day and achieving a 30% higher success rate than manual tuning