Training Infrastructure & Pipelines • Hyperparameter Optimization at Scale
Core HPO Algorithms: Random vs. Bayesian vs. Multi-Fidelity
Random and grid search serve as robust baselines: they are embarrassingly parallel but sample-inefficient. Grid search scales exponentially with dimensionality (10 values for each of 5 hyperparameters requires 10^5 = 100,000 trials), making it impractical beyond 3 to 4 dimensions. Random search finds good regions faster by sampling the space uniformly, but without pruning it wastes budget evaluating unpromising configurations at full fidelity.
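To make the scaling argument concrete, here is a minimal sketch; the five-hyperparameter space and its value grids are hypothetical. The Cartesian product behind grid search reaches 10^5 configurations, while random search draws only a fixed budget of samples from the same space.

```python
import itertools
import random

# Hypothetical 5-dimensional search space, 10 values per hyperparameter.
space = {
    "lr":           [10 ** -i for i in range(1, 11)],
    "batch_size":   [2 ** i for i in range(4, 14)],
    "dropout":      [i / 10 for i in range(10)],
    "weight_decay": [10 ** -i for i in range(1, 11)],
    "warmup_steps": [100 * i for i in range(1, 11)],
}

# Grid search: the Cartesian product explodes to 10^5 = 100,000 trials.
grid = list(itertools.product(*space.values()))
print(len(grid))  # 100000

# Random search: sample a fixed budget of configurations uniformly instead.
def sample_config(space, rng):
    return {name: rng.choice(values) for name, values in space.items()}

rng = random.Random(0)
budget = 100
candidates = [sample_config(space, rng) for _ in range(budget)]
```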
Bayesian Optimization (BO) builds a surrogate model of the objective using Gaussian Processes, Tree-structured Parzen Estimators, or random-forest surrogates, then selects new points via acquisition functions such as Expected Improvement. BO shines when each evaluation is expensive (minutes to hours) and parallelism is moderate, at 8 to 64 concurrent trials. Meta's Ax system commonly seeds with 20 to 50 quasi-random Sobol evaluations, then iterates with BO in batches of 8 to 64. For ranking models where a single trial takes 2 to 8 GPU-hours on 8 GPUs, a 100-trial campaign runs in under a day on a 256-GPU pool with pruning, versus multiple days without. The tradeoff is that BO requires careful handling of categorical and conditional spaces and can overfit noisy objectives.
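The loop below is a toy illustration of the surrogate-plus-acquisition pattern, not Ax's implementation: a scikit-learn Gaussian Process models a hypothetical one-dimensional objective (validation loss as a function of log learning rate), and Expected Improvement selects the next point from a random candidate pool.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def toy_objective(lr):
    # Stand-in for an expensive training run: noisy validation loss vs. log-lr.
    return (np.log10(lr) + 2.5) ** 2 + 0.05 * np.random.randn()

def expected_improvement(mu, sigma, best, xi=0.01):
    # EI for minimization: expected amount by which we beat the incumbent.
    imp = best - mu - xi
    z = imp / np.maximum(sigma, 1e-9)
    return imp * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
# Seed with initial points (plain uniform here for brevity; Sobol in practice).
X = rng.uniform(-5, -1, size=(10, 1))           # log10(learning rate)
y = np.array([toy_objective(10 ** x[0]) for x in X])

for _ in range(30):                              # 30 BO iterations
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, y)
    cand = rng.uniform(-5, -1, size=(1000, 1))   # candidate pool
    mu, sigma = gp.predict(cand, return_std=True)
    x_next = cand[np.argmax(expected_improvement(mu, sigma, y.min()))]
    X = np.vstack([X, x_next])
    y = np.append(y, toy_objective(10 ** x_next[0]))

print("best log10(lr):", X[np.argmin(y)][0])
```

Batch variants such as qEI propose several candidates per iteration so that 8 to 64 workers can evaluate in parallel.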
Bandit-based multi-fidelity methods like Successive Halving, Hyperband, and the Asynchronous Successive Halving Algorithm (ASHA) allocate small budgets (a fraction of epochs or a subset of data) to many configs, then promote only the top quantiles to larger budgets. ASHA typically defines 3 to 5 rungs with a downsampling factor of 3 to 5 per rung, promoting the top 20 to 30% at each checkpoint. This approach prunes 70 to 95% of trials after they consume only 10 to 30% of their full budget. Multi-fidelity search scales to hundreds or thousands of workers and achieves near-linear wall-clock speedup because workers never wait for stragglers. The limitation is that it requires a meaningful fidelity axis, such as epochs, gradient steps, or data fraction, and can occasionally prune late bloomers that start slow but converge better.
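A minimal synchronous successive-halving sketch conveys the promotion logic; ASHA's asynchronous bookkeeping and checkpoint resumption are omitted, and `train_for`, the rung count, and the reduction factor are illustrative assumptions.

```python
import random

def train_for(config, epochs):
    # Placeholder for partial training: returns validation loss after `epochs` epochs.
    # A real system would resume from a checkpoint rather than retrain from scratch.
    return random.random() / epochs + config["lr"]

def successive_halving(configs, min_epochs=1, reduction_factor=4, rungs=4):
    """Promote the top 1/reduction_factor of survivors at each rung."""
    survivors = list(configs)
    epochs = min_epochs
    for rung in range(rungs):
        scored = sorted(survivors, key=lambda c: train_for(c, epochs))
        keep = max(1, len(scored) // reduction_factor)  # top ~25% with factor 4
        survivors = scored[:keep]
        epochs *= reduction_factor                      # survivors get a larger budget
        print(f"rung {rung}: kept {keep} configs, next budget {epochs} epochs")
    return survivors[0]

random.seed(0)
configs = [{"lr": random.uniform(1e-4, 1e-1)} for _ in range(64)]
best = successive_halving(configs)
```

With 64 starting configs and a reduction factor of 4, only one config ever reaches the full budget, which is where the 60 to 70% cost savings come from.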
💡 Key Takeaways
• Bayesian Optimization reduces full-fidelity evaluations by 3 to 10x compared to random search, but is only effective with 8 to 64 concurrent workers; larger batches degrade suggestion quality unless trust regions are used
• ASHA prunes 70 to 95% of configs after they consume only 10 to 30% of their budget by promoting only the top 20 to 30% at each rung, cutting costs by 60 to 70% while scaling to hundreds of workers
• Grid search becomes impractical beyond 3 to 4 dimensions because the trial count grows exponentially; 10 values per dimension across 5 hyperparameters requires 100,000 evaluations
• Multi-fidelity methods require a meaningful correlation between the cheap proxy (e.g., 10% of epochs) and the expensive full evaluation; if that correlation falls below roughly 0.6, pruning decisions become unreliable (see the sketch after this list)
• Bayesian Optimization can overfit noisy objectives; the symptom is a surrogate that predicts improvements which do not materialize, calling for robustness-aware acquisition functions and metric smoothing
• Random search remains competitive when the search space has fewer than 4 dimensions or evaluation is cheap (under 5 minutes), because BO's surrogate-modeling overhead exceeds its savings
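As a sanity check on the proxy-correlation caveat above, one can run a small pilot study at both fidelities and measure rank correlation before enabling aggressive pruning. The scores below are hypothetical, and the 0.6 threshold mirrors the takeaway.

```python
from scipy.stats import spearmanr

# Hypothetical pilot study: validation loss for the same 12 configs
# evaluated at 10% of epochs (cheap proxy) and at the full budget.
proxy_scores = [0.92, 0.88, 0.95, 0.81, 0.90, 0.79, 0.85, 0.93, 0.77, 0.84, 0.89, 0.80]
full_scores  = [0.61, 0.58, 0.66, 0.52, 0.60, 0.50, 0.55, 0.64, 0.49, 0.54, 0.59, 0.51]

rho, _ = spearmanr(proxy_scores, full_scores)
# If rank correlation is weak (below ~0.6), early rungs are poor predictors of
# final quality and aggressive pruning risks discarding late-blooming configs.
print(f"Spearman rho = {rho:.2f}; safe to prune aggressively: {rho >= 0.6}")
```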
📌 Examples
Meta's Ax uses batch acquisitions with q-Expected Improvement (qEI) for 8 to 64 concurrent trials; for a ranking model taking 2 to 8 GPU-hours per trial, a 100-trial campaign completes in under 24 hours on a 256-GPU pool
Google Vizier applies a median stopping rule that prunes trials performing worse than the median of other trials at the 10%, 30%, and 50% progress checkpoints, eliminating 60 to 90% of trials before they reach full budget (a simplified sketch follows these examples)
DeepMind published Population Based Training, which runs a population of 20 to 80 workers, exploiting and exploring every 1 to 5 epochs, and achieves a 1.5 to 3x wall-clock speedup on reinforcement learning and language modeling tasks
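Below is a simplified sketch of a median stopping rule in the spirit of the Vizier example, not Vizier's actual implementation; the checkpoint fractions and the learning-curve data structure are assumptions.

```python
import statistics

def should_stop(trial_curve, completed_curves, checkpoints=(0.1, 0.3, 0.5)):
    """Stop a trial if its objective (lower is better) at a checkpoint is worse
    than the median of completed trials' objectives at the same checkpoint.

    trial_curve / completed_curves map progress fraction -> validation loss.
    """
    for frac in checkpoints:
        if frac not in trial_curve:
            continue  # trial has not reached this checkpoint yet
        peers = [c[frac] for c in completed_curves if frac in c]
        if peers and trial_curve[frac] > statistics.median(peers):
            return True
    return False

# Hypothetical learning curves from three completed trials and one running trial.
completed = [
    {0.1: 0.90, 0.3: 0.70, 0.5: 0.60},
    {0.1: 0.85, 0.3: 0.65, 0.5: 0.55},
    {0.1: 0.95, 0.3: 0.80, 0.5: 0.70},
]
running = {0.1: 0.97, 0.3: 0.85}
print(should_stop(running, completed))  # True: worse than the median at 10% and 30%
```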