Training Infrastructure & Pipelines • Hyperparameter Optimization at Scale
Core HPO Algorithms: Random vs. Bayesian vs. Multi-Fidelity
Random and grid search serve as robust baselines: they are embarrassingly parallel but sample-inefficient. Grid search scales exponentially with dimensionality (10 values for each of 5 hyperparameters requires 10^5 = 100,000 trials), making it impractical beyond 3 to 4 dimensions. Random search finds good regions faster by sampling the space uniformly, but without pruning it wastes budget evaluating unpromising configurations at full fidelity.
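To make the scaling argument concrete, here is a minimal sketch; the five-hyperparameter space and its value grids are hypothetical. The Cartesian product behind grid search reaches 10^5 configurations, while random search draws only a fixed budget of samples from the same space.

```python
import itertools
import random

# Hypothetical 5-dimensional search space, 10 values per hyperparameter.
space = {
    "lr":           [10 ** -i for i in range(1, 11)],
    "batch_size":   [2 ** i for i in range(4, 14)],
    "dropout":      [i / 10 for i in range(10)],
    "weight_decay": [10 ** -i for i in range(1, 11)],
    "warmup_steps": [100 * i for i in range(1, 11)],
}

# Grid search: the Cartesian product explodes to 10^5 = 100,000 trials.
grid = list(itertools.product(*space.values()))
print(len(grid))  # 100000

# Random search: sample a fixed budget of configurations uniformly instead.
def sample_config(space, rng):
    return {name: rng.choice(values) for name, values in space.items()}

rng = random.Random(0)
budget = 100
candidates = [sample_config(space, rng) for _ in range(budget)]
```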
Bayesian Optimization (BO) builds a surrogate model of the objective using Gaussian Processes, Tree-structured Parzen Estimators, or random-forest surrogates, then selects new points via acquisition functions such as Expected Improvement. BO shines when each evaluation is expensive (minutes to hours) and parallelism is moderate, at 8 to 64 concurrent trials. Meta's Ax system commonly seeds with 20 to 50 quasi-random Sobol evaluations, then iterates with BO in batches of 8 to 64. For ranking models where a single trial takes 2 to 8 GPU-hours on 8 GPUs, a 100-trial campaign runs in under a day on a 256-GPU pool with pruning, versus multiple days without. The tradeoff is that BO requires careful handling of categorical and conditional spaces and can overfit noisy objectives.
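The loop below is a toy illustration of the surrogate-plus-acquisition pattern, not Ax's implementation: a scikit-learn Gaussian Process models a hypothetical one-dimensional objective (validation loss as a function of log learning rate), and Expected Improvement selects the next point from a random candidate pool.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def toy_objective(lr):
    # Stand-in for an expensive training run: noisy validation loss vs. log-lr.
    return (np.log10(lr) + 2.5) ** 2 + 0.05 * np.random.randn()

def expected_improvement(mu, sigma, best, xi=0.01):
    # EI for minimization: expected amount by which we beat the incumbent.
    imp = best - mu - xi
    z = imp / np.maximum(sigma, 1e-9)
    return imp * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
# Seed with initial points (plain uniform here for brevity; Sobol in practice).
X = rng.uniform(-5, -1, size=(10, 1))           # log10(learning rate)
y = np.array([toy_objective(10 ** x[0]) for x in X])

for _ in range(30):                              # 30 BO iterations
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, y)
    cand = rng.uniform(-5, -1, size=(1000, 1))   # candidate pool
    mu, sigma = gp.predict(cand, return_std=True)
    x_next = cand[np.argmax(expected_improvement(mu, sigma, y.min()))]
    X = np.vstack([X, x_next])
    y = np.append(y, toy_objective(10 ** x_next[0]))

print("best log10(lr):", X[np.argmin(y)][0])
```

Batch variants such as qEI propose several candidates per iteration so that 8 to 64 workers can evaluate in parallel.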
Bandit-based multi-fidelity methods like Successive Halving, Hyperband, and the Asynchronous Successive Halving Algorithm (ASHA) allocate small budgets (a fraction of epochs or a subset of data) to many configs, then promote only the top quantiles to larger budgets. ASHA typically defines 3 to 5 rungs with a downsampling factor of 3 to 5 per rung, promoting the top 20 to 30% at each checkpoint. This approach prunes 70 to 95% of trials after they consume only 10 to 30% of their full budget. Multi-fidelity search scales to hundreds or thousands of workers and achieves near-linear wall-clock speedup because workers never wait for stragglers. The limitation is that it requires a meaningful fidelity axis, such as epochs, gradient steps, or data fraction, and can occasionally prune late bloomers that start slow but converge better.
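A minimal synchronous successive-halving sketch conveys the promotion logic; ASHA's asynchronous bookkeeping and checkpoint resumption are omitted, and `train_for`, the rung count, and the reduction factor are illustrative assumptions.

```python
import random

def train_for(config, epochs):
    # Placeholder for partial training: returns validation loss after `epochs` epochs.
    # A real system would resume from a checkpoint rather than retrain from scratch.
    return random.random() / epochs + config["lr"]

def successive_halving(configs, min_epochs=1, reduction_factor=4, rungs=4):
    """Promote the top 1/reduction_factor of survivors at each rung."""
    survivors = list(configs)
    epochs = min_epochs
    for rung in range(rungs):
        scored = sorted(survivors, key=lambda c: train_for(c, epochs))
        keep = max(1, len(scored) // reduction_factor)  # top ~25% with factor 4
        survivors = scored[:keep]
        epochs *= reduction_factor                      # survivors get a larger budget
        print(f"rung {rung}: kept {keep} configs, next budget {epochs} epochs")
    return survivors[0]

random.seed(0)
configs = [{"lr": random.uniform(1e-4, 1e-1)} for _ in range(64)]
best = successive_halving(configs)
```

With 64 starting configs and a reduction factor of 4, only one config ever reaches the full budget, which is where the 60 to 70% cost savings come from.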
💡 Key Takeaways
• Bayesian Optimization reduces full-fidelity evaluations by 3 to 10x compared to random search, but is only effective with 8 to 64 concurrent workers; larger batches degrade suggestion quality unless trust regions are used
• ASHA prunes 70 to 95% of configs after they consume only 10 to 30% of their budget by promoting only the top 20 to 30% at each rung, cutting costs by 60 to 70% while scaling to hundreds of workers
• Grid search becomes impractical beyond 3 to 4 dimensions because the trial count grows exponentially; 10 values per dimension across 5 hyperparameters requires 100,000 evaluations
• Multi-fidelity methods require a meaningful correlation between the cheap proxy (e.g., 10% of epochs) and the expensive full evaluation; if that correlation falls below roughly 0.6, pruning decisions become unreliable (see the sketch after this list)
• Bayesian Optimization can overfit noisy objectives; the symptom is a surrogate that predicts improvements which do not materialize, calling for robustness-aware acquisition functions and metric smoothing
• Random search remains competitive when the search space has fewer than 4 dimensions or evaluation is cheap (under 5 minutes), because BO's surrogate-modeling overhead exceeds its savings
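As a sanity check on the proxy-correlation caveat above, one can run a small pilot study at both fidelities and measure rank correlation before enabling aggressive pruning. The scores below are hypothetical, and the 0.6 threshold mirrors the takeaway.

```python
from scipy.stats import spearmanr

# Hypothetical pilot study: validation loss for the same 12 configs
# evaluated at 10% of epochs (cheap proxy) and at the full budget.
proxy_scores = [0.92, 0.88, 0.95, 0.81, 0.90, 0.79, 0.85, 0.93, 0.77, 0.84, 0.89, 0.80]
full_scores  = [0.61, 0.58, 0.66, 0.52, 0.60, 0.50, 0.55, 0.64, 0.49, 0.54, 0.59, 0.51]

rho, _ = spearmanr(proxy_scores, full_scores)
# If rank correlation is weak (below ~0.6), early rungs are poor predictors of
# final quality and aggressive pruning risks discarding late-blooming configs.
print(f"Spearman rho = {rho:.2f}; safe to prune aggressively: {rho >= 0.6}")
```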
📌 Examples
Meta's Ax uses batch acquisitions with q-Expected Improvement (qEI) for 8 to 64 concurrent trials; for a ranking model taking 2 to 8 GPU-hours per trial, a 100-trial campaign completes in under 24 hours on a 256-GPU pool
Google Vizier applies a median stopping rule that prunes trials performing worse than the median of other trials at the 10%, 30%, and 50% progress checkpoints, eliminating 60 to 90% of trials before they reach full budget (a simplified sketch follows these examples)
DeepMind published Population Based Training, which runs a population of 20 to 80 workers, exploiting and exploring every 1 to 5 epochs, and achieves a 1.5 to 3x wall-clock speedup on reinforcement learning and language modeling tasks
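Below is a simplified sketch of a median stopping rule in the spirit of the Vizier example, not Vizier's actual implementation; the checkpoint fractions and the learning-curve data structure are assumptions.

```python
import statistics

def should_stop(trial_curve, completed_curves, checkpoints=(0.1, 0.3, 0.5)):
    """Stop a trial if its objective (lower is better) at a checkpoint is worse
    than the median of completed trials' objectives at the same checkpoint.

    trial_curve / completed_curves map progress fraction -> validation loss.
    """
    for frac in checkpoints:
        if frac not in trial_curve:
            continue  # trial has not reached this checkpoint yet
        peers = [c[frac] for c in completed_curves if frac in c]
        if peers and trial_curve[frac] > statistics.median(peers):
            return True
    return False

# Hypothetical learning curves from three completed trials and one running trial.
completed = [
    {0.1: 0.90, 0.3: 0.70, 0.5: 0.60},
    {0.1: 0.85, 0.3: 0.65, 0.5: 0.55},
    {0.1: 0.95, 0.3: 0.80, 0.5: 0.70},
]
running = {0.1: 0.97, 0.3: 0.85}
print(should_stop(running, completed))  # True: worse than the median at 10% and 30%
```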