Hyperparameter Optimization at Scale
Production HPO System Architecture
A production HPO system splits into a control plane and a data plane. The control plane contains the suggestion service, which maintains study state, search-algorithm logic, and acquisition decisions. For Bayesian Optimization, this implements batch acquisition functions such as q-Expected Improvement, with penalization around pending and already-evaluated points to avoid redundant exploration. For categorical and conditional spaces (optimizer-specific hyperparameters such as Adam's beta values, which only matter when the optimizer is Adam), it uses tree-structured or mixed surrogate models. The suggestion service must support asynchronous requests, where workers pull suggestions without waiting for others to complete.
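To make the batch-acquisition idea concrete, here is a minimal pure-Python sketch of greedy batch selection with local penalization around pending points. The surrogate-free `expected_improvement` stand-in, the Gaussian penalty radius, and the candidate format are illustrative assumptions, not Vizier's or Ax's actual internals.

```python
import math
import random

def expected_improvement(x, observations):
    """Toy stand-in for a surrogate acquisition value (higher = better).
    A real system would query a GP or TPE posterior; this one just rewards
    candidates that look better than the incumbent and sit far from data."""
    if not observations:
        return 1.0
    best = min(y for _, y in observations)           # incumbent (lower loss = better)
    num = den = 0.0
    for xi, yi in observations:
        w = math.exp(-sum((a - b) ** 2 for a, b in zip(x, xi)) / 0.1)
        num += w * yi
        den += w
    predicted = num / den                            # distance-weighted mean of observed losses
    uncertainty = 1.0 / (1.0 + den)                  # far from data => more uncertainty
    return max(best - predicted, 0.0) + 0.5 * uncertainty

def penalized_batch(candidates, observations, pending, batch_size, radius=0.2):
    """Greedily pick a batch, down-weighting the acquisition near pending points
    (and near points already chosen for this batch) so parallel workers do not
    receive near-identical configs."""
    selected, busy = [], list(pending)
    for _ in range(batch_size):
        def score(x):
            acq = expected_improvement(x, observations)
            for p in busy:
                d2 = sum((a - b) ** 2 for a, b in zip(x, p))
                acq *= 1.0 - math.exp(-d2 / (2 * radius ** 2))   # local penalization
            return acq
        choice = max(candidates, key=score)
        selected.append(choice)
        busy.append(choice)          # treat the new pick as pending for the next pick
    return selected

# Usage: suggest 4 configs for idle workers while one trial is still pending.
random.seed(0)
candidates = [(random.random(), random.random()) for _ in range(200)]
observations = [((0.3, 0.7), 0.42), ((0.8, 0.2), 0.35)]
pending = [(0.81, 0.19)]
print(penalized_batch(candidates, observations, pending, batch_size=4))
```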
The scheduler places trials on compute using bin packing for multi-GPU trials, enforces quotas for multi-tenant fairness, and implements preemption policies. At scale, with 256 to 512 workers, a centralized scheduler becomes a throughput bottleneck. Production systems shard the suggest and evaluate services, or use eventual consistency for non-critical telemetry, to avoid queueing. The early stopping service consumes partial metrics streamed from workers and applies pruning rules. For ASHA, it maintains rungs at increasing fidelities (for example, rung 1 at 10% of the budget, rung 2 at 30%, rung 3 at 100%) and promotes the top 20 to 30% based on an intermediate objective such as validation loss or Mean Average Precision at step 5000.
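A minimal sketch of ASHA-style asynchronous promotion follows, assuming the rung budgets from the text (10%, 30%, 100% of the full budget) and a promotion fraction of roughly the top third; the class and method names are hypothetical, not a specific library's API.

```python
from collections import defaultdict

class AshaScheduler:
    """Illustrative ASHA-style scheduler: rungs at increasing fidelity,
    promoting roughly the top 1/eta of each rung as results arrive."""

    def __init__(self, rung_budgets=(0.1, 0.3, 1.0), eta=3):
        self.rung_budgets = rung_budgets       # fraction of the full budget per rung
        self.eta = eta                         # promote the top 1/eta of each rung
        self.results = defaultdict(dict)       # rung -> {trial_id: objective}
        self.promoted = defaultdict(set)       # rung -> trial_ids already promoted

    def report(self, trial_id, rung, objective):
        """A worker reports its intermediate objective (lower is better) at a rung."""
        self.results[rung][trial_id] = objective

    def next_promotion(self, rung):
        """Asynchronously promote the best not-yet-promoted trial currently in the
        top 1/eta of this rung, without waiting for the rung to fill (the 'A' in ASHA)."""
        if rung + 1 >= len(self.rung_budgets):
            return None                        # already at full fidelity
        scores = self.results[rung]
        if not scores:
            return None
        k = max(1, len(scores) // self.eta)    # current size of the top 1/eta
        for trial_id in sorted(scores, key=scores.get)[:k]:
            if trial_id not in self.promoted[rung]:
                self.promoted[rung].add(trial_id)
                return trial_id, self.rung_budgets[rung + 1]   # train to the next fidelity
        return None

# Usage: six trials report validation loss at rung 0 (10% budget); the best one
# is promoted to the 30% rung without waiting for stragglers.
sched = AshaScheduler()
for tid, loss in enumerate([0.92, 0.61, 0.75, 0.58, 0.88, 0.66]):
    sched.report(trial_id=tid, rung=0, objective=loss)
print(sched.next_promotion(rung=0))            # -> (3, 0.3)
```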
The data plane consists of workers that pull suggestions, run training with periodic metric reporting every N steps (typically every 100 to 500 gradient steps, or every 30 to 60 seconds), checkpoint frequently to survive spot preemptions, and expose health signals such as loss divergence or out-of-memory errors. The metadata store persists search spaces, seeds, suggestions, intermediate metrics, checkpoints, and final artifacts, with full lineage connecting each model to its hyperparameters, dataset version, and code commit. Google Vizier and Meta Ax both emphasize this lineage for reproducibility and auditability, which is critical when debugging why a production model's quality degraded.
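The worker loop and lineage record might look like the sketch below. The injected callables (`train_step`, `evaluate`, `save_checkpoint`, `report_metric`) and the lineage fields are assumptions for illustration, not the API of Vizier, Ax, or any particular metadata store.

```python
import time
from dataclasses import dataclass, field

@dataclass
class TrialLineage:
    """Everything needed to reproduce and audit a trial end to end."""
    trial_id: str
    hyperparameters: dict
    dataset_version: str
    code_commit: str
    seed: int
    metrics: list = field(default_factory=list)       # (step, name, value) tuples
    checkpoints: list = field(default_factory=list)   # checkpoint URIs or paths

def run_trial(suggestion, train_step, evaluate, save_checkpoint, report_metric,
              total_steps, report_every=200, checkpoint_every_s=300):
    """Train with periodic metric reporting (every `report_every` steps) and
    time-based checkpointing so a spot preemption loses at most a few minutes."""
    lineage = TrialLineage(
        trial_id=suggestion["trial_id"],
        hyperparameters=suggestion["params"],
        dataset_version=suggestion["dataset_version"],
        code_commit=suggestion["code_commit"],
        seed=suggestion["seed"],
    )
    last_ckpt = time.monotonic()
    for step in range(1, total_steps + 1):
        train_step(step, suggestion["params"])
        if step % report_every == 0:
            val = evaluate(step)
            lineage.metrics.append((step, "val_loss", val))
            report_metric(lineage.trial_id, step, val)    # feeds the early stopping service
        if time.monotonic() - last_ckpt >= checkpoint_every_s:
            lineage.checkpoints.append(save_checkpoint(lineage.trial_id, step))
            last_ckpt = time.monotonic()
    return lineage
```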
💡 Key Takeaways
• Suggestion service becomes a throughput bottleneck above 1000 workers sending metric updates and requesting new configs; sharding or eventual consistency is required to avoid queueing and GPU underutilization
• ASHA maintains 3 to 5 rungs at increasing fidelity with a downsampling factor of 3 to 5 per rung; promoting the top 20 to 30% at each rung achieves 70 to 95% pruning after 10 to 30% of the full budget
• Workers checkpoint every 2 to 10 minutes to survive spot preemptions; checkpoint overhead (storage I/O) must stay under 5% of training time or it degrades throughput
• Metadata store must capture full lineage including random seeds, dataset snapshots, feature versions, and code commits; without this, debugging a 2 to 5% metric variance between runs becomes impossible
• Batch acquisition for Bayesian Optimization penalizes regions around pending points to avoid redundant exploration; without penalization, parallel workers often evaluate near-identical configs, wasting budget
• Early stopping service needs to handle noisy metrics by smoothing over windows (a moving average of the last 500 steps) or using quantile-based promotion to avoid pruning late bloomers that converge slowly; see the sketch after this list
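A minimal sketch of the smoothing idea from the last takeaway: compare a moving average of a trial's recent validation losses against a quantile of peer trials at the same reporting step, rather than pruning on a single noisy reading. The window size, the 70th-percentile cutoff, and the helper names are illustrative choices.

```python
from collections import deque
from statistics import quantiles

def smoothed(values, window=500):
    """Moving average over the most recent `window` reported values."""
    recent = deque(values, maxlen=window)
    return sum(recent) / len(recent)

def should_prune(trial_losses, peer_smoothed_losses, window=500):
    """Prune only if the trial's smoothed loss is worse than roughly the 70th
    percentile of peer trials' smoothed losses at the same reporting step."""
    if len(peer_smoothed_losses) < 5:                 # too few peers to judge reliably
        return False
    trial_score = smoothed(trial_losses, window)
    cutoff = quantiles(peer_smoothed_losses, n=10)[6]  # ~70th percentile
    return trial_score > cutoff

# Usage: a late bloomer whose most recent raw reading spikes to 0.9 is kept
# (returns False) because its smoothed trajectory is still competitive.
trial = [0.8 - 0.0005 * i for i in range(400)] + [0.9]
peers = [0.55, 0.60, 0.62, 0.65, 0.70, 0.75, 0.80]
print(should_prune(trial, peers))
```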
📌 Examples
Google Vizier implements asynchronous suggestions with a median stopping rule (sketched after these examples); for expensive deep models, batches of 8 to 32 parallel trials reduce full-fidelity evaluations by 3 to 10x versus random search
Uber Michelangelo AutoTune provides study-level APIs to define search spaces and objectives, supporting warm starts from historical metadata and cost-aware scheduling across heterogeneous CPU and GPU pools
Netflix workflow engine orchestrates thousands of parallel tasks on elastic cloud capacity, with checkpointing to survive spot preemptions that affect 50 to 70% of instances within 2 hours
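For the median stopping rule mentioned in the Vizier example, a minimal sketch follows: stop a trial at a given step if its best objective so far is worse than the median of other trials' running-average objectives up to the same step. The data layout and helper names are assumptions; Vizier's production implementation differs in detail.

```python
from statistics import median

def running_average(losses_up_to_step):
    """Running mean of a trial's reported losses up to a given report index."""
    return sum(losses_up_to_step) / len(losses_up_to_step)

def median_stop(trial_losses, peer_histories, step):
    """trial_losses: this trial's reported losses up to `step` (lower is better).
    peer_histories: other trials' loss histories, each a list indexed by report."""
    peers_at_step = [running_average(h[:step]) for h in peer_histories if len(h) >= step]
    if not peers_at_step:
        return False                      # nothing to compare against yet
    best_so_far = min(trial_losses[:step])
    return best_so_far > median(peers_at_step)

# Usage: a trial stuck near 0.9 is stopped at report 5, since the median of the
# peers' running averages has already dropped well below that.
peers = [[1.0, 0.8, 0.6, 0.5, 0.45], [0.9, 0.7, 0.6, 0.55, 0.5], [1.1, 0.9, 0.8, 0.7, 0.6]]
print(median_stop([0.95, 0.92, 0.91, 0.9, 0.9], peers, step=5))   # -> True
```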