Training Infrastructure & Pipelines: Hyperparameter Optimization at Scale

Production HPO System Architecture

Control Plane

A production HPO system splits into a control plane and a data plane. The control plane contains the suggestion service, which maintains study state, search-algorithm logic, and acquisition decisions. For Bayesian optimization, this implements batch acquisition functions such as q-Expected Improvement (qEI) with penalization around pending and evaluated points to avoid redundant exploration. For categorical and conditional spaces, it uses tree-structured or mixed surrogate models. The suggestion service must support asynchronous requests, where workers pull suggestions without waiting for others.
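A minimal sketch of the penalization idea, assuming a NumPy array of candidate points with a posterior mean and standard deviation already computed by the surrogate (all function and parameter names here are illustrative, not a specific library's API):

```python
import numpy as np
from math import erf

def expected_improvement(mu, sigma, best):
    """EI for minimization, given posterior mean/std at each candidate."""
    sigma = np.maximum(sigma, 1e-9)
    z = (best - mu) / sigma
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    cdf = 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2)))
    return (best - mu) * cdf + sigma * pdf

def penalized_scores(candidates, mu, sigma, best, pending, radius=0.1):
    """Down-weight EI near pending evaluations so parallel workers
    do not pile onto near-identical configs."""
    scores = expected_improvement(mu, sigma, best)
    for p in pending:
        dist = np.linalg.norm(candidates - p, axis=1)
        # hard local penalizer: zero at a pending point, 1 beyond `radius`
        scores = scores * np.clip(dist / radius, 0.0, 1.0)
    return scores
```

Each worker that requests a suggestion would receive the argmax of the penalized scores, with its point then added to `pending` for subsequent requests.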

Scheduler and Early Stopping

The scheduler places trials on compute using bin packing for multi-GPU trials, enforces quotas for multi-tenant fairness, and implements preemption policies. At scale, with 256 to 512 workers, a centralized scheduler becomes a throughput bottleneck, so production systems shard the suggest and evaluate services. The early-stopping service consumes partial metrics streamed from workers and applies pruning rules. For ASHA, it maintains rungs at increasing fidelities and promotes the top 20 to 30 percent of trials at each rung based on the intermediate objective.
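The rung-and-promotion logic can be sketched as a simplified asynchronous successive-halving core (class and method names are illustrative; a production version would also track which trials were already promoted):

```python
from collections import defaultdict

class ASHA:
    """Minimal asynchronous successive halving sketch (minimization).

    Trials report (trial_id, rung, loss); a trial is promoted to the next
    rung when it ranks in the top 1/eta of all results seen at its rung.
    """
    def __init__(self, eta=4, max_rung=4):
        self.eta = eta                        # downsampling factor, typically 3-5
        self.max_rung = max_rung
        self.results = defaultdict(list)      # rung -> list of (loss, trial_id)

    def report(self, trial_id, rung, loss):
        """Record a result; return True if the trial should be promoted."""
        self.results[rung].append((loss, trial_id))
        if rung >= self.max_rung:
            return False                      # already at full fidelity
        ranked = sorted(self.results[rung])
        k = max(1, len(ranked) // self.eta)   # top ~25% for eta=4
        return any(t == trial_id for _, t in ranked[:k])
```

Note the asynchronous behavior: the first trial to report at a rung is promoted immediately rather than waiting for a full cohort, which is what keeps workers busy at scale.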

Data Plane

The data plane consists of workers that pull suggestions, run training with periodic metric reporting (typically every 100 to 500 gradient steps), checkpoint frequently to survive spot preemptions, and expose health signals such as loss divergence or out-of-memory errors.
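A worker's inner loop under these assumptions might look like the following sketch, where `train_step`, `report_metric`, and `save_checkpoint` are hypothetical callbacks into the trainer and the control plane:

```python
import math
import time

def run_trial(suggestion, train_step, report_metric, save_checkpoint,
              total_steps=10_000, report_every=200, ckpt_every_s=300):
    """Worker loop: train, stream metrics every N steps, checkpoint on a timer."""
    last_ckpt = time.monotonic()
    for step in range(1, total_steps + 1):
        loss = train_step(suggestion, step)
        if math.isnan(loss) or math.isinf(loss):
            report_metric(step, float("inf"))   # surface divergence as a health signal
            break
        if step % report_every == 0:
            report_metric(step, loss)           # feeds the early-stopping service
        if time.monotonic() - last_ckpt > ckpt_every_s:
            save_checkpoint(step)               # survive spot preemption
            last_ckpt = time.monotonic()
```

Checkpointing is on a wall-clock timer rather than a step counter so that slow trials (large models, small batches) still checkpoint often enough to bound lost work.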

Metadata and Lineage

The metadata store persists search spaces, seeds, suggestions, intermediate metrics, checkpoints, and final artifacts, with full lineage connecting each model to its hyperparameters, dataset version, and code commit. This lineage is critical for reproducibility and auditability when debugging quality degradation in production models.
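One way to persist such a record, sketched with an illustrative (non-standard) schema, where a content fingerprint also makes identical reruns easy to detect:

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class TrialLineage:
    """Lineage record linking a trained artifact back to everything that
    produced it. Field names are illustrative, not a standard schema."""
    study_id: str
    trial_id: str
    hyperparameters: dict
    seed: int
    dataset_version: str
    code_commit: str
    checkpoint_uri: str

    def fingerprint(self) -> str:
        """Stable hash over all lineage fields, for deduplicating reruns."""
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:16]
```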

💡 Key Takeaways
- The suggestion service becomes a throughput bottleneck above roughly 1,000 workers sending metric updates and requesting new configs; sharding or eventual consistency is required to avoid queueing and GPU underutilization
- ASHA maintains 3 to 5 rungs at increasing fidelity with a downsampling factor of 3 to 5 per rung; promoting the top 20 to 30% at each rung achieves 70 to 95% pruning after 10 to 30% of the budget
- Workers checkpoint every 2 to 10 minutes to survive spot preemptions; checkpoint overhead (storage I/O) must stay under 5% of training time or it degrades throughput
- The metadata store must capture full lineage, including random seeds, dataset snapshots, feature versions, and code commits; without this, debugging 2 to 5% metric variance between runs becomes impossible
- Batch acquisition for Bayesian optimization penalizes regions around pending points to avoid redundant exploration; without penalization, parallel workers often evaluate near-identical configs, wasting budget
- The early-stopping service must handle noisy metrics by smoothing over windows (e.g., a moving average of the last 500 steps) or using quantile-based promotions, to avoid pruning late bloomers that converge slowly
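The smoothing-plus-quantile idea from the last takeaway can be sketched as follows; the window size and keep-quantile are illustrative, and the peer comparison is simplified to a single rung:

```python
from collections import deque
import statistics

class SmoothedPruner:
    """Decide pruning from a moving average of the last `window` reported
    losses, compared against peer trials' smoothed losses (minimization)."""
    def __init__(self, window=500, keep_quantile=0.3):
        self.window = window
        self.keep_quantile = keep_quantile
        self.histories = {}                   # trial_id -> deque of recent losses

    def report(self, trial_id, loss):
        h = self.histories.setdefault(trial_id, deque(maxlen=self.window))
        h.append(loss)

    def should_prune(self, trial_id):
        smoothed = {t: statistics.fmean(h) for t, h in self.histories.items()}
        if len(smoothed) < 4:
            return False                      # too few peers to compare against
        ranked = sorted(smoothed, key=smoothed.get)
        keep = max(1, int(len(ranked) * self.keep_quantile))
        return trial_id not in ranked[:keep]  # prune trials outside the top quantile
```

Because each decision uses the windowed mean rather than the latest point, a single noisy spike cannot push an otherwise healthy trial below the promotion quantile.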
📌 Interview Tips
1. Google Vizier implements asynchronous suggestions with a median stopping rule; for expensive deep models, batches of 8 to 32 parallel trials reduce full-fidelity evaluations by 3 to 10x versus random search
2. Uber's Michelangelo AutoTune provides study-level APIs to define search spaces and objectives, supporting warm starts from historical metadata and cost-aware scheduling across heterogeneous CPU and GPU pools
3. Netflix's workflow engine orchestrates thousands of parallel tasks on elastic cloud capacity, with checkpointing to survive spot preemptions affecting 50 to 70% of instances within 2 hours