Production HPO System Architecture
Control Plane
A production HPO system splits into a control plane and a data plane. The control plane contains the suggestion service, which maintains study state, runs the search algorithm, and makes acquisition decisions. For Bayesian optimization, it implements batch acquisition functions such as q-Expected Improvement (qEI), with penalization around pending and evaluated points to avoid redundant exploration. For categorical and conditional search spaces, it uses tree-structured or mixed surrogate models. The suggestion service must support asynchronous requests, so that each worker can pull a new suggestion without waiting for other workers to finish.
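As a rough illustration of asynchronous suggestion with penalization, the sketch below scores random candidates and down-weights any candidate that lies close to a pending or evaluated point. It is a simple local-penalization heuristic, not an implementation of qEI; the class name SuggestionService, the acquisition callback, and the radius and n_candidates parameters are all hypothetical names chosen for this example.

```python
import math
import random
import threading


class SuggestionService:
    """Toy suggestion service: workers call suggest() independently of each other."""

    def __init__(self, bounds, acquisition=None, radius=0.1, n_candidates=256, seed=0):
        self.bounds = bounds                  # list of (low, high) per dimension
        # Stand-in for a surrogate-based acquisition value (assumption: higher is better).
        self.acquisition = acquisition or (lambda x: 1.0)
        self.radius = radius                  # penalization length scale, in normalized units
        self.n_candidates = n_candidates
        self.pending = {}                     # trial_id -> point still being evaluated
        self.evaluated = []                   # list of (point, objective)
        self._rng = random.Random(seed)
        self._lock = threading.Lock()         # protects study state across worker threads
        self._next_id = 0

    def _penalty(self, x, center):
        # 0 at the center, approaching 1 far away; distances are scaled by each bound's range.
        d2 = sum(((a - b) / (hi - lo)) ** 2
                 for a, b, (lo, hi) in zip(x, center, self.bounds))
        return 1.0 - math.exp(-d2 / (2.0 * self.radius ** 2))

    def _score(self, x):
        score = self.acquisition(x)
        centers = list(self.pending.values()) + [p for p, _ in self.evaluated]
        for center in centers:
            score *= self._penalty(x, center)
        return score

    def suggest(self):
        """Return (trial_id, point) immediately; never blocks on pending trials."""
        with self._lock:
            candidates = [
                tuple(self._rng.uniform(lo, hi) for lo, hi in self.bounds)
                for _ in range(self.n_candidates)
            ]
            best = max(candidates, key=self._score)
            trial_id = self._next_id
            self._next_id += 1
            self.pending[trial_id] = best
            return trial_id, best

    def report(self, trial_id, objective):
        """Move a finished trial from pending to evaluated."""
        with self._lock:
            point = self.pending.pop(trial_id)
            self.evaluated.append((point, objective))
```

A worker would call suggest(), train with the returned point, and then call report(). Because pending points are penalized just like evaluated ones, two workers that ask at nearly the same time are steered toward different regions.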
Scheduler and Early Stopping
The scheduler places trials on compute using bin packing for multi-GPU trials, enforces quotas for multi-tenant fairness, and implements preemption policies. At scale, with 256 to 512 workers, a centralized scheduler becomes a throughput bottleneck, so production systems shard the suggest and evaluate services. The early stopping service consumes partial metrics streamed from workers and applies pruning rules. For ASHA (Asynchronous Successive Halving), it maintains rungs at increasing fidelities and promotes the top 20 to 30 percent of trials in each rung based on the intermediate objective.
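The rung bookkeeping for ASHA can be sketched as follows. This is a minimal version under stated assumptions: the objective is a loss (lower is better), rung fidelities grow geometrically, and the class name AshaRungs and its parameters (min_steps, reduction, top_fraction) are hypothetical names for this example rather than any library's API.

```python
class AshaRungs:
    """Minimal ASHA-style rung tracker (assumes lower objective is better)."""

    def __init__(self, min_steps=100, reduction=4, num_rungs=4, top_fraction=0.25):
        self.min_steps = min_steps
        self.reduction = reduction
        self.top_fraction = top_fraction
        self.rungs = [dict() for _ in range(num_rungs)]     # per rung: trial_id -> loss
        self.promoted = [set() for _ in range(num_rungs)]   # per rung: trials already promoted

    def fidelity(self, rung):
        """Training budget (in steps) at which a trial reports into this rung."""
        return self.min_steps * self.reduction ** rung

    def report(self, trial_id, rung, loss):
        """Record the intermediate objective a trial reached at the given rung."""
        self.rungs[rung][trial_id] = loss

    def next_promotion(self):
        """Return (trial_id, next_rung) for a promotable trial, or None.

        Rungs are scanned from the highest downward so that promising trials
        advance as far as possible before new low-fidelity work is started.
        The top rung has nowhere to promote to, so it is skipped.
        """
        for rung in range(len(self.rungs) - 2, -1, -1):
            results = self.rungs[rung]
            k = int(len(results) * self.top_fraction)
            best = sorted(results, key=results.get)[:k]
            for trial_id in best:
                if trial_id not in self.promoted[rung]:
                    self.promoted[rung].add(trial_id)
                    return trial_id, rung + 1
        return None
```

The decision is asynchronous: next_promotion() uses whatever results have arrived so far and never waits for a rung to fill. When it returns None, the caller would typically start a fresh trial at rung 0.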
Data Plane
The data plane consists of workers that pull suggestions, run training while reporting metrics every N steps (typically every 100 to 500 gradient steps), checkpoint frequently to survive spot preemptions, and expose health signals such as loss divergence or out-of-memory errors.
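A worker loop along these lines might look like the sketch below. The training step is simulated, the checkpoint is a small JSON file written atomically, and the function names (run_trial, train_step, save_checkpoint, load_checkpoint) and the stop_at parameter used to imitate a preemption are assumptions made for this example.

```python
import json
import math
import os


def save_checkpoint(path, state):
    # Write to a temp file and rename, so a preemption mid-write cannot corrupt the checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def train_step(params, step):
    # Simulated loss: decays for small learning rates, blows up for large ones.
    lr = params["lr"]
    return math.exp(-lr * step) if lr <= 1.0 else float("inf")


def run_trial(params, ckpt_path, total_steps, report,
              report_every=100, ckpt_every=200, stop_at=None):
    """Run (or resume) one trial; returns 'completed', 'preempted', or 'diverged'."""
    step = load_checkpoint(ckpt_path)["step"]      # resume after a preemption
    while step < total_steps:
        if stop_at is not None and step >= stop_at:
            return "preempted"                     # stand-in for a spot instance being reclaimed
        loss = train_step(params, step)
        step += 1
        if not math.isfinite(loss):
            report({"step": step, "health": "diverged"})   # health signal for the control plane
            return "diverged"
        if step % report_every == 0:
            report({"step": step, "loss": loss})   # partial metric for the early stopping service
        if step % ckpt_every == 0:
            save_checkpoint(ckpt_path, {"step": step})
    return "completed"
```

In this sketch a preempted trial loses at most ckpt_every steps of work, which is the trade-off behind checkpointing frequently: shorter intervals cost more I/O but waste less compute on restart.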
Metadata and Lineage
The metadata store persists search spaces, seeds, suggestions, intermediate metrics, checkpoints, and final artifacts, with full lineage connecting each model to its hyperparameters, dataset version, and code commit. This lineage is critical for reproducibility and auditability when debugging quality degradation in production models.
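One way to picture a lineage entry is as a small immutable record whose inputs can be hashed into a fingerprint. The sketch below is illustrative only: the LineageRecord class, its field names, and the choice of SHA-256 over canonical JSON are assumptions for this example, not a description of any particular metadata store.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LineageRecord:
    """Links one trained model artifact back to everything that produced it."""
    study_id: str
    trial_id: str
    hyperparameters: dict
    seed: int
    dataset_version: str
    code_commit: str
    artifact_uri: str

    def fingerprint(self) -> str:
        # Hash only the inputs that determine the result; the artifact location is excluded.
        inputs = {
            "hyperparameters": self.hyperparameters,
            "seed": self.seed,
            "dataset_version": self.dataset_version,
            "code_commit": self.code_commit,
        }
        # Sorted keys and fixed separators give a canonical encoding, so equal inputs hash equally.
        payload = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)
```

Two records with the same fingerprint were produced from the same hyperparameters, seed, dataset version, and code commit, which gives a quick check when asking whether a degraded production model can be traced to a changed input.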