Training large neural networks with data-parallel stochastic gradient descent allocates N GPU replicas to compute effectively identical updates -- a practice that leaves the rich space of learning-rate configurations entirely unexplored during training. We propose Hyperparameter-Divergent Ensemble Training (HDET), a method that repurposes these replicas for simultaneous learning-rate exploration with negligible communication overhead. HDET operates in alternating phases: a fan-out stage, in which replicas train independently under a structured, symmetric spread of learning rates, and a converge stage, in which parameters are averaged across all replicas via AllReduce every T steps. Building on this ensemble substrate, we further propose an automatic learning-rate (auto-LR) controller that treats the relative training loss across replicas as a performance signal, updating the shared base schedule toward higher-performing configurations via a momentum-based, gradient-free meta-update. The combined method produces a self-adapting learning-rate schedule that improves both optimization quality and generalization without additional hyperparameter sweeps or extra training budget. Crucially, the framework generalizes beyond the learning rate: any scalar hyperparameter that does not alter the model architecture -- such as the dropout rate, attention-scale temperature, or weight-decay coefficient -- can be explored across replicas using the same fan-out/converge protocol, with inter-replica loss differences serving as zero-order hypergradients that guide the search. HDET is implemented as a drop-in replacement for PyTorch's OneCycleLR scheduler and requires no changes to the model architecture, optimizer, or data pipeline.
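To make the fan-out/converge protocol concrete, the following is a minimal sketch under a standard `torch.distributed` setup. It is illustrative only: the helper names (`fan_out_lr`, `converge_average`, `hdet_step`) and the specific log-symmetric spread and its `spread` width are assumptions of this sketch, not the paper's API; only the structure (independent per-replica learning rates, AllReduce parameter averaging every T steps) follows the description above.

```python
# Illustrative sketch of HDET's fan-out/converge cycle.
# Assumes torch.distributed is initialized and each rank holds its own
# (non-DDP) copy of the model; helper names are hypothetical.
import torch
import torch.distributed as dist


def fan_out_lr(base_lr: float, rank: int, world_size: int, spread: float = 0.5) -> float:
    """Assign each replica a learning rate from a symmetric spread around base_lr.

    Replicas sit on a log-symmetric grid: rank 0 gets the lowest LR,
    rank world_size - 1 the highest, with base_lr at the center.
    """
    if world_size == 1:
        return base_lr
    pos = 2.0 * rank / (world_size - 1) - 1.0  # position in [-1, 1]
    return base_lr * (2.0 ** (spread * pos))


@torch.no_grad()
def converge_average(model: torch.nn.Module) -> None:
    """Converge stage: average parameters across all replicas via AllReduce."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
        p.data.div_(world_size)


def hdet_step(model, optimizer, loss, step: int, base_lr: float,
              converge_every: int = 100) -> None:
    """One HDET training step: independent local update, periodic averaging."""
    rank, world_size = dist.get_rank(), dist.get_world_size()
    # Fan-out: each replica trains under its own LR; no gradient sync here.
    for group in optimizer.param_groups:
        group["lr"] = fan_out_lr(base_lr, rank, world_size)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # Converge: every T (= converge_every) steps, replace each replica's
    # weights with the cross-replica mean.
    if (step + 1) % converge_every == 0:
        converge_average(model)
```

Note that, unlike standard data-parallel SGD, gradients are never synchronized during fan-out; the only collective is the periodic parameter AllReduce, which is what keeps the communication overhead negligible.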
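The auto-LR meta-update can be sketched in the same spirit. Under the abstract's description, the controller gathers each replica's training loss, forms a zero-order hypergradient from inter-replica loss differences, and takes a momentum step on the shared base schedule. The covariance-style weighting, the log-space update, and the `meta_lr` / `momentum` constants below are assumptions of this sketch, not values from the paper.

```python
# Hypothetical sketch of the auto-LR controller's gradient-free meta-update.
# Assumption: the covariance between per-replica log-LR offsets and relative
# losses serves as the zero-order hypergradient; constants are illustrative.
import math
import torch
import torch.distributed as dist


class AutoLRController:
    def __init__(self, base_lr: float, meta_lr: float = 0.1, momentum: float = 0.9):
        self.base_lr = base_lr
        self.meta_lr = meta_lr
        self.momentum = momentum
        self.velocity = 0.0

    def update(self, local_loss: float, local_log_offset: float) -> float:
        """Meta-update after a fan-out phase.

        local_log_offset is this replica's log-LR offset from base_lr.
        Replicas whose offsets correlate with lower loss pull base_lr their way.
        """
        world_size = dist.get_world_size()
        stats = torch.tensor([local_loss, local_log_offset])
        gathered = [torch.zeros_like(stats) for _ in range(world_size)]
        dist.all_gather(gathered, stats)
        stacked = torch.stack(gathered)          # shape: (world_size, 2)
        losses, offsets = stacked[:, 0], stacked[:, 1]
        # Zero-order hypergradient: covariance of relative loss with log-LR
        # offset. A negative value means larger LRs achieved lower loss.
        rel_loss = losses - losses.mean()
        hypergrad = (rel_loss * offsets).mean().item()
        # Momentum-based, gradient-free step on log(base_lr): descend the
        # hypergradient so base_lr drifts toward better-performing offsets.
        self.velocity = self.momentum * self.velocity - self.meta_lr * hypergrad
        self.base_lr *= math.exp(self.velocity)
        return self.base_lr
```

A natural placement for this call is at each converge stage, after `converge_average`: the returned value becomes the `base_lr` fed to the next fan-out phase, so the spread of replica learning rates re-centers on the updated schedule.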