We study adaptive learning rate scheduling for norm-constrained optimizers such as Muon and Lion. We introduce a generalized smoothness assumption under which local curvature decreases with the suboptimality gap, and we empirically verify that this behavior holds along optimization trajectories. Under this assumption, we establish convergence guarantees for an appropriate choice of learning rate, in which warm-up followed by decay emerges naturally from the proof rather than being imposed heuristically. Building on this theory, we develop a practical learning rate scheduler that relies only on standard hyperparameters and automatically adapts the warm-up duration at the beginning of training. We evaluate this method on large language model pretraining with LLaMA architectures and show that our adaptive warm-up selection consistently outperforms, or at least matches, the best manually tuned warm-up schedules across all considered setups, without additional hyperparameter search. Our source code is available at https://github.com/brain-lab-research/llm-baselines/tree/warmup.