AdamW has long been the default optimizer for transformer pretraining. For many years, our community has searched for faster and more stable optimizers, with only limited success. In this work, we propose a \textbf{one-line modification in PyTorch} to any momentum-based optimizer, which we rename the cautious optimizer, e.g., C-AdamW and C-Lion. Our theoretical result shows that this modification preserves Adam's Hamiltonian function and does not break the convergence guarantee under Lyapunov analysis. In addition, our theoretical insight reveals a whole new family of optimizers. Among them, we pick the simplest one for empirical experiments, showing consistent speed-ups not only on LLM pretraining but also on image classification, with minimal extra hyperparameter tuning. Code is available at https://github.com/kyleliang919/C-Optim.
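The core idea behind the cautious modification is a masking step: components of the momentum-based update whose sign disagrees with the current gradient are zeroed out, and the surviving components are rescaled to preserve the overall step magnitude. The sketch below illustrates this masking in NumPy on a toy update vector; the exact one-line PyTorch form lives in the linked repository and may differ in details, and the function name and `eps` floor here are illustrative assumptions.

```python
import numpy as np

def cautious_update(update, grad, eps=1e-3):
    """Sketch of the 'cautious' mask (assumed form, not the repo's exact code):
    zero out update components whose sign disagrees with the current gradient,
    then rescale by the mean mask value so the effective step size is kept."""
    mask = (update * grad > 0).astype(update.dtype)  # 1 where signs agree
    mask /= max(mask.mean(), eps)                    # rescale surviving entries
    return update * mask

# Toy example: momentum-style update u vs. current gradient g.
u = np.array([0.5, -0.2, 0.3])
g = np.array([1.0, 0.1, -0.4])   # last two components disagree in sign with u
print(cautious_update(u, g))     # only the first component survives, rescaled
```

In a real optimizer this mask would be applied to the momentum buffer (or final update) just before the parameter step, which is why the paper can describe it as a one-line change to any momentum-based method.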