We propose GradPower, a lightweight gradient-transformation technique for accelerating language model pre-training. Given a gradient vector $g=(g_i)_i$, GradPower first applies the elementwise sign-power transformation: $\varphi_p(g)=({\rm sign}(g_i)|g_i|^p)_{i}$ for a fixed $p>0$, and then feeds the transformed gradient into a base optimizer. Notably, GradPower requires only a single-line code change and no modifications to the base optimizer's internal logic, including the hyperparameters. When applied to Adam (termed AdamPower), GradPower consistently achieves lower terminal loss across diverse architectures (LLaMA, Qwen2MoE), parameter scales (66M to 2B), datasets (C4, OpenWebText), and learning-rate schedules (cosine, warmup-stable-decay). The most pronounced gains are observed when training modern mixture-of-experts models with warmup-stable-decay schedules. GradPower also integrates seamlessly with other state-of-the-art optimizers, such as Muon, yielding further improvements. Finally, we provide theoretical analyses that reveal the underlying mechanism of GradPower and highlight the influence of gradient noise.
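As an illustration of the "single-line change" described above, the following is a minimal sketch in plain Python. It is not the authors' implementation. The helper names `sign_power`, `adam_step`, and `adam_power_step`, the dict-based optimizer state, and the default hyperparameter values are all assumptions made for this example. The abstract only states that $p>0$ is fixed, so any value of `p` used here is illustrative.

```python
import math


def sign_power(grad, p):
    """Elementwise sign-power transform: sign(g_i) * |g_i|**p, for p > 0."""
    return [math.copysign(abs(g) ** p, g) for g in grad]


def adam_step(params, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update on a flat list of floats (hypothetical helper).

    `state` holds the step count "t" and the moment estimates "m" and "v".
    """
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (w, g) in enumerate(zip(params, grad)):
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = state["v"][i] / (1 - beta2 ** t)  # bias-corrected second moment
        new_params.append(w - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params


def adam_power_step(params, grad, state, p, **adam_kwargs):
    """Adam with the gradient passed through sign_power first.

    The only difference from adam_step is the transformed gradient; the
    optimizer logic and its hyperparameters are left untouched.
    """
    return adam_step(params, sign_power(grad, p), state, **adam_kwargs)


def init_state(n):
    """Fresh optimizer state for n parameters (hypothetical helper)."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n}
```

With `p = 1` the transform is the identity and the sketch reduces to plain Adam. Other values of `p` rescale each gradient coordinate's magnitude while preserving its sign, before the base optimizer sees it.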