We propose GradPower, a lightweight gradient-transformation technique for accelerating language model pre-training. Given a gradient vector $g=(g_i)_i$, GradPower first applies the elementwise sign-power transformation: $\varphi_p(g)=({\rm sign}(g_i)|g_i|^p)_{i}$ for a fixed $p>0$, and then feeds the transformed gradient into a base optimizer. Notably, GradPower requires only a single-line code change and no modifications to the base optimizer's internal logic, including the hyperparameters. When applied to Adam (termed AdamPower), GradPower consistently achieves lower terminal loss across diverse architectures (LLaMA, Qwen2MoE), parameter scales (66M to 2B), datasets (C4, OpenWebText), and learning-rate schedules (cosine, warmup-stable-decay). The most pronounced gains are observed when training modern mixture-of-experts models with warmup-stable-decay schedules. GradPower also integrates seamlessly with other state-of-the-art optimizers, such as Muon, yielding further improvements. Finally, we provide theoretical analyses that reveal the underlying mechanism of GradPower and highlight the influence of gradient noise.
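The transformation $\varphi_p$ and the "single-line change" can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the `Adam` class below is a simplified textbook Adam written only for this example, the helper names `sign_power` and `adam_power_step` are hypothetical, and the default `p=1.2` is an arbitrary illustrative value that the abstract does not specify.

```python
import math


def sign_power(grad, p):
    """Elementwise sign-power transform: sign(g_i) * |g_i|**p, for p > 0."""
    if p <= 0:
        raise ValueError("p must be positive")
    return [math.copysign(abs(g) ** p, g) for g in grad]


class Adam:
    """Simplified textbook Adam over a flat list of floats (illustration only)."""

    def __init__(self, n, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [0.0] * n  # first-moment estimate
        self.v = [0.0] * n  # second-moment estimate
        self.t = 0          # step counter for bias correction

    def step(self, params, grad):
        self.t += 1
        updated = []
        for i, (x, g) in enumerate(zip(params, grad)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            updated.append(x - self.lr * m_hat / (math.sqrt(v_hat) + self.eps))
        return updated


def adam_power_step(opt, params, grad, p=1.2):
    """One AdamPower-style step; p=1.2 is an assumed value, not from the abstract."""
    # The only difference from plain Adam: transform the gradient first.
    return opt.step(params, sign_power(grad, p))
```

The optimizer's internal logic and hyperparameters are left untouched; only the gradient passed to `step` changes. Setting `p=1` makes `sign_power` the identity, so `adam_power_step` reduces to a plain Adam step.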