In practice, the hyperparameters $(\beta_1, \beta_2)$ and the weight decay $\lambda$ in AdamW are typically kept at fixed values. Is there any reason to do otherwise? We show that for large-scale language model training, the answer is yes: by exploiting the power-law structure of language data, one can design time-varying schedules for $(\beta_1, \beta_2, \lambda)$ that deliver substantial performance gains. We study logarithmic-time scheduling, in which the optimizer's gradient-memory horizon grows with training time. Although naive variants of this scheme are unstable, we show that suitable damping mechanisms restore stability while preserving the benefits of longer memory. Building on this, we present ADANA, an AdamW-like optimizer that couples log-time schedules with explicit damping to balance stability and performance. We evaluate ADANA empirically across transformer scales (45M to 2.6B parameters), comparing against AdamW, Muon, and AdEMAMix. When properly tuned, ADANA achieves compute-efficiency gains of up to 40% relative to a tuned AdamW, and these gains persist, and even improve, as model scale increases. We further show that similar benefits arise when applying logarithmic-time scheduling to AdEMAMix, and that logarithmic-time weight decay alone can yield significant improvements. Finally, we present variants of ADANA that mitigate potential failure modes and improve robustness.
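As a rough illustration of the core idea (a hedged sketch, not the ADANA schedule itself), a gradient-memory horizon that grows with training time can be obtained by setting $\beta_t = 1 - 1/(t + t_0)$, so the EMA timescale grows roughly linearly in the step count, i.e., uniformly on a log-time axis. The caps below are a crude stand-in for a stabilizing mechanism; the constants `t0`, `beta1_max`, and `beta2_max` are illustrative assumptions, not values from this work.

```python
def log_time_betas(t, t0=100.0, beta1_max=0.99, beta2_max=0.9999):
    """Hypothetical log-time schedule for AdamW's (beta1, beta2).

    The uncapped value beta = 1 - 1/(t + t0) makes the EMA's effective
    memory horizon ~(t + t0) steps, i.e., growing with training time.
    Capping each beta at a fixed maximum is a crude damping stand-in
    that keeps the horizon from growing without bound.
    """
    beta = 1.0 - 1.0 / (t + t0)  # horizon grows linearly in t
    return min(beta, beta1_max), min(beta, beta2_max)
```

Early in training both betas track the growing horizon; once the cap is reached, the schedule reduces to a standard fixed-beta AdamW, so instability from an ever-longer memory is avoided at the cost of eventually forgetting.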