Adam has been at the core of large-scale training for almost a decade, yet a simple empirical fact remains unexplained: both validation scores and the qualitative behavior of training runs improve when the momentum parameters satisfy $\beta_1 = \beta_2$. Several recent studies have reported this pattern, but none explains why the choice helps. We show that it is closely tied to a structural property that we call \textit{gradient scale invariance}. We formalize this notion and prove that Adam's update is gradient scale invariant to first order if and only if $\beta_1 = \beta_2$. This perspective places the balanced regime of Adam in direct alignment with the design principles underlying several recent optimizers that explicitly enforce scale-robust updates. The theory is supported by experiments on vision and language tasks across different architectural families, in which rescaling the gradient has a markedly smoother effect on the update when $\beta_1 = \beta_2$. Overall, our results give a coherent answer to an open question about the behavior of Adam and provide a simple principle to guide the design of future optimizers.
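To make the first-order claim concrete, the following is a minimal worked sketch under illustrative assumptions of our own that are not taken from the paper: the moment estimates have equilibrated on a constant gradient, so $m = g$ and $v = g^2$, only the current gradient is rescaled by a factor $c$, and the $\epsilon$ term is dropped. Writing one Adam step as
\[
m(c) = \beta_1 m + (1-\beta_1)\, c\, g, \qquad
v(c) = \beta_2 v + (1-\beta_2)\, c^2 g^2, \qquad
u(c) = \frac{m(c)}{\sqrt{v(c)}},
\]
the sensitivity of the update to the rescaling is
\[
\left.\frac{\mathrm{d} u}{\mathrm{d} c}\right|_{c=1}
= \frac{(1-\beta_1)\, g}{\sqrt{v(1)}} - \frac{(1-\beta_2)\, m(1)\, g^2}{v(1)^{3/2}}
= (\beta_2 - \beta_1)\,\operatorname{sign}(g),
\]
which vanishes exactly when $\beta_1 = \beta_2$, in line with the claim above; the paper's formal definition of gradient scale invariance may differ from this simplified probe.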