Adaptive optimizers that combine preconditioning, momentum, and weight decay (Adam and AdamW) are, under Polyak-Ruppert averaging, candidate engines for one-pass inference. Does the averaged iterate retain the classical Polyak-Ruppert central limit theorem (CLT), with sandwich covariance $H^{-1}SH^{-1}$ (Hessian $H$, gradient covariance $S$), under momentum and a non-convergent preconditioner? The preconditioner-only analysis does not carry over: with momentum, the canonical decomposition collapses to a tautology. Treating the augmented state (iterate, momentum buffer) as a time-varying linear stochastic approximation (SA), we prove, under local stabilization, positive drift stability, a non-autonomous Polyak-Ruppert CLT, and a projection identity. The upshot: the iterate-marginal covariance is exactly the plain stochastic gradient descent (SGD) sandwich $H^{-1}SH^{-1}$, so adaptivity is asymptotically invisible. The result holds for SA-Adam, whose momentum gain vanishes sub-linearly, $\gamma\in(\alpha,1)$ (the sub-linear regime is essential), and not for Adam as deployed with constant $\beta$. Coupled $L_2$ weight decay yields the ridge-penalized sandwich, extending one-pass inference to regularized problems.
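For orientation, the classical statement at issue can be written out as a sketch. The symbols $\theta_k$ (iterates), $\bar\theta_n$ (averaged iterate), $\theta^\star$ (minimizer), and $n$ (number of steps) are notational assumptions introduced here for illustration; $H$ and $S$ are as defined above.

$$
\bar\theta_n = \frac{1}{n}\sum_{k=1}^{n}\theta_k,
\qquad
\sqrt{n}\,\bigl(\bar\theta_n - \theta^\star\bigr) \xrightarrow{\;d\;} \mathcal{N}\bigl(0,\; H^{-1} S H^{-1}\bigr).
$$

Under coupled $L_2$ weight decay with a coefficient written here as $\lambda$ (again a hypothetical symbol), one natural reading of the ridge-penalized sandwich is that the limit is centered at the penalized minimizer $\theta^\star_\lambda$, with $H$ and $S$ evaluated there and the curvature shifted by the penalty:

$$
\sqrt{n}\,\bigl(\bar\theta_n - \theta^\star_\lambda\bigr) \xrightarrow{\;d\;} \mathcal{N}\bigl(0,\; (H+\lambda I)^{-1} S\, (H+\lambda I)^{-1}\bigr).
$$

The penalty term is deterministic, so under this reading it changes the curvature but contributes nothing to the gradient covariance $S$.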