We develop a worst-case complexity theory for stochastically preconditioned stochastic gradient descent (SPSGD) and its accelerated variants under heavy-tailed noise, a setting that encompasses widely used adaptive methods such as Adam, RMSProp, and Shampoo. We assume the stochastic gradient noise has a finite $p$-th moment for some $p \in (1,2]$ and measure convergence after $T$ iterations. Although clipping and normalization are two standard tools for stabilizing SGD training under heavy-tailed noise, we establish a fundamental separation in their worst-case behavior in the stochastically preconditioned setting. We show that normalization guarantees convergence to a first-order stationary point at rate $\mathcal{O}(T^{-\frac{p-1}{3p-2}})$ when the problem parameters are known and $\mathcal{O}(T^{-\frac{p-1}{2p}})$ when they are unknown, matching the corresponding optimal rates for normalized SGD. In contrast, we prove that clipping can fail to converge in the worst case because of the statistical dependence between the stochastic preconditioner and the gradient estimates. To enable the analysis, we develop a novel vector-valued Burkholder-type inequality that may be of independent interest. These results provide a theoretical explanation for the empirical preference for normalization over clipping in large-scale model training.
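For concreteness, one common way to formalize the two update rules being compared is sketched below; the notation is illustrative rather than taken from the paper, with $x_t$ the iterate, $g_t$ the stochastic gradient, $P_t$ the stochastic preconditioner, $\eta_t$ the step size, and $\tau_t$ a clipping threshold.
\begin{align*}
  \text{normalized SPSGD:} \quad
    x_{t+1} &= x_t - \eta_t \,\frac{P_t g_t}{\lVert P_t g_t \rVert}, \\
  \text{clipped SPSGD:} \quad
    x_{t+1} &= x_t - \eta_t \, P_t \,
      \min\!\Bigl(1, \frac{\tau_t}{\lVert g_t \rVert}\Bigr) g_t .
\end{align*}
Under this illustrative formalization, the normalized step always has bounded norm, whereas the clipped step is still multiplied by $P_t$, which is statistically coupled with $g_t$; this coupling is the dependence identified above as the source of possible non-convergence for clipping.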