Heavy-tailed noise in nonconvex stochastic optimization has attracted increasing research interest, as empirical studies, including those on training attention models, suggest that it is a more realistic model of gradient noise. This paper studies first-order nonconvex stochastic optimization under heavy-tailed gradient noise in a decentralized setup, where each node can communicate only with its direct neighbors in a predefined graph. Specifically, we consider a class of heavy-tailed gradient noise that is zero-mean and is assumed to have only a finite $p$-th moment for some $p \in (1, 2]$. We propose GT-NSGDm (Gradient Tracking based Normalized Stochastic Gradient Descent with momentum), which combines normalization with gradient tracking and momentum to cope with heavy-tailed noise on distributed nodes. We show that, when the communication graph admits primitive and doubly stochastic weights, GT-NSGDm guarantees, for the \textit{first} time in the literature, that the expected gradient norm converges at an optimal non-asymptotic rate $O\big(1/T^{(p-1)/(3p-2)}\big)$, which matches the lower bound in the centralized setup. When the tail index $p$ is unknown, GT-NSGDm attains a non-asymptotic rate $O\big( 1/T^{(p-1)/(2p)} \big)$ that is, for $p < 2$, topology independent and has a speedup factor $n^{1-1/p}$ in terms of the number of nodes $n$. Finally, experiments on nonconvex linear regression with tokenized synthetic data and on decentralized training of language models on a real-world corpus demonstrate that GT-NSGDm is more robust and efficient than the baselines.
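To make the three ingredients named in the abstract concrete (local momentum, gradient tracking over a doubly stochastic mixing matrix, and a normalized step), the following is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's reference implementation: the abstract does not give the update equations, so the ordering of the steps here is one plausible arrangement, and the ring topology, the quadratic toy objective $f_i(x) = \tfrac{1}{2}\|x - b_i\|^2$, the Student-t noise model, and all names (`ring_weights`, `gt_nsgdm_sketch`, `alpha`, `beta`, `df`) are hypothetical choices made for the example.

```python
import numpy as np


def ring_weights(n):
    """Mixing matrix for a ring of n >= 3 nodes (assumed topology).

    Each node puts weight 1/3 on itself and on each of its two neighbors,
    which makes the matrix symmetric, doubly stochastic, and primitive.
    """
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] += 1.0 / 3.0
    return W


def gt_nsgdm_sketch(W, b, T=500, alpha=0.01, beta=0.9, df=1.5, seed=0):
    """Illustrative normalized SGD with momentum and gradient tracking.

    Assumptions (not taken from the source): node i holds the toy objective
    0.5 * ||x - b[i]||^2, and gradient noise is Student-t with `df` degrees
    of freedom (zero mean for df > 1, infinite variance for df <= 2).
    """
    rng = np.random.default_rng(seed)
    n, d = b.shape
    x = np.zeros((n, d))  # row i is node i's local iterate
    m = np.zeros((n, d))  # local momentum (averaged stochastic gradients)
    y = np.zeros((n, d))  # tracker of the network-average momentum
    for _ in range(T):
        # Heavy-tailed stochastic gradient of the toy local objective.
        g = (x - b) + rng.standard_t(df, size=(n, d))
        # Local momentum update.
        m_new = beta * m + (1.0 - beta) * g
        # Gradient tracking: mix the tracker plus the momentum increment.
        y = W @ (y + m_new - m)
        m = m_new
        # Normalized step: every node moves exactly alpha before mixing,
        # so a single huge noise sample cannot blow up the iterate.
        norms = np.linalg.norm(y, axis=1, keepdims=True)
        x = W @ (x - alpha * y / np.maximum(norms, 1e-12))
    return x, y, m


# Example usage on 8 nodes in 5 dimensions.
W = ring_weights(8)
b = np.random.default_rng(1).normal(size=(8, 5))
x, y, m = gt_nsgdm_sketch(W, b)
print("distance of average iterate to minimizer:",
      np.linalg.norm(x.mean(axis=0) - b.mean(axis=0)))
```

Because `W` is doubly stochastic and both `y` and `m` start at zero, the average of the trackers equals the average of the local momenta at every iteration. That invariant is what lets each node normalize an estimate of the global direction rather than its own noisy local gradient.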