Several recent works have studied the convergence \textit{in high probability} of stochastic gradient descent (SGD) and its clipped variant. Compared to vanilla SGD, clipped SGD is more stable in practice and has the additional theoretical benefit of a logarithmic dependence on the failure probability. However, the convergence of other practical nonlinear variants of SGD, such as sign SGD, quantized SGD, and normalized SGD, which achieve improved communication efficiency or accelerated convergence, is much less understood. In this work, we study convergence bounds \textit{in high probability} for a broad class of nonlinear SGD methods. For strongly convex loss functions with Lipschitz continuous gradients, we prove a logarithmic dependence on the failure probability, even when the noise is heavy-tailed. Our results are strictly more general than those for clipped SGD: they hold for any nonlinearity with bounded (component-wise or joint) outputs, such as clipping, normalization, and quantization. Further, existing results under heavy-tailed noise assume bounded $\eta$-th central moments, with $\eta \in (1,2]$. In contrast, our refined analysis works even for $\eta=1$, strictly relaxing the noise moment assumptions in the literature.
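To make the class of methods concrete, the following is a minimal sketch of nonlinear SGD with three bounded nonlinearities (joint clipping, normalization, and component-wise sign). It is an illustration under stated assumptions, not the method analyzed in the paper: the update form $x_{t+1} = x_t - \alpha_t \Psi(g_t)$, the quadratic test loss, the step size $\alpha_t = a/(t+1)$, the Student-t noise model, and all function names are hypothetical choices made here.

```python
import numpy as np


def clip(g, tau=1.0):
    """Joint clipping: rescale g so that its Euclidean norm is at most tau."""
    norm = np.linalg.norm(g)
    return g if norm <= tau else g * (tau / norm)


def normalize(g, eps=1e-12):
    """Normalization: g / ||g||, so the output norm never exceeds 1."""
    return g / (np.linalg.norm(g) + eps)


def sign(g):
    """Component-wise sign: every output coordinate lies in {-1, 0, 1}."""
    return np.sign(g)


def nonlinear_sgd(psi, x0, steps=2000, a=1.0, seed=0):
    """Run x_{t+1} = x_t - alpha_t * psi(g_t) on f(x) = 0.5 * ||x||^2.

    Assumptions (illustrative only): g_t is the exact gradient x_t plus
    Student-t noise with 1.5 degrees of freedom, which has a finite first
    moment but infinite variance; the step size is alpha_t = a / (t + 1).
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for t in range(steps):
        noise = rng.standard_t(df=1.5, size=x.shape)  # heavy-tailed noise
        g = x + noise                # stochastic gradient of 0.5 * ||x||^2
        alpha = a / (t + 1)          # decaying step size (assumed schedule)
        x = x - alpha * psi(g)       # bounded nonlinearity applied to g
    return x
```

For instance, `nonlinear_sgd(sign, x0=[5.0, -5.0])` runs the sign variant from a fixed starting point. Because each nonlinearity has bounded output, a single extreme noise sample can move the iterate by at most $\alpha_t$ times that bound, which is the intuition behind the robustness to heavy tails described in the abstract.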