Sharp High-Probability Rates for Nonlinear SGD under Heavy-Tailed Noise via Symmetrization

We study high-probability convergence of SGD-type methods in non-convex optimization in the presence of heavy-tailed noise. To combat the heavy-tailed noise, we consider a general black-box nonlinear framework that subsumes nonlinearities such as sign, clipping, normalization, and their smooth counterparts. Our first result shows that nonlinear SGD (N-SGD) achieves the rate $\widetilde{\mathcal{O}}(t^{-1/2})$ for any noise with unbounded moments and a symmetric probability density function (PDF). Crucially, N-SGD has exponentially decaying tails, matching the performance of linear SGD under light-tailed noise. To handle non-symmetric noise, we propose two novel estimators based on the idea of noise symmetrization. The first, dubbed Symmetrized Gradient Estimator (SGE), assumes that a noiseless gradient at any reference point is available at the start of training, while the second, dubbed Mini-batch SGE (MSGE), uses mini-batches to estimate the noiseless gradient. Combined with the nonlinear framework, these estimators yield the N-SGE and N-MSGE methods, respectively. Both achieve the same convergence rate and exponentially decaying tails as N-SGD, while allowing for non-symmetric noise with unbounded moments and a PDF satisfying a mild technical condition; N-MSGE additionally requires a bounded noise moment of order $p \in (1,2]$. Compared to works assuming noise with a bounded $p$-th moment, our results: 1) are based on a novel symmetrization approach; 2) provide a unified framework and relaxed moment conditions; 3) imply optimal oracle complexity of N-SGD and N-SGE, strictly better than existing works when $p < 2$, while the complexity of N-MSGE is close to that of existing works. Compared to works assuming symmetric noise with unbounded moments, we: 1) provide a sharper analysis and improved rates; 2) allow for state-dependent symmetric noise; 3) extend the strong guarantees to non-symmetric noise.
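To make the nonlinear framework concrete, the following is a minimal illustrative sketch, not the authors' implementation. It shows three of the nonlinearities named above (sign, clipping, normalization) applied to a noisy gradient inside an SGD loop, and the elementary fact behind noise symmetrization: the difference of two independent draws from the same distribution is symmetric about zero, even when the distribution itself is skewed. The toy objective, the Cauchy noise, the step-size schedule `a / sqrt(t + 1)`, and all function names are assumptions chosen for illustration; the abstract does not specify the exact form of SGE or MSGE, so neither is reproduced here.

```python
import numpy as np


def sign_nl(g):
    """Component-wise sign nonlinearity."""
    return np.sign(g)


def clip_nl(g, tau=1.0):
    """Norm clipping: rescale g so that its Euclidean norm is at most tau."""
    norm = np.linalg.norm(g)
    return g if norm <= tau else g * (tau / norm)


def normalize_nl(g, eps=1e-12):
    """Normalization: rescale g to (approximately) unit Euclidean norm."""
    return g / (np.linalg.norm(g) + eps)


def grad_f(x):
    """Gradient of the toy smooth non-convex objective
    f(x) = sum_i x_i^2 / (1 + x_i^2) (an assumption, not from the paper)."""
    return 2.0 * x / (1.0 + x**2) ** 2


def n_sgd(x0, nonlinearity, steps=2000, a=0.5, seed=0):
    """Nonlinear SGD sketch: x_{t+1} = x_t - alpha_t * Psi(grad + noise).

    The noise is standard Cauchy (symmetric PDF, no finite mean), and
    alpha_t = a / sqrt(t + 1) is an assumed schedule.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for t in range(steps):
        noise = rng.standard_cauchy(size=x.shape)
        g = grad_f(x) + noise
        x = x - a / np.sqrt(t + 1.0) * nonlinearity(g)
    return x


def symmetrized_samples(n=200_000, seed=0):
    """Difference of two i.i.d. draws from a skewed heavy-tailed law.

    Each draw is Pareto II (Lomax) with shape 1.5, which is one-sided and
    has infinite variance; the difference z1 - z2 is symmetric about zero.
    """
    rng = np.random.default_rng(seed)
    z1 = rng.pareto(1.5, size=n)
    z2 = rng.pareto(1.5, size=n)
    return z1 - z2
```

A call such as `n_sgd([1.0, -2.0], clip_nl)` runs the clipped variant, and swapping in `sign_nl` or `normalize_nl` changes only the nonlinearity, which is the sense in which the framework treats it as a black box. The fraction of positive entries in `symmetrized_samples()` should be close to one half, reflecting the symmetry of the difference.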
