Several recent studies consider stochastic optimization in a heavy-tailed noise regime, i.e., the difference between the stochastic gradient and the true gradient is assumed to have a finite $p$-th moment (say, upper bounded by $\sigma^{p}$ for some $\sigma\geq0$) where $p\in(1,2]$. This assumption not only generalizes the traditional finite variance assumption ($p=2$) but has also been observed in practice for several different tasks. Under this challenging assumption, much progress has been made for both convex and nonconvex problems; however, most of these works consider only smooth objectives. In contrast, the problem has not been fully explored or well understood when the functions are nonsmooth. This paper aims to fill this crucial gap by providing a comprehensive analysis of stochastic nonsmooth convex optimization with heavy-tailed noise. We revisit a simple clipping-based algorithm, which so far has only been proved to converge in expectation, and only under an additional strong convexity assumption. Under appropriate choices of parameters, for both convex and strongly convex functions, we not only establish the first high-probability rates but also give refined in-expectation bounds compared with existing works. Remarkably, all of our results are optimal (or nearly optimal up to logarithmic factors) with respect to the time horizon $T$, even when $T$ is unknown in advance. Additionally, we show how to make the algorithm parameter-free with respect to $\sigma$; in other words, the algorithm can still guarantee convergence without any prior knowledge of $\sigma$.
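To make the setting concrete, the following is a minimal sketch of a clipped stochastic subgradient method of the general form $x_{t+1} = x_t - \eta_t\,\mathrm{clip}(g_t, \tau_t)$ on a nonsmooth convex objective. It is an illustration under stated assumptions, not the algorithm or parameter schedule analyzed in the paper: the $\ell_1$ objective, the symmetric Pareto-type noise, the schedules $\tau_t = \max(2G, t^{1/p})$ and $\eta_t = 1/\tau_t$, the averaged output, and all function names are hypothetical choices made only for this example.

```python
import math
import random


def clip(g, tau):
    """Rescale g so that its Euclidean norm is at most tau."""
    norm = math.sqrt(sum(v * v for v in g))
    if norm <= tau:
        return list(g)
    return [v * (tau / norm) for v in g]


def l1_objective(x, x_star):
    """Nonsmooth convex test objective f(x) = ||x - x_star||_1 (an assumption)."""
    return sum(abs(a - b) for a, b in zip(x, x_star))


def noisy_subgradient(x, x_star, rng, tail_index):
    """Subgradient of the l1 objective plus zero-mean heavy-tailed noise.

    The noise is a symmetrized, shifted Pareto variable; its q-th moment is
    finite only for q < tail_index, so tail_index < 2 gives infinite variance.
    """
    g = []
    for a, b in zip(x, x_star):
        sub = (a > b) - (a < b)  # sign(a - b), with 0 at the kink
        noise = rng.choice((-1.0, 1.0)) * (rng.paretovariate(tail_index) - 1.0)
        g.append(sub + noise)
    return g


def clipped_sgd(x0, x_star, steps, p=1.5, tail_index=1.8, seed=0):
    """Run clipped stochastic subgradient descent and return the averaged iterate.

    The schedules below are illustrative assumptions, not the paper's choices.
    """
    rng = random.Random(seed)
    x = list(x0)
    avg = [0.0] * len(x0)
    lipschitz = math.sqrt(len(x0))  # Euclidean norm bound on l1 subgradients
    for t in range(1, steps + 1):
        tau = max(2.0 * lipschitz, t ** (1.0 / p))  # growing clipping threshold
        eta = 1.0 / tau                             # decaying step size
        g = clip(noisy_subgradient(x, x_star, rng, tail_index), tau)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
        avg = [a + (xi - a) / t for a, xi in zip(avg, x)]  # running mean
    return avg


if __name__ == "__main__":
    x_star = [1.0] * 5
    x0 = [0.0] * 5
    x_avg = clipped_sgd(x0, x_star, steps=5000)
    print("f(x0)    =", l1_objective(x0, x_star))
    print("f(x_avg) =", l1_objective(x_avg, x_star))
```

Because each update moves the iterate by at most $\eta_t \tau_t = 1$ in Euclidean norm, a single extreme noise sample cannot throw the iterate arbitrarily far, which is the intuition behind clipping in the heavy-tailed regime. Neither schedule here requires knowing $T$ in advance.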