Gradient clipping is a widely used technique for stabilizing the training of neural networks. A growing body of work also shows that it is a promising way to handle the heavy-tailed behavior that emerges in stochastic optimization. Despite its practical importance, theoretical guarantees for gradient clipping remain scarce: most existing results provide only in-expectation analyses and address only optimization performance. In this paper, we provide a high-probability analysis in the non-convex setting and simultaneously derive optimization and generalization bounds for popular stochastic optimization algorithms with gradient clipping, including stochastic gradient descent and its variants with momentum and adaptive step sizes. With gradient clipping, we work under a heavy-tailed assumption in which the gradients have only bounded $\alpha$-th moments for some $\alpha \in (1, 2]$, which is much weaker than the standard bounded second-moment assumption. Overall, our study provides a relatively complete picture of the theoretical guarantees for stochastic optimization algorithms with clipping.
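
For readers unfamiliar with the setting, the following is a minimal sketch of SGD with norm-based gradient clipping under heavy-tailed noise. It is an illustration only and not the algorithm or analysis of the paper: the function names (`clip_by_norm`, `clipped_sgd`, `noisy_grad`), the threshold `tau`, the step size `lr`, the quadratic objective, and the Student-t noise model are all assumptions chosen for the example. Student-t noise with 1.5 degrees of freedom has finite moments only of order below 1.5, so its variance is infinite while an $\alpha$-th moment with $\alpha \in (1, 1.5)$ is bounded.

```python
import numpy as np


def clip_by_norm(g, tau):
    """Rescale g so that its Euclidean norm is at most tau, keeping its direction."""
    norm = np.linalg.norm(g)
    if norm <= tau:
        return g
    return g * (tau / norm)


def clipped_sgd(grad_fn, x0, lr, tau, steps, rng):
    """Run SGD where every stochastic gradient is norm-clipped before the update."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(steps):
        g = grad_fn(x, rng)                 # stochastic gradient, possibly heavy-tailed
        x = x - lr * clip_by_norm(g, tau)   # each step moves x by at most lr * tau
    return x


def noisy_grad(x, rng):
    """Hypothetical oracle: gradient of 0.5 * ||x||^2 plus Student-t noise (df=1.5).

    The noise has infinite variance but finite moments of order below 1.5.
    """
    return x + rng.standard_t(df=1.5, size=x.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_final = clipped_sgd(noisy_grad, x0=np.ones(10), lr=0.01, tau=1.0,
                          steps=1000, rng=rng)
    print("final distance to the minimizer:", np.linalg.norm(x_final))
```

Clipping by norm bounds the length of every update by `lr * tau`, so a single extreme gradient sample cannot move the iterate arbitrarily far. This is the intuition for why clipping helps when second moments are unbounded.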