Adaptive Strategies in Non-convex Optimization

An algorithm is said to be adaptive to a problem parameter if it does not need a priori knowledge of that parameter yet performs competitively with algorithms that know it. This dissertation presents our work on adaptive algorithms in the following scenarios: 1. In the stochastic optimization setting, we only receive stochastic gradients, and the level of noise in evaluating them greatly affects the convergence rate. Without prior knowledge of the noise scale, tuning is typically required to achieve the optimal rate. Considering this, we designed and analyzed noise-adaptive algorithms that automatically ensure (near-)optimal rates under different noise scales without knowing them. 2. In training deep neural networks, gradient magnitudes can vary across a very wide range from coordinate to coordinate unless normalization techniques, like BatchNorm, are employed. In such situations, algorithms that do not address this disparity in gradient scales can behave very poorly. To mitigate this, we formally established the advantage of scale-free algorithms that adapt to the gradient scales and demonstrated their benefits in empirical experiments. 3. Traditional analyses in non-convex optimization typically rely on the smoothness assumption. Yet this condition does not capture the properties of some deep learning objective functions, including those involving Long Short-Term Memory networks and Transformers. Instead, these objectives satisfy a much more relaxed condition under which the smoothness is potentially unbounded. Under this condition, we show that a generalized SignSGD algorithm theoretically matches the best-known convergence rates obtained by SGD with gradient clipping without needing explicit clipping at all, and that it empirically matches the performance of Adam and beats other methods. Moreover, it can be made to adapt automatically to the unknown relaxed smoothness.
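To make the second scenario concrete, the following is a minimal sketch of why insensitivity to gradient scale matters. It is not one of the algorithms developed in this dissertation. It assumes a toy separable quadratic whose coordinates have very different curvature, and it compares plain gradient descent with a plain sign-based update. All names (`sgd_step`, `sign_step`, `run`, `SCALES`) and the step size are hypothetical choices made for illustration.

```python
# Toy objective (an assumption for illustration): f(x) = 0.5 * sum_i s_i * x_i**2,
# so the i-th gradient coordinate is s_i * x_i and the s_i span six orders of magnitude.

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def grad(x, scales):
    """Exact gradient of the toy quadratic."""
    return [s * xi for s, xi in zip(scales, x)]

def sgd_step(x, g, lr):
    """Plain gradient step: the move is proportional to the gradient magnitude."""
    return [xi - lr * gi for xi, gi in zip(x, g)]

def sign_step(x, g, lr):
    """Sign-based step: the move depends only on the gradient's sign, not its scale."""
    return [xi - lr * sign(gi) for xi, gi in zip(x, g)]

def run(step, scales, lr=0.01, steps=200):
    """Run `steps` iterations of `step` from the all-ones starting point."""
    x = [1.0] * len(scales)
    for _ in range(steps):
        x = step(x, grad(x, scales), lr)
    return x

SCALES = [1e-3, 1.0, 1e3]  # hypothetical per-coordinate gradient scales

sgd_final = run(sgd_step, SCALES)
sign_final = run(sign_step, SCALES)

print("gradient descent:", sgd_final)
print("sign update:     ", sign_final)
```

With a single shared step size, gradient descent barely moves the flat coordinate and blows up on the steep one, whereas the sign-based update follows the same trajectory on every coordinate because rescaling a gradient by a positive constant leaves its sign unchanged. This sketch only illustrates the scale-free property and makes no claim about convergence rates.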
