Adaptive Strategies in Non-convex Optimization

An algorithm is said to be adaptive to a certain parameter (of the problem) if it does not need a priori knowledge of that parameter but performs competitively with algorithms that know it. This dissertation presents our work on adaptive algorithms in the following scenarios: 1. In the stochastic optimization setting, we only receive stochastic gradients, and the level of noise in evaluating them greatly affects the convergence rate. Without prior knowledge of the noise scale, tuning is typically required to achieve the optimal rate. Considering this, we designed and analyzed noise-adaptive algorithms that automatically ensure (near-)optimal rates under different noise scales without knowing them. 2. In training deep neural networks, the gradient magnitudes of different coordinates can spread across a very wide range unless normalization techniques, like BatchNorm, are employed. In such situations, algorithms that do not address this disparity in gradient scales can behave very poorly. To mitigate this, we formally established the advantage of scale-free algorithms that adapt to the gradient scales and demonstrated their benefits in empirical experiments. 3. Traditional analyses in non-convex optimization typically rely on the smoothness assumption. Yet this condition does not capture the properties of some deep learning objective functions, including those involving Long Short-Term Memory networks and Transformers. Instead, these objectives satisfy a much more relaxed condition, with potentially unbounded smoothness. Under this condition, we show that a generalized SignSGD algorithm can theoretically match the best-known convergence rates obtained by SGD with gradient clipping without needing explicit clipping at all, and that it can empirically match the performance of Adam and beat other methods. Moreover, it can also be made to automatically adapt to the unknown relaxed smoothness.
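
As a rough illustration of the third scenario, the sketch below contrasts a plain gradient step, a gradient step with norm clipping, and a sign-based step on a toy objective whose curvature is unbounded. Everything here is an assumption made for illustration: the objective f(x) = sum(x^4), the step sizes, the clipping threshold, and the use of exact rather than stochastic gradients. The sign-based update is the textbook SignSGD rule, not the generalized SignSGD algorithm analyzed in the dissertation.

```python
import numpy as np

def grad(x):
    # Gradient of the toy objective f(x) = sum(x**4) (an illustrative choice).
    # Its curvature 12 * x**2 grows without bound, so no single global
    # smoothness constant exists.
    return 4.0 * x**3

def clipped_sgd_step(x, g, lr, clip):
    # Rescale the gradient so that its Euclidean norm never exceeds `clip`.
    norm = np.linalg.norm(g)
    scale = min(1.0, clip / (norm + 1e-12))
    return x - lr * scale * g

def sign_sgd_step(x, g, lr):
    # Move every coordinate by exactly `lr`, whatever the gradient magnitude,
    # so no explicit clipping threshold is needed.
    return x - lr * np.sign(g)

def run(step, x0, steps, **kwargs):
    # Apply `step` repeatedly, starting from x0, with exact gradients.
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = step(x, grad(x), **kwargs)
    return x

x0 = [10.0, -0.5]

# A single unclipped gradient step from a far-away point overshoots badly.
x_plain = run(lambda x, g, lr: x - lr * g, x0, steps=1, lr=0.1)

# Clipping and sign-based updates both keep the step length bounded.
x_clip = run(clipped_sgd_step, x0, steps=200, lr=0.05, clip=1.0)
x_sign = run(sign_sgd_step, x0, steps=200, lr=0.05)

print("plain GD, 1 step:      ", x_plain)
print("clipped GD, 200 steps: ", x_clip)
print("sign update, 200 steps:", x_sign)
```

In this toy run the unclipped step moves the first coordinate farther from the minimizer, while the clipped and sign-based updates both approach it. The sign-based update ends up oscillating within one step size of the minimizer, which is why practical variants decay the step size or add momentum.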
