Tuning hyperparameters such as the stepsize is a major challenge in training machine learning models. To address it, numerous adaptive optimization algorithms have been developed that achieve near-optimal complexities even when the stepsizes are chosen independently of problem-specific parameters, provided that the loss function is $L$-smooth. However, once this assumption is relaxed to the more realistic $(L_0, L_1)$-smoothness, all existing convergence results still require tuning the stepsize. In this study, we demonstrate that Normalized Stochastic Gradient Descent with Momentum (NSGD-M) can achieve a (nearly) rate-optimal complexity without prior knowledge of any problem parameter, though at the cost of an exponential term in the complexity that depends on $L_1$. We further establish that this exponential term is unavoidable for such schemes by introducing a theoretical framework of lower bounds tailored explicitly to parameter-agnostic algorithms. Interestingly, in the deterministic setting, the exponential factor can be neutralized by employing Gradient Descent with a Backtracking Line Search. To the best of our knowledge, these findings represent the first parameter-agnostic convergence results under the generalized smoothness condition. Our empirical experiments further confirm our theoretical insights.
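To make the two methods named in the abstract concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the usual form of $(L_0, L_1)$-smoothness, $\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|$, and uses the illustrative test function $f(x) = \sum_i \cosh(x_i)$, which satisfies that bound with $L_0 = L_1 = 1$. The schedules $\beta_t = 1 - t^{-1/2}$ and $\eta_t = t^{-3/4}$, the Armijo constants, and all function names are assumptions chosen for illustration; the only property they are meant to show is that no problem parameter enters the stepsize.

```python
import math
import random


def f(x):
    """Illustrative (L0, L1)-smooth objective: sum of cosh(x_i)."""
    return sum(math.cosh(xi) for xi in x)


def grad_f(x):
    """Exact gradient of f."""
    return [math.sinh(xi) for xi in x]


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(vi * vi for vi in v))


def nsgd_m(grad, x0, steps, noise_std=0.0, seed=0):
    """Normalized SGD with momentum (sketch).

    Assumed schedules: beta_t = 1 - t**-0.5 and eta_t = t**-0.75.
    Neither depends on L0, L1, or the noise level.
    """
    rng = random.Random(seed)
    x = list(x0)
    m = [0.0] * len(x)
    for t in range(1, steps + 1):
        # Stochastic gradient: exact gradient plus Gaussian noise.
        g = [gi + rng.gauss(0.0, noise_std) for gi in grad(x)]
        beta = 1.0 - t ** -0.5
        m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
        m_norm = norm(m)
        if m_norm == 0.0:
            break  # no direction left to normalize
        eta = t ** -0.75
        # Step length is exactly eta, whatever the gradient magnitude.
        x = [xi - eta * mi / m_norm for xi, mi in zip(x, m)]
    return x


def gd_backtracking(func, grad, x0, steps, eta0=1.0, shrink=0.5, c=0.5):
    """Deterministic gradient descent with Armijo backtracking (sketch).

    eta0, shrink and c are assumed constants, not tuned to the problem.
    """
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        sq = sum(gi * gi for gi in g)
        if sq == 0.0:
            break  # exact stationary point
        fx = func(x)
        eta = eta0
        # Shrink eta until the sufficient-decrease condition holds.
        while func([xi - eta * gi for xi, gi in zip(x, g)]) > fx - c * eta * sq:
            eta *= shrink
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x


if __name__ == "__main__":
    start = [3.0, -2.0]
    print("initial gradient norm:", norm(grad_f(start)))
    print("NSGD-M:", norm(grad_f(nsgd_m(grad_f, start, steps=2000))))
    print("GD + backtracking:", norm(grad_f(gd_backtracking(f, grad_f, start, steps=200))))
```

Normalizing the momentum keeps each update at length $\eta_t$ no matter how large the gradient is, which is why a fixed schedule remains stable when curvature grows with the gradient norm. The backtracking variant instead probes the objective to find an acceptable stepsize, which is only practical when function values are available without noise.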