Accelerated Gradient Algorithms with Adaptive Subspace Search for Instance-Faster Optimization

Gradient-based minimax optimal algorithms have greatly advanced continuous optimization and machine learning. A seminal work by Yurii Nesterov [Nes83a] established the $\tilde{\mathcal{O}}(\sqrt{L/\mu})$ gradient complexity for minimizing an $L$-smooth, $\mu$-strongly convex objective. However, an ideal algorithm would adapt to the explicit complexity of the particular objective function and attain faster rates on simpler problems. This prompts us to reconsider two defects of existing optimization modeling and analysis: (i) worst-case optimality is neither instance optimality nor optimality in practice; (ii) the traditional $L$-smoothness condition may not be the primary abstraction or characterization for modern practical problems. In this paper, we open up a new way to design and analyze gradient-based algorithms with direct applications in machine learning, including linear regression and beyond. We introduce two factors $(\alpha, \tau_{\alpha})$ that refine the description of the degenerate conditioning of optimization problems, based on the observation that the singular values of the Hessian often drop sharply. We design adaptive algorithms that, without prior knowledge of these factors, solve simpler problems with fewer accesses to the gradient or an analogous oracle. The algorithms also improve the state-of-the-art complexities for several problems in machine learning, thereby solving the open problem of how to design faster algorithms in light of the known complexity lower bounds. Specifically, when the nuclear norm is $\mathcal{O}(1)$-bounded, we achieve an optimal $\tilde{\mathcal{O}}(\mu^{-1/3})$ (vs. $\tilde{\mathcal{O}}(\mu^{-1/2})$) gradient complexity for linear regression. We hope this work prompts a rethinking of how the difficulty of modern optimization problems is understood.
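For orientation, the worst-case baseline mentioned above can be sketched in a few lines. The snippet below is a minimal illustration of the classical constant-momentum form of Nesterov's accelerated gradient method, compared with plain gradient descent; it is not the adaptive subspace-search algorithm of this paper. The test problem (a diagonal quadratic with one large Hessian eigenvalue and many small ones), the function names, and all constants are assumptions chosen only to mimic a sharply dropping spectrum.

```python
import numpy as np

def nesterov_agd(grad, x0, L, mu, iters):
    """Constant-momentum accelerated gradient for an L-smooth, mu-strongly convex objective."""
    sqrt_kappa = np.sqrt(L / mu)
    beta = (sqrt_kappa - 1.0) / (sqrt_kappa + 1.0)  # momentum coefficient
    x = x0.copy()
    y = x0.copy()
    for _ in range(iters):
        x_next = y - grad(y) / L          # gradient step at the extrapolated point
        y = x_next + beta * (x_next - x)  # extrapolation (momentum) step
        x = x_next
    return x

def gradient_descent(grad, x0, L, iters):
    """Plain gradient descent with step size 1/L, for comparison."""
    x = x0.copy()
    for _ in range(iters):
        x = x - grad(x) / L
    return x

# Hypothetical test problem: f(x) = 0.5 * sum(eigs * (x - x_star)**2),
# whose Hessian spectrum drops sharply from L = 1 to mu = 1e-3.
rng = np.random.default_rng(0)
d = 50
L_smooth, mu = 1.0, 1e-3
eigs = np.concatenate(([L_smooth], np.full(d - 1, mu)))
x_star = rng.standard_normal(d)

def grad(x):
    return eigs * (x - x_star)

x0 = np.zeros(d)
iters = 300
err_agd = np.linalg.norm(nesterov_agd(grad, x0, L_smooth, mu, iters) - x_star)
err_gd = np.linalg.norm(gradient_descent(grad, x0, L_smooth, iters) - x_star)
print(f"AGD error: {err_agd:.3e}, GD error: {err_gd:.3e}")
```

With the same iteration budget, the accelerated method contracts at a rate governed by $\sqrt{L/\mu}$ rather than $L/\mu$, which is the gap the worst-case analysis captures. Neither method in this sketch exploits the fact that only one eigenvalue is large.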
