Classical worst-case optimization theory neither explains the success of optimization in machine learning nor helps with step size selection. In this paper we demonstrate the viability and advantages of replacing the classical 'convex function' framework with a 'random function' framework. With complexity $\mathcal{O}(n^3d^3)$, where $n$ is the number of steps and $d$ the number of dimensions, Bayesian optimization with gradients has so far not been viable in high dimension. By bridging the gap between Bayesian optimization (i.e. random function optimization theory) and classical optimization, we establish viability. Specifically, we use a 'stochastic Taylor approximation' to rediscover gradient descent, which is scalable in high dimension due to its $\mathcal{O}(nd)$ complexity. This rediscovery yields a specific step size schedule we call Random Function Descent (RFD). The advantages of this random function framework are that RFD is scale invariant and that it provides a theoretical foundation for common step size heuristics such as gradient clipping and gradual learning rate warmup.
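To make the complexity claim concrete: plain gradient descent evaluates one $d$-dimensional gradient per step, so $n$ steps cost $\mathcal{O}(nd)$, in contrast to the $\mathcal{O}(n^3d^3)$ of gradient-based Bayesian optimization. The sketch below pairs gradient descent with a norm-dependent step size schedule that behaves like gradient clipping for large gradients; the `clipped_step` schedule is a hypothetical illustration of such a heuristic, not the RFD schedule derived in the paper.

```python
import math

def gradient_descent(grad, x0, step_size, n_steps):
    """Plain gradient descent on R^d: each step costs O(d), so n steps cost O(n*d)."""
    x = list(x0)
    for _ in range(n_steps):
        g = grad(x)
        norm = math.sqrt(sum(gi * gi for gi in g))
        eta = step_size(norm)  # step size may depend on the gradient norm
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x

# Hypothetical norm-dependent schedule: a constant step for small gradients
# that shrinks like gradient clipping for large ones (illustration only,
# NOT the RFD schedule derived in the paper).
def clipped_step(g_norm, base=0.1, threshold=1.0):
    return base * min(1.0, threshold / max(g_norm, 1e-12))

# Usage: minimize f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x_opt = gradient_descent(lambda x: x, [3.0, -4.0], clipped_step, 200)
```

Note that `clipped_step` is invariant to the choice of `threshold` units only up to rescaling `base`; the scale invariance claimed for RFD is a stronger property that the schedule is derived from the random function model rather than tuned by hand.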