While gradient-based methods are ubiquitous in machine learning, selecting the right step size often requires "hyperparameter tuning". This is because backtracking procedures like Armijo's rule depend on accurate function evaluations at every step, which are not available in a stochastic context. Since optimization schemes can be motivated using Taylor approximations, we replace the Taylor approximation with the conditional expectation (the best $L^2$ estimator) and propose "Random Function Descent" (RFD). Under mild assumptions common in Bayesian optimization, we prove that RFD is identical to gradient descent, but with calculable step sizes, even in a stochastic context. In synthetic benchmarks, RFD outperforms untuned Adam. To close the remaining performance gap to tuned Adam, we propose a heuristic extension that is competitive with tuned Adam.
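For readers unfamiliar with Armijo's rule, the following is a minimal sketch of classical deterministic backtracking, not of RFD itself. The function name `armijo_step`, its parameters (`t0`, `c`, `shrink`, `max_shrinks`), and the quadratic test objective are illustrative assumptions and do not come from the paper. The sketch shows why the rule needs reliable function values: each candidate step size is accepted or rejected by comparing `f` at two points.

```python
import numpy as np

# Hypothetical test objective (an assumption for illustration):
# an ill-conditioned quadratic f(x) = 0.5 * sum(a_i * x_i^2).
CURVATURE = np.array([1.0, 10.0])

def f(x):
    return 0.5 * np.sum(CURVATURE * x * x)

def grad_f(x):
    return CURVATURE * x

def armijo_step(f, grad_f, x, t0=1.0, c=1e-4, shrink=0.5, max_shrinks=50):
    """One gradient step with Armijo backtracking.

    Starting from step size t0, shrink t until the sufficient-decrease
    condition f(x - t*g) <= f(x) - c * t * ||g||^2 holds.
    """
    g = grad_f(x)
    fx = f(x)
    t = t0
    for _ in range(max_shrinks):
        # This comparison is the part that breaks down when f can only
        # be evaluated with noise (e.g. on mini-batches).
        if f(x - t * g) <= fx - c * t * np.dot(g, g):
            break
        t *= shrink
    return x - t * g, t

x0 = np.array([3.0, -4.0])
x1, t = armijo_step(f, grad_f, x0)
```

If `f` were replaced by a noisy estimate, the acceptance test could pass or fail by chance, which is the obstacle the abstract refers to when it says such evaluations are not available in a stochastic context.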