Large-scale machine learning problems make the cost of hyperparameter tuning ever more prohibitive. This creates a need for algorithms that can tune themselves on-the-fly. We formalize the notion of "tuning-free" algorithms that can match the performance of optimally-tuned optimization algorithms up to polylogarithmic factors given only loose hints on the relevant problem parameters. We consider in particular algorithms that can match optimally-tuned Stochastic Gradient Descent (SGD). When the domain of optimization is bounded, we show tuning-free matching of SGD is possible and achieved by several existing algorithms. We prove that for the task of minimizing a convex and smooth or Lipschitz function over an unbounded domain, tuning-free optimization is impossible. We discuss conditions under which tuning-free optimization is possible even over unbounded domains. In particular, we show that the recently proposed DoG and DoWG algorithms are tuning-free when the noise distribution is sufficiently well-behaved. For the task of finding a stationary point of a smooth and potentially nonconvex function, we give a variant of SGD that matches the best-known high-probability convergence rate for tuned SGD at only an additional polylogarithmic cost. However, we also give an impossibility result that shows no algorithm can hope to match the optimal expected convergence rate for tuned SGD with high probability.
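To make the "tuning-free" notion concrete, below is a minimal sketch (not the paper's own pseudocode) of the distance-over-gradients step-size rule used by DoG, one of the candidate algorithms mentioned above. It assumes the DoG rule of Ivgi, Hinder, and Carmon (2023), where the step size is the maximum distance traveled from the initial point divided by the square root of the accumulated squared gradient norms; the function name `dog_sgd`, the helper `grad_fn`, and the small initial-movement parameter `r_eps` (playing the role of a loose hint rather than a tuned learning rate) are illustrative choices, not notation from this paper.

```python
# Hedged sketch of a DoG-style tuning-free SGD loop; parameter names are illustrative.
import numpy as np

def dog_sgd(grad_fn, x0, r_eps=1e-4, steps=1000):
    x = np.array(x0, dtype=float)      # copy so the reference point x0 is preserved
    r_bar = r_eps                      # largest distance from x0 observed so far
    g_sq_sum = 0.0                     # running sum of squared stochastic gradient norms
    for _ in range(steps):
        g = grad_fn(x)                 # stochastic gradient at the current iterate
        g_sq_sum += float(np.dot(g, g))
        eta = r_bar / np.sqrt(g_sq_sum + 1e-12)   # distance-over-gradients step size
        x = x - eta * g
        r_bar = max(r_bar, float(np.linalg.norm(x - x0)))
    return x

# Toy usage: noisy gradients of f(x) = 0.5 * ||x||^2, no learning rate supplied.
rng = np.random.default_rng(0)
sol = dog_sgd(lambda x: x + 0.01 * rng.standard_normal(x.shape), x0=np.ones(5))
```

The point of the sketch is the interface: the only problem-dependent input is the loose hint `r_eps`, in contrast to tuned SGD, whose step size must encode the distance to the optimum, smoothness or Lipschitz constants, and the noise level.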