Large-scale machine learning problems make the cost of hyperparameter tuning ever more prohibitive, creating a need for algorithms that can tune themselves on the fly. We formalize the notion of "tuning-free" algorithms: algorithms that match the performance of optimally tuned optimization algorithms up to polylogarithmic factors, given only loose hints on the relevant problem parameters. In particular, we consider algorithms that can match optimally tuned Stochastic Gradient Descent (SGD). When the domain of optimization is bounded, we show that tuning-free matching of SGD is possible and is achieved by several existing algorithms. For the task of minimizing a convex function that is either smooth or Lipschitz over an unbounded domain, we prove that tuning-free optimization is impossible. We then discuss conditions under which tuning-free optimization is nonetheless possible over unbounded domains. In particular, we show that the recently proposed DoG and DoWG algorithms are tuning-free when the noise distribution is sufficiently well-behaved. For the task of finding a stationary point of a smooth and potentially nonconvex function, we give a variant of SGD that matches the best-known high-probability convergence rate for tuned SGD at only an additional polylogarithmic cost. However, we also give an impossibility result showing that no algorithm can hope to match the optimal expected convergence rate for tuned SGD with high probability.
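To make the idea of a self-tuning step size concrete, the following is a minimal sketch of a DoG-style ("distance over gradients") update. It is not taken from the text above; it assumes the commonly stated form of the rule, in which the step size at iteration t is the largest distance travelled from the starting point so far, divided by the square root of the running sum of squared gradient norms. The function name `dog_sgd`, the gradient oracle `grad_fn`, and the small initial movement `r_eps` are hypothetical names chosen for illustration, and the sketch returns the last iterate rather than any averaged iterate.

```python
import numpy as np


def dog_sgd(grad_fn, x0, steps=1000, r_eps=1e-6):
    """Illustrative DoG-style SGD loop (an assumption-laden sketch).

    grad_fn: callable returning a (possibly stochastic) gradient at x.
    x0:      starting point.
    r_eps:   small initial "distance" used before any movement has occurred.
    """
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    # Seed the distance estimate with a tiny value scaled by the start point.
    max_dist = r_eps * (1.0 + np.linalg.norm(x0))
    grad_sq_sum = 0.0
    for _ in range(steps):
        g = np.asarray(grad_fn(x), dtype=float)
        grad_sq_sum += float(g @ g)
        if grad_sq_sum == 0.0:
            # All gradients so far are zero: no information, so do not move.
            continue
        # Step size = (max distance from x0 so far) / sqrt(sum of ||g||^2).
        eta = max_dist / np.sqrt(grad_sq_sum)
        x = x - eta * g
        max_dist = max(max_dist, float(np.linalg.norm(x - x0)))
    return x
```

As a usage example, calling `dog_sgd(lambda x: x - target, np.zeros(2))` with `target = np.array([3.0, -2.0])` minimizes a simple quadratic without any hand-chosen learning rate. The only user-supplied scale is `r_eps`, which plays the role of a loose hint rather than a tuned hyperparameter.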