Many applications require solving a family of optimization problems indexed by some hyperparameter $\lambda \in \Lambda$ to obtain an entire solution path. Traditional approaches proceed by discretizing $\Lambda$ and solving a series of optimization problems. We propose an alternative approach that parameterizes the solution path with a set of basis functions and solves a \emph{single} stochastic optimization problem to learn the entire solution path. Our method offers substantial complexity improvements over discretization. When using constant step-size SGD, the uniform error of our learned solution path relative to the true path exhibits linear convergence to a constant related to the expressiveness of the basis. When the true solution path lies in the span of the basis, this constant is zero. We also prove stronger results for special cases common in machine learning: when $\lambda \in [-1, 1]$ and the solution path is $\nu$-times differentiable, constant step-size SGD learns a path with $\epsilon$ uniform error after at most $O(\epsilon^{\frac{1}{1-\nu}} \log(1/\epsilon))$ iterations, and when the solution path is analytic, it requires only $O\left(\log^2(1/\epsilon)\log\log(1/\epsilon)\right)$ iterations. By comparison, the best-known discretization schemes in these settings require at least $O(\epsilon^{-1/2})$ discretization points (and even more gradient calls). Finally, we propose an adaptive variant of our method that sequentially adds basis functions and demonstrate its strong numerical performance through experiments.
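As a rough illustrative sketch (not the paper's actual implementation), the core idea can be demonstrated on a toy problem. For each hyperparameter $\lambda \in [-1, 1]$, consider $\min_x \frac{1}{2}(x - \sin(\pi\lambda/2))^2$, whose solution path $x^*(\lambda) = \sin(\pi\lambda/2)$ is analytic. We parameterize the path with a small Chebyshev basis and minimize the single stochastic objective $\mathbb{E}_\lambda\, f(x_\theta(\lambda), \lambda)$ with constant step-size SGD, sampling $\lambda$ uniformly at each iteration. The basis size, step size, and iteration count below are illustrative choices, not values from the paper.

```python
import numpy as np

# Toy sketch: learn the whole solution path x*(lam) = sin(pi*lam/2)
# of  min_x 0.5*(x - sin(pi*lam/2))**2  over lam in [-1, 1]
# by parameterizing x_theta(lam) = sum_j theta[j] * T_j(lam)
# (Chebyshev basis) and running constant step-size SGD on theta.

rng = np.random.default_rng(0)
deg = 6                        # number of basis functions T_0..T_5 (assumed)
theta = np.zeros(deg)
step = 0.05                    # constant step size (assumed)

def basis(lam):
    # Chebyshev polynomials T_j(lam) = cos(j * arccos(lam))
    return np.cos(np.arange(deg) * np.arccos(lam))

for _ in range(200_000):
    lam = rng.uniform(-1.0, 1.0)           # sample the hyperparameter
    phi = basis(lam)
    x = phi @ theta                        # current path value x_theta(lam)
    grad_x = x - np.sin(np.pi * lam / 2)   # df/dx at (x, lam)
    theta -= step * grad_x * phi           # chain rule: grad wrt theta

# uniform error of the learned path against the true path on a grid
grid = np.linspace(-1.0, 1.0, 201)
err = max(abs(basis(l) @ theta - np.sin(np.pi * l / 2)) for l in grid)
print(f"uniform error: {err:.4f}")   # small; floor set by basis expressiveness
```

One stochastic solve thus replaces a grid of separate optimizations, and because the target path is analytic, a handful of basis functions already keeps the uniform error small, consistent with the fast rates stated above.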