We consider the problem of minimizing the average of a large number of smooth but possibly non-convex functions. In the context of most machine learning applications, each loss function is non-negative and can thus be expressed as the composition of a square and its real-valued square root. This reformulation allows us to apply the Gauss-Newton method, or the Levenberg-Marquardt method when a quadratic regularization is added. The resulting algorithm, while being computationally as efficient as the vanilla stochastic gradient method, is highly adaptive and can automatically warm up and decay the effective stepsize while tracking the non-negative loss landscape. We provide a tight convergence analysis, leveraging new techniques, in the stochastic convex and non-convex settings. In particular, in the convex case, the method does not require access to the gradient Lipschitz constant for convergence, and is guaranteed to never diverge. The convergence rates and empirical evaluations compare favorably to the classical (stochastic) gradient method as well as to several other adaptive methods.
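To make the reformulation concrete, the following is a minimal sketch of the resulting update. The notation ($x_k$ for the iterate, $f_i$ for a sampled loss, $\sigma > 0$ for the regularization parameter, $\gamma_k$ for the effective stepsize) is illustrative, and the closed-form expression is derived here under a standard Levenberg-Marquardt model rather than quoted from the abstract. Writing a non-negative loss as $f_i(x) = r_i(x)^2$ with $r_i(x) = \sqrt{f_i(x)}$, linearizing $r_i$ at $x_k$ and adding a quadratic regularization gives the subproblem
\[
\min_{x} \; \big( r_i(x_k) + \nabla r_i(x_k)^\top (x - x_k) \big)^2 + \frac{1}{2\sigma}\,\|x - x_k\|^2 ,
\]
whose minimizer is the gradient step $x_{k+1} = x_k - \gamma_k \nabla f_i(x_k)$ with
\[
\gamma_k \;=\; \frac{\sigma}{1 + \dfrac{\sigma\,\|\nabla f_i(x_k)\|^2}{2\, f_i(x_k)}} .
\]
Under this model, the stepsize is damped when the stochastic gradient is large relative to the loss and approaches $\sigma$ as the loss vanishes, consistent with the automatic warmup and decay behavior described above.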