Minimizing a sum of $n$ functions in $d$ dimensions is ubiquitous in machine learning and statistics. When the number of observations $n$ is large, incremental or stochastic methods become necessary because their per-iteration cost is independent of $n$. Among these, Quasi-Newton (QN) methods strike a balance between per-iteration cost and convergence rate: they achieve a superlinear rate at $O(d^2)$ cost, in contrast to the linear rate of first-order methods at $O(d)$ cost and the quadratic rate of second-order methods at $O(d^3)$ cost. However, existing incremental methods have notable shortcomings. Incremental Quasi-Newton (IQN) exhibits only asymptotic superlinear convergence, whereas Incremental Greedy BFGS (IGS) offers an explicit superlinear rate but suffers from poor empirical performance and has a per-iteration cost of $O(d^3)$. To address these issues, we introduce the Sharpened Lazy Incremental Quasi-Newton Method (SLIQN), which achieves the best of both worlds: an explicit superlinear convergence rate and superior empirical performance at a per-iteration cost of $O(d^2)$. SLIQN features two key changes. First, it uses a hybrid strategy that combines classic and greedy BFGS updates, allowing it to empirically outperform both IQN and IGS. Second, it employs a constant multiplicative factor together with a lazy propagation strategy, which keeps its per-iteration cost at $O(d^2)$. Our experiments demonstrate the superiority of SLIQN over other incremental and stochastic Quasi-Newton variants and establish its competitiveness with second-order incremental methods.
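To make the two kinds of BFGS update mentioned in the abstract concrete, the following is a minimal sketch on a single fixed quadratic with Hessian `A`. It is not the SLIQN algorithm: the incremental structure, the multiplicative factor, and the lazy propagation are all omitted. The greedy rule shown (choose the coordinate vector maximizing the ratio of diagonal entries) is the standard greedy Quasi-Newton selection and is assumed here to be the one intended. The helper names `bfgs_update`, `greedy_direction`, `sigma`, and `make_problem` are hypothetical and chosen only for illustration.

```python
import numpy as np


def bfgs_update(B, A, u):
    """Return the BFGS update of B toward A along direction u.

    The result B_new satisfies B_new @ u == A @ u. Only matrix-vector and
    outer products are used, so the cost is O(d^2).
    """
    Bu = B @ u
    Au = A @ u
    return B - np.outer(Bu, Bu) / (u @ Bu) + np.outer(Au, Au) / (u @ Au)


def greedy_direction(B, A):
    """Pick the coordinate vector e_i that maximizes B[i, i] / A[i, i]."""
    i = int(np.argmax(np.diag(B) / np.diag(A)))
    e = np.zeros(B.shape[0])
    e[i] = 1.0
    return e


def sigma(B, A):
    """trace(A^{-1} (B - A)): a distance of B from A, meaningful when B >= A."""
    return float(np.trace(np.linalg.solve(A, B - A)))


def make_problem(d=5, seed=0):
    """Build an illustrative positive definite A and a start B0 = L * I >= A."""
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((d, d))
    A = M @ M.T + d * np.eye(d)
    L = np.linalg.eigvalsh(A).max()  # largest eigenvalue of A
    return A, L * np.eye(d)
```

On a quadratic, the classic BFGS update corresponds to calling `bfgs_update(B, A, u)` with `u` equal to the displacement between consecutive iterates, while the greedy update passes `greedy_direction(B, A)` instead. Repeating the greedy update from `B0` should shrink `sigma(B, A)` at every step, which is the kind of explicit progress measure that greedy updates are known for.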