We consider minimizing finite-sum and expectation objective functions via Hessian-averaging-based subsampled Newton methods. These methods allow for gradient inexactness and have fixed per-iteration Hessian approximation costs. The recent work (Na et al. 2023) demonstrated that Hessian averaging can be utilized to achieve fast $\mathcal{O}\left(\sqrt{\tfrac{\log k}{k}}\right)$ local superlinear convergence for strongly convex functions with high probability, while maintaining fixed per-iteration Hessian costs. These methods, however, require exact gradients and strong convexity, which poses challenges for their practical implementation. To address these concerns, we consider Hessian-averaged methods that allow gradient inexactness via norm-condition-based adaptive sampling strategies. For the finite-sum problem, we utilize deterministic sampling techniques, which lead to global linear and sublinear convergence rates for strongly convex and nonconvex functions, respectively. In this setting, we derive an improved deterministic local superlinear convergence rate of $\mathcal{O}\left(\tfrac{1}{k}\right)$. For the expectation problem, we utilize stochastic sampling techniques and derive global linear and sublinear rates for strongly convex and nonconvex functions, as well as an $\mathcal{O}\left(\tfrac{1}{\sqrt{k}}\right)$ local superlinear convergence rate, all in expectation. We present novel analysis techniques that differ from the previous probabilistic results. Additionally, we propose scalable and efficient variations of these methods via diagonal approximations and derive the novel diagonally-averaged Newton (Dan) method for large-scale problems. Our numerical results demonstrate that Hessian averaging not only helps with convergence, but can also lead to state-of-the-art performance on difficult problems such as CIFAR100 classification with ResNets.
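For orientation, a minimal sketch of the uniform Hessian-averaging update underlying these methods is given below; the notation is illustrative, with $\widehat{H}_i$ denoting a subsampled Hessian estimate at iterate $x_i$, $g_k$ a (possibly inexact) gradient estimate, and $\alpha_k$ a step size:
\begin{equation*}
\bar{H}_k \;=\; \frac{1}{k+1}\sum_{i=0}^{k} \widehat{H}_i \;=\; \frac{k}{k+1}\,\bar{H}_{k-1} \;+\; \frac{1}{k+1}\,\widehat{H}_k,
\qquad
x_{k+1} \;=\; x_k \;-\; \alpha_k\,\bar{H}_k^{-1} g_k.
\end{equation*}
The recursive form shows that each iteration requires only one new Hessian estimate, which is what keeps the per-iteration Hessian cost fixed; a diagonally-averaged variant in the spirit of Dan would average only the diagonals of the $\widehat{H}_i$, reducing storage and inversion cost for large-scale problems.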