In this paper, we study the well-known "Heavy Ball" (HB) method for convex and nonconvex optimization, introduced by Polyak in 1964, and establish its convergence in a variety of settings. Traditionally, most algorithms use "full-coordinate updating," that is, at each step, every component of the argument is updated. However, when the dimension of the argument is very high, it is more efficient to update some but not all components of the argument at each iteration. We refer to this as "batch updating" in this paper. When gradient-based algorithms are used together with batch updating, in principle it is sufficient to compute only those components of the gradient that correspond to the components of the argument being updated. However, if a method such as back propagation is used to compute these components, computing only some components of the gradient does not offer much savings over computing the entire gradient. Therefore, to achieve a noticeable reduction in CPU usage at each step, one can use first-order differences to approximate the gradient. The resulting estimates are biased and also have unbounded variance. Thus some delicate analysis is required to ensure that the HB algorithm converges when batch updating is used instead of full-coordinate updating, and/or approximate gradients are used instead of true gradients. In this paper, we not only establish the almost sure convergence of the iterates to the stationary point(s) of the objective function, but also derive upper bounds on the rate of convergence. To the best of our knowledge, there is no other paper that combines all of these features.
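To make the three ingredients concrete (a momentum term, batch updating, and a first-order difference in place of the true gradient), the following is a minimal illustrative sketch, not the algorithm analyzed in the paper. It assumes the classical Heavy Ball recursion x_{t+1} = x_t - a * g_t + m * (x_t - x_{t-1}), a uniformly random choice of coordinates at each step, a noise-free objective, and a deterministic forward difference. The function names, the step size, the momentum coefficient, the difference increment, and the test objective are all hypothetical choices made for illustration.

```python
import random


def heavy_ball_batch(f, x0, step=0.05, momentum=0.5, diff=1e-6,
                     batch_size=2, iters=2000, seed=0):
    """Heavy Ball with random coordinate batches and forward-difference
    gradient estimates. Illustrative sketch under assumed parameters."""
    rng = random.Random(seed)
    x = list(x0)
    x_prev = list(x0)  # no momentum on the first step
    n = len(x)
    for _ in range(iters):
        # Batch updating: pick the coordinates to update at this step.
        batch = rng.sample(range(n), batch_size)
        fx = f(x)
        x_next = list(x)  # coordinates outside the batch stay unchanged
        for i in batch:
            # First-order (forward) difference along coordinate i.
            # This estimate is biased: it differs from the true partial
            # derivative by a term of order `diff`.
            x_shift = list(x)
            x_shift[i] += diff
            g_i = (f(x_shift) - fx) / diff
            # Heavy Ball step applied to the selected coordinate only.
            x_next[i] = x[i] - step * g_i + momentum * (x[i] - x_prev[i])
        x_prev, x = x, x_next
    return x


def quadratic(x):
    # Hypothetical convex test objective, minimized at (1, ..., 1).
    return sum((k + 1) * (xi - 1.0) ** 2 for k, xi in enumerate(x))


x_final = heavy_ball_batch(quadratic, [0.0] * 5)
print(x_final)
```

In this sketch each selected coordinate costs one extra function evaluation, so a step with a batch of size k needs k + 1 evaluations rather than a full gradient. With noisy function evaluations, dividing by a small increment amplifies the noise, which is one way the variance of such estimates can grow without bound as the increment shrinks.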