Block coordinate descent is a powerful algorithmic template suitable for big-data optimization. This template admits many variants, including block gradient descent (BGD), which performs gradient descent on a selected block of variables while keeping the other variables fixed. For a long time, the stepsize for each block has tacitly been set to the reciprocal of the block-wise Lipschitz smoothness constant, imitating the vanilla stepsize rule for gradient descent (GD). However, this choice has not yet yielded a theoretical justification for BGD's empirical superiority over GD, since existing convergence rates for BGD have worse constants than those of GD in the deterministic case. In search of such a justification, we study a simple setting: BGD applied to least squares with two blocks of variables. Assuming the data matrix corresponding to each block is orthogonal, we derive optimal stepsizes of BGD in closed form, which provably lead to asymptotic convergence rates twice as fast as those of GD with Polyak's momentum; that is, under this orthogonality assumption, one can accelerate BGD simply by tuning stepsizes, without adding any momentum. An application satisfying this assumption is \textit{generalized alternating projection} between two subspaces, and applying our stepsizes to it improves a prior convergence rate that was once claimed, slightly inaccurately, to be optimal. The main proof idea is to minimize, over the stepsize variables, the spectral radius of a matrix that controls the convergence rate.
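For concreteness, the setup described above can be sketched in a few lines of NumPy: two-block BGD on the least-squares objective $f(x,y) = \tfrac12\|Ax + By - c\|^2$, alternating a gradient step on each block with the classical $1/L$ block-wise stepsize while the other block is held fixed. This is a minimal illustration, not the paper's method; the matrices $A$, $B$, $c$ and the orthonormal-column construction (one way to realize the orthogonality assumption) are made up for the example.

```python
import numpy as np

# Two-block BGD on f(x, y) = 0.5 * ||A x + B y - c||^2.
# A and B have orthonormal columns, illustrating the orthogonality
# assumption; the stepsizes follow the classical block-wise rule 1/L.
rng = np.random.default_rng(0)
A, _ = np.linalg.qr(rng.standard_normal((20, 5)))  # orthonormal columns
B, _ = np.linalg.qr(rng.standard_normal((20, 5)))
c = rng.standard_normal(20)

# Block-wise Lipschitz constants: top eigenvalue of A^T A and B^T B
# (both equal 1 here because the columns are orthonormal).
L_A = np.linalg.norm(A, 2) ** 2
L_B = np.linalg.norm(B, 2) ** 2

x = np.zeros(5)
y = np.zeros(5)
for _ in range(300):
    # Gradient step on block x, with y held fixed
    x -= (1.0 / L_A) * A.T @ (A @ x + B @ y - c)
    # Gradient step on block y, with the updated x held fixed
    y -= (1.0 / L_B) * B.T @ (A @ x + B @ y - c)

# At a stationary point, the gradient with respect to each block vanishes.
residual = A @ x + B @ y - c
print(np.linalg.norm(A.T @ residual), np.linalg.norm(B.T @ residual))
```

Note that with orthonormal columns and stepsize $1/L = 1$, each block update exactly minimizes the objective over that block, so the iteration reduces to alternating minimization; its linear convergence rate is governed by the principal angles between the column spaces of $A$ and $B$, which is what the spectral-radius analysis in the abstract optimizes over stepsizes.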