Stochastic gradient descent (SGD) is a workhorse algorithm for solving large-scale optimization problems in data science and machine learning; understanding its convergence is hence of fundamental importance. In this work we examine the convergence of SGD (with various step sizes) when applied to unconstrained convex quadratic programming (essentially least-squares (LS) problems), and in particular analyze the error components with respect to the eigenvectors of the Hessian. The main message is that the convergence depends largely on the corresponding eigenvalues (singular values of the coefficient matrix in the LS context): the components for the large singular values converge faster in the initial phase. We then show that there is a phase transition in the convergence, after which the convergence speed of the components, especially those corresponding to the larger singular values, decreases. Finally, we show that the convergence of the overall error (in the solution) tends to slow down as more iterations are run; that is, the initial convergence is faster than the asymptotic rate.
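The eigendirection-dependent behavior described above can be illustrated with a minimal NumPy sketch. This is an illustrative experiment under assumed settings (problem sizes, the logarithmically spaced singular values, and the constant step size `eta` are all choices made here, not taken from the paper): we build a noiseless LS problem with known singular vectors, run row-sampling SGD, and measure how much each error component along the right singular vectors has contracted.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 10  # illustrative problem sizes

# Build a least-squares problem A x = b with a known SVD, A = U diag(s) V^T.
U, _ = np.linalg.qr(rng.standard_normal((m, n)))  # orthonormal columns (left singular vectors)
V, _ = np.linalg.qr(rng.standard_normal((n, n)))  # orthogonal matrix (right singular vectors)
s = np.logspace(0, -1.5, n)                       # singular values from 1 down to ~0.03
A = (U * s) @ V.T
x_star = rng.standard_normal(n)
b = A @ x_star                                    # consistent (noiseless) system

# Constant step size, safely below 2 / max_i ||a_i||^2 for row-sampling SGD.
eta = 0.5 / np.max(np.sum(A**2, axis=1))

x = np.zeros(n)
err0 = V.T @ (x - x_star)     # initial error components along v_1, ..., v_n
for _ in range(1000):
    i = rng.integers(m)       # sample one row: loss f_i(x) = 0.5 * (a_i^T x - b_i)^2
    x -= eta * (A[i] @ x - b[i]) * A[i]
err1 = V.T @ (x - x_star)     # error components after 1000 SGD steps

# Per-direction contraction factors; components for large singular values
# (small indices) should have shrunk much more than those for small ones.
ratio = np.abs(err1 / err0)
```

In expectation the error map per step is $I - \eta\, V \operatorname{diag}(s^2) V^\top / m$, so the component along $v_j$ contracts at roughly $1 - \eta s_j^2/m$ per iteration, which is why `ratio` grows as the singular values shrink.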