We study the scaling limits of stochastic gradient descent (SGD) with constant step-size in the high-dimensional regime. We prove limit theorems for the trajectories of summary statistics (i.e., finite-dimensional functions) of SGD as the dimension goes to infinity. Our approach allows one to choose the summary statistics that are tracked, the initialization, and the step-size. It yields both ballistic (ODE) and diffusive (SDE) limits, with the limit depending strongly on these choices. We identify a critical scaling regime for the step-size: below it, the effective ballistic dynamics match gradient flow for the population loss, whereas at it, a new correction term appears that changes the phase diagram. Near the fixed points of these effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate. We demonstrate our approach on popular examples, including estimation for spiked matrix and tensor models and classification via two-layer networks for binary and XOR-type Gaussian mixture models. These examples exhibit surprising phenomena, including multimodal timescales to convergence as well as convergence to sub-optimal solutions with probability bounded away from zero from random (e.g., Gaussian) initializations. At the same time, we demonstrate the benefit of overparametrization by showing that the latter probability goes to zero as the second-layer width grows.
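To make the notion of a step-size correction term concrete, the following is a minimal numerical sketch on a toy problem that is not one of the examples treated in this work. It is an assumption chosen only because its limit can be computed by hand: online SGD on the loss L(x; y) = ||x - y||^2 / 2 with samples y = lam * v + g, where g is standard Gaussian in dimension n and the step-size is delta = c / n. The names `run_sgd`, `ode_limit`, `lam`, and `c` are hypothetical. The tracked summary statistics are the overlap m = <x, v> and the squared norm r2 of the component of x orthogonal to v. For this toy model a direct calculation gives dm/dt = lam - m and dr2/dt = -2 r2 + c in time units of 1/delta steps, so gradient flow on the population loss (the case c -> 0) would send r2 to 0, while at delta = c / n it settles near c / 2.

```python
import numpy as np


def run_sgd(n=2000, c=1.0, lam=2.0, T=5.0, seed=0):
    """Online SGD on the toy loss 0.5 * ||x - y||^2 with step-size c / n.

    Returns rescaled times t = k * delta and the two summary statistics
    m = <x, v> and r2 = ||x - m v||^2 along the trajectory.
    """
    rng = np.random.default_rng(seed)
    delta = c / n                      # step-size at the (assumed) critical scale
    steps = int(round(T / delta))      # T units of rescaled time
    v = np.zeros(n)
    v[0] = 1.0                         # signal direction; v = e_1 without loss of generality
    x = np.zeros(n)                    # deterministic initialization at the origin
    m = np.empty(steps + 1)
    r2 = np.empty(steps + 1)
    m[0], r2[0] = 0.0, 0.0
    for k in range(1, steps + 1):
        y = lam * v + rng.standard_normal(n)   # fresh sample at every step
        x -= delta * (x - y)                   # gradient of 0.5 * ||x - y||^2 is x - y
        m[k] = x[0]                            # overlap with v = e_1
        r2[k] = x @ x - m[k] ** 2              # squared norm orthogonal to v
    t = delta * np.arange(steps + 1)
    return t, m, r2


def ode_limit(t, c=1.0, lam=2.0):
    """Closed-form solution of dm/dt = lam - m, dr2/dt = -2 r2 + c from (0, 0)."""
    m = lam * (1.0 - np.exp(-t))
    r2 = 0.5 * c * (1.0 - np.exp(-2.0 * t))
    return m, r2


if __name__ == "__main__":
    t, m, r2 = run_sgd()
    m_ode, r2_ode = ode_limit(t)
    print("final m  (SGD, ODE):", m[-1], m_ode[-1])
    print("final r2 (SGD, ODE):", r2[-1], r2_ode[-1])
```

In this sketch the term c in the equation for r2 plays the role of the correction: it comes from the second-order contribution delta^2 * ||g||^2 of each step, which is of order delta^2 * n and therefore survives in the limit exactly when delta is of order 1 / n. Taking c much smaller should bring the trajectory of r2 back toward the gradient-flow prediction, at the cost of proportionally more steps per unit of rescaled time.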