We study the scaling limits of stochastic gradient descent (SGD) with constant step-size in the high-dimensional regime. We prove limit theorems for the trajectories of summary statistics (i.e., finite-dimensional functions) of SGD as the dimension goes to infinity. Our approach allows one to choose the summary statistics that are tracked, the initialization, and the step-size. It yields both ballistic (ODE) and diffusive (SDE) limits, with the limit depending dramatically on these choices. We identify a critical scaling regime for the step-size: below it, the effective ballistic dynamics matches gradient flow for the population loss, whereas at it, a new correction term appears that changes the phase diagram. Near the fixed points of this effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate. We demonstrate our approach on popular examples, including estimation for spiked matrix and tensor models and classification via two-layer networks for binary and XOR-type Gaussian mixture models. These examples exhibit surprising phenomena, including multimodal timescales to convergence as well as convergence to sub-optimal solutions with probability bounded away from zero from random (e.g., Gaussian) initializations. At the same time, we demonstrate the benefit of overparametrization by showing that the latter probability goes to zero as the second-layer width grows.
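As a rough illustration of what tracking summary statistics of constant step-size SGD can look like, the sketch below runs online SGD on a toy noiseless least-squares problem. This toy model is an assumption made here for illustration and is not one of the examples treated in the paper; the function name `run_online_sgd`, the target `x_star`, the step-size constant `c`, and the choice of statistics (`m`, the overlap with the target, and `r2`, the squared norm of the orthogonal component) are likewise hypothetical. The step-size is taken proportional to 1/d, and time is naturally measured in units of d steps.

```python
import numpy as np


def run_online_sgd(d=500, c=0.5, n_steps=None, seed=0):
    """Online SGD on a toy least-squares loss, recording two summary statistics.

    Toy model (an assumption for illustration, not taken from the paper):
    each step draws a fresh Gaussian vector a and uses the per-sample loss
    0.5 * (a . (x - x_star))**2.
    """
    rng = np.random.default_rng(seed)
    n_steps = 10 * d if n_steps is None else n_steps

    x_star = np.zeros(d)
    x_star[0] = 1.0                          # unit-norm target direction
    x = rng.standard_normal(d) / np.sqrt(d)  # Gaussian init with norm close to 1
    step = c / d                             # constant step-size of order 1/d

    m = np.empty(n_steps + 1)   # overlap <x, x_star>
    r2 = np.empty(n_steps + 1)  # squared norm of the part of x orthogonal to x_star

    for t in range(n_steps + 1):
        m[t] = x @ x_star
        r2[t] = x @ x - m[t] ** 2
        if t == n_steps:
            break
        a = rng.standard_normal(d)       # fresh sample at every step (online SGD)
        residual = a @ (x - x_star)
        x = x - step * residual * a      # gradient of 0.5 * residual**2 is residual * a

    return m, r2


if __name__ == "__main__":
    m, r2 = run_online_sgd()
    print("final overlap m:", m[-1])
    print("final orthogonal mass r2:", r2[-1])
```

In this toy model a direct computation gives that the expected squared error contracts by a factor 1 - 2c/d + c^2 (d + 2)/d^2 per step, so in rescaled time t/d the decay rate is about 2c - c^2 rather than the gradient-flow rate 2c. The extra c^2 term is a simple instance of a step-size-dependent correction appearing when the step-size is of order 1/d; it is meant only as an analogy to the correction term described above, not as a reproduction of the paper's results.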