We analyze the one-pass stochastic gradient descent dynamics of a two-layer neural network with quadratic activations in a teacher--student framework. In the high-dimensional regime, where the input dimension $N$ and the number of samples $M$ diverge at a fixed ratio $\alpha = M/N$, and for finite hidden widths $(p,p^*)$ of the student and teacher, respectively, we study the low-dimensional ordinary differential equations (ODEs) that govern the evolution of the student--teacher and student--student overlap matrices. We show that overparameterization ($p>p^*$) only modestly accelerates escape from a plateau of poor generalization: it modifies the prefactor of the exponential decay of the loss. We then examine how unconstrained weight norms introduce a continuous rotational symmetry that results in a nontrivial manifold of zero-loss solutions for $p>1$. From this manifold, the dynamics consistently selects the solution closest to the random initialization, as enforced by a conserved quantity in the ODEs governing the evolution of the overlaps. Finally, a Hessian analysis of the population-loss landscape confirms that the plateau corresponds to saddles with at least one negative eigenvalue, while the solution manifold corresponds to marginal minima.
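The setting described above can be sketched numerically as follows. This is a minimal illustration, not the authors' code: the $1/p$ and $1/p^*$ output normalizations, the $1/\sqrt{N}$ scaling of the pre-activations, the squared loss, the Gaussian inputs and initialization, and all function names and default values are assumptions made for this sketch. Under those assumptions, the population loss depends on the weights only through the overlap matrices, via the Gaussian identity $\mathbb{E}[(z^\top A z)(z^\top B z)] = \mathrm{tr}(A\Sigma)\,\mathrm{tr}(B\Sigma) + 2\,\mathrm{tr}(A\Sigma B\Sigma)$ for symmetric $A$, $B$.

```python
import numpy as np


def overlaps(W, W_star):
    """Return (Q, M, P): student-student, student-teacher and
    teacher-teacher overlap matrices, each normalized by N."""
    N = W.shape[1]
    Q = W @ W.T / N
    M = W @ W_star.T / N
    P = W_star @ W_star.T / N
    return Q, M, P


def population_loss(Q, M, P):
    """0.5 * E[(y_student - y_teacher)^2] for standard Gaussian inputs,
    written in terms of the overlaps only (assumed model, see lead-in)."""
    p, p_star = M.shape
    student_sq = (np.trace(Q) ** 2 + 2.0 * np.sum(Q * Q)) / p**2
    teacher_sq = (np.trace(P) ** 2 + 2.0 * np.sum(P * P)) / p_star**2
    cross = (np.trace(Q) * np.trace(P) + 2.0 * np.sum(M * M)) / (p * p_star)
    return 0.5 * (student_sq + teacher_sq - 2.0 * cross)


def run_one_pass_sgd(N=200, p=3, p_star=2, lr=0.1, steps=2000, seed=0):
    """One-pass SGD: every step draws a fresh sample that is never reused."""
    rng = np.random.default_rng(seed)
    W_star = rng.standard_normal((p_star, N))  # teacher weights (fixed)
    W = rng.standard_normal((p, N))            # student weights (unconstrained norm)
    losses = []
    for step in range(steps):
        x = rng.standard_normal(N)
        lam = W @ x / np.sqrt(N)               # student pre-activations
        lam_star = W_star @ x / np.sqrt(N)     # teacher pre-activations
        err = np.mean(lam**2) - np.mean(lam_star**2)
        # Gradient of 0.5 * err^2 with respect to each student row w_j.
        W -= lr * err * (2.0 / p) * np.outer(lam, x) / np.sqrt(N)
        if step % N == 0:                      # record once per unit of time t = step / N
            losses.append(population_loss(*overlaps(W, W_star)))
    return W, W_star, np.array(losses)


if __name__ == "__main__":
    W, W_star, losses = run_one_pass_sgd()
    print(losses)
```

Tracking `Q` and `M` along such a run gives the finite-$N$ counterpart of the overlap trajectories that the ODEs describe in the limit $N \to \infty$. The sketch makes no claim about matching the paper's learning-rate scaling or its conserved quantity.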