We investigate the stationary (late-time) training regime of single- and two-layer underparameterized linear neural networks within the continuum limit of stochastic gradient descent (SGD) for synthetic Gaussian data. For a single-layer network in the weakly underparameterized regime, the spectrum of the noise covariance matrix deviates notably from that of the Hessian, which can be attributed to the broken detailed balance of the SGD dynamics. In this case the weight fluctuations are generally anisotropic, but they are subject to an isotropic loss. For a two-layer network, we obtain the stochastic dynamics of the weights in each layer and analyze the associated stationary covariances. We identify the inter-layer coupling as a new source of anisotropy for the weight fluctuations. In contrast to the single-layer case, the weight fluctuations experience an anisotropic loss, the flatness of which is inversely related to the fluctuation variance. We thereby provide an analytical derivation of the recently observed inverse variance-flatness relation in a model of a deep linear neural network.
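As a rough numerical illustration of the single-layer setting described above, the following sketch runs discrete mini-batch SGD on a linear model with synthetic Gaussian inputs and estimates the stationary weight covariance alongside the Hessian of the loss. It is not the analysis carried out in this work: the function name `simulate_sgd`, the teacher vector, the additive label noise, the quadratic loss, and all parameter values are assumptions chosen only to produce a nontrivial stationary state.

```python
import numpy as np

def simulate_sgd(n_samples=200, dim=5, lr=0.02, batch=4,
                 steps=20000, burn_in=5000, seed=0):
    """Mini-batch SGD on a single-layer linear model with Gaussian data.

    Returns the Hessian of the full-batch quadratic loss and the empirical
    covariance of the weights collected after a burn-in period.
    All settings here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Synthetic Gaussian inputs; n_samples > dim gives an underparameterized model.
    X = rng.standard_normal((n_samples, dim))
    w_teacher = rng.standard_normal(dim)
    # Assumption: label noise keeps the minimum loss nonzero, so the SGD
    # noise does not vanish at late times.
    y = X @ w_teacher + 0.5 * rng.standard_normal(n_samples)

    # Hessian of the loss L(w) = 0.5 * mean((X w - y)^2).
    hessian = X.T @ X / n_samples

    w = np.zeros(dim)
    samples = []
    for t in range(steps):
        idx = rng.integers(0, n_samples, size=batch)  # sample with replacement
        xb, yb = X[idx], y[idx]
        grad = xb.T @ (xb @ w - yb) / batch           # mini-batch gradient
        w = w - lr * grad
        if t >= burn_in:
            samples.append(w.copy())                  # late-time weights only

    weight_cov = np.cov(np.asarray(samples), rowvar=False)
    return hessian, weight_cov
```

Comparing `np.linalg.eigvalsh(hessian)` with `np.linalg.eigvalsh(weight_cov)` gives a quick look at how anisotropic the late-time fluctuations are relative to the curvature of the loss; the quantitative statements in the text refer to the continuum limit, which this discrete simulation only approximates.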