The largest eigenvalue of the Hessian, or sharpness, of neural networks is a key quantity for understanding their optimization dynamics. In this paper, we study the sharpness of deep linear networks for univariate regression. Minimizers can have arbitrarily large sharpness, but their sharpness cannot be arbitrarily small: we prove a lower bound on the sharpness of minimizers that grows linearly with depth. We then study the properties of the minimizer found by gradient flow, the limit of gradient descent with vanishing learning rate. We show an implicit regularization towards flat minima: the sharpness of the minimizer found is no more than a constant times the lower bound, where the constant depends on the condition number of the data covariance matrix but not on the width or depth of the network. This result is proven for both a small-scale initialization and a residual initialization, and in each case we establish results of independent interest. For small-scale initialization, we show that the learned weight matrices are approximately rank one and that their singular vectors align. For residual initialization, we prove convergence of the gradient flow from a Gaussian initialization of the residual network. Numerical experiments illustrate our results and connect them to gradient descent with a non-vanishing learning rate.
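To make the quantities in the abstract concrete, here is a minimal, self-contained sketch of the kind of experiment it alludes to; it is not the paper's code. It uses a width-one (scalar-weight) deep linear network so that everything fits in a few lines, trains it with small-step gradient descent as a stand-in for gradient flow, and estimates the sharpness by power iteration on finite-difference Hessian-vector products. The data, depth, learning rate, and the closed-form reference value for the balanced minimizer are our own illustrative choices, not quantities taken from the paper.

    import numpy as np

    # Sketch (assumptions: width-one network, synthetic data): train a deep
    # linear network f(x) = w_L * ... * w_1 * x on univariate regression with
    # gradient descent at a small learning rate (a proxy for gradient flow),
    # then estimate the sharpness lambda_max(Hessian) by power iteration on
    # finite-difference Hessian-vector products.

    rng = np.random.default_rng(0)
    depth, n = 5, 50                 # network depth L and sample size
    x = rng.normal(size=n)
    y = 2.0 * x                      # target: the linear map with slope 2
    w = 0.5 * np.ones(depth)         # small, balanced initialization

    def loss_grad(w):
        """Loss 0.5 * mean((prod(w) * x - y)^2) and its gradient w.r.t. w."""
        p = np.prod(w)
        r = p * x - y
        g = np.mean(r * x) * np.array(
            [np.prod(np.delete(w, i)) for i in range(len(w))]
        )
        return 0.5 * np.mean(r**2), g

    lr = 1e-2
    for _ in range(10_000):
        loss, g = loss_grad(w)
        if loss < 1e-12:             # stop once essentially at a minimizer
            break
        w -= lr * g

    def hvp(w, v, eps=1e-5):
        """Hessian-vector product via central differences of the gradient."""
        _, gp = loss_grad(w + eps * v)
        _, gm = loss_grad(w - eps * v)
        return (gp - gm) / (2.0 * eps)

    # Power iteration for the largest Hessian eigenvalue, i.e. the sharpness.
    v = rng.normal(size=depth)
    for _ in range(100):
        v = hvp(w, v)
        v /= np.linalg.norm(v)
    sharpness = float(v @ hvp(w, v))

    # For this width-one model, the sharpness at the balanced global minimizer
    # can be computed in closed form (our own calculation, not a value from the
    # paper): at a minimizer the Hessian is mean(x^2) * grad(p) grad(p)^T, so
    # lambda_max = mean(x^2) * depth * slope^(2*(depth-1)/depth).
    reference = np.mean(x**2) * depth * 2.0 ** (2 * (depth - 1) / depth)
    print(f"loss={loss:.2e}  sharpness={sharpness:.3f}  reference={reference:.3f}")

Under these assumptions the measured sharpness matches the balanced-minimizer value mean(x^2) * L * slope^(2(L-1)/L), which scales linearly in the depth L, consistent with the linear-in-depth lower bound stated above; the toy model is only meant to make that scaling tangible.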