We study the bias of stochastic gradient descent (SGD) toward learning low-rank weight matrices when training deep ReLU neural networks. Our results show that training neural networks with mini-batch SGD and weight decay biases the weight matrices toward rank minimization. Specifically, we show, both theoretically and empirically, that this bias is more pronounced with smaller batch sizes, higher learning rates, or stronger weight decay. We also predict and observe empirically that weight decay is necessary for this bias to arise. In addition, we show that in the presence of intermediate neural collapse, the learned weights are particularly low-rank. Unlike previous literature, our analysis does not rely on assumptions about the data, convergence, or optimality of the weight matrices. Furthermore, it applies to a wide range of neural network architectures of any width or depth. Finally, we empirically investigate the connection between this bias and generalization, finding that the bias has only a marginal effect on generalization.
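One way to build intuition for why batch size and weight decay enter the picture is a toy illustration. It is not the paper's analysis, and it assumes a single linear layer with squared loss instead of a deep ReLU network. In that setting the mini-batch gradient is a sum of B outer products, so its rank is at most the batch size B, while the weight decay term only rescales the current weights by the factor (1 - lr * weight_decay). All names below (`numerical_rank`, `minibatch_grad`, `sgd_step`) and all dimensions are hypothetical choices made for this sketch.

```python
import numpy as np


def numerical_rank(M, rel_tol=1e-8):
    """Count singular values larger than rel_tol times the largest one."""
    s = np.linalg.svd(M, compute_uv=False)
    if s.size == 0 or s[0] == 0.0:
        return 0
    return int(np.sum(s > rel_tol * s[0]))


def minibatch_grad(W, X, Y):
    """Gradient of the mean over the batch of 0.5 * ||W x - y||^2 w.r.t. W.

    Shapes: W is (d_out, d_in), X is (B, d_in), Y is (B, d_out).
    The result is (1/B) * sum_i (W x_i - y_i) x_i^T, a sum of B outer products.
    """
    residual = X @ W.T - Y
    return residual.T @ X / X.shape[0]


def sgd_step(W, X, Y, lr, weight_decay):
    """One mini-batch SGD step with L2 weight decay folded into the update."""
    return (1.0 - lr * weight_decay) * W - lr * minibatch_grad(W, X, Y)


# Toy setup (assumed sizes): a 15 x 20 weight matrix and a batch of 3 samples.
rng = np.random.default_rng(0)
d_in, d_out, B = 20, 15, 3
W0 = rng.standard_normal((d_out, d_in))
X = rng.standard_normal((B, d_in))
Y = rng.standard_normal((B, d_out))

grad_rank = numerical_rank(minibatch_grad(W0, X, Y))  # bounded by B
init_rank = numerical_rank(W0)                        # rank of the random init
W1 = sgd_step(W0, X, Y, lr=0.1, weight_decay=0.01)
```

In this toy model each step adds a term of rank at most B and shrinks everything accumulated so far, which matches the direction of the stated dependence on batch size and weight decay. The deep, nonlinear case treated in the paper requires its own argument.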