We study the bias of Stochastic Gradient Descent (SGD) toward learning low-rank weight matrices when training deep ReLU neural networks. Our results show that training neural networks with mini-batch SGD and weight decay induces a bias toward rank minimization of the weight matrices. Specifically, we show, both theoretically and empirically, that this bias is more pronounced with smaller batch sizes, higher learning rates, or stronger weight decay. Additionally, we predict and observe empirically that weight decay is necessary for this bias to emerge. Unlike previous literature, our analysis does not rely on assumptions about the data, convergence, or optimality of the weight matrices, and it applies to a wide range of neural network architectures of any width or depth. Finally, we empirically investigate the connection between this bias and generalization, finding that it has only a marginal effect on generalization.
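As a rough illustration of why batch size and weight decay could interact in this way, consider the following minimal sketch. It is not the setup or proof used in this work; it assumes a single fully connected ReLU layer trained with a squared loss, in which case the mini-batch gradient is a sum of one rank-one term per sample and therefore has rank at most the batch size, while weight decay shrinks the previous weights by a factor of `1 - lr * weight_decay` at each step. The names `numerical_rank` and `batch_gradient`, the threshold `rel_tol`, and all hyperparameter values are hypothetical choices made for illustration.

```python
import numpy as np


def numerical_rank(matrix, rel_tol=1e-3):
    """Count singular values larger than rel_tol times the largest one.

    rel_tol is an illustrative threshold, not a value taken from the paper.
    """
    s = np.linalg.svd(matrix, compute_uv=False)
    if s[0] == 0:
        return 0
    return int(np.sum(s > rel_tol * s[0]))


def batch_gradient(W, X, Y):
    """Gradient w.r.t. W of the mean over the batch of 0.5 * ||relu(W x) - y||^2."""
    pre = X @ W.T                          # pre-activations, shape (B, d_out)
    out = np.maximum(pre, 0.0)             # ReLU
    delta = (out - Y) * (pre > 0)          # backpropagated error, shape (B, d_out)
    # Sum of B outer products delta_i x_i^T, so the rank is at most B.
    return delta.T @ X / X.shape[0]        # shape (d_out, d_in)


# Hypothetical sizes and hyperparameters, chosen only for the demonstration.
rng = np.random.default_rng(0)
d_in, d_out, batch_size = 64, 64, 4
lr, weight_decay = 0.1, 0.01

W = rng.standard_normal((d_out, d_in))
X = rng.standard_normal((batch_size, d_in))
Y = rng.standard_normal((batch_size, d_out))

grad = batch_gradient(W, X, Y)

# One step of SGD with weight decay: shrink the old weights, add a low-rank update.
W_next = (1.0 - lr * weight_decay) * W - lr * grad

print("rank of mini-batch gradient:", numerical_rank(grad))
```

Under these assumptions, each step contracts the existing weights and adds an update of rank at most `batch_size`, which is one intuitive reading of why smaller batches, larger learning rates, and stronger weight decay would each strengthen a low-rank bias, and why the contraction from weight decay would be needed for it.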