We study the discrete dynamics of mini-batch gradient descent with random reshuffling for least squares regression. We show that the training and generalization errors depend on a sample cross-covariance matrix $Z$ between the original features $X$ and a set of new features $\widetilde{X}$, in which each feature is modified, in an averaged sense, by the mini-batches that appear before it during the learning process. Using this representation, we establish that the dynamics of mini-batch and full-batch gradient descent agree up to leading order in the step size under the linear scaling rule. However, mini-batch gradient descent with random reshuffling exhibits a subtle dependence on the step size that a gradient flow analysis cannot detect, such as converging to a limit that depends on the step size. By asymptotically comparing $Z$, a non-commutative polynomial of random matrices, with the sample covariance matrix of $X$, we demonstrate that batching affects the dynamics by inducing a form of shrinkage on the spectrum.
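The procedure under study can be sketched as follows. This is a minimal illustrative implementation of mini-batch gradient descent with random reshuffling on the least-squares loss, alongside full-batch gradient descent for comparison; the function names, step sizes, and data below are hypothetical choices for illustration, not the paper's experimental setup.

```python
import numpy as np

def minibatch_gd_rr(X, y, step, batch_size, epochs, seed=0):
    """Mini-batch gradient descent with random reshuffling (RR) on the
    least-squares loss (1/2n)||Xw - y||^2: each epoch draws one random
    permutation of the samples and sweeps its mini-batches in order."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # gradient of the per-batch loss (1/2b)||Xb w - yb||^2
            w -= step * Xb.T @ (Xb @ w - yb) / len(idx)
    return w

def fullbatch_gd(X, y, step, iters):
    """Full-batch gradient descent on the same least-squares loss."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        w -= step * X.T @ (X @ w - y) / n
    return w
```

With per-sample-normalized gradients as above, the linear scaling rule amounts to comparing one full-batch step of size $\eta$ against the $n/b$ mini-batch steps of one RR epoch at a matched step size; the abstract's point is that the two trajectories agree only to leading order in the step size, and the RR limit retains a step-size dependence.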