We derive a differential equation that governs the evolution of the generalization gap when a deep network is trained by gradient descent. This differential equation is controlled by two quantities: a contraction factor that pulls together training trajectories corresponding to slightly different datasets, and a perturbation factor that accounts for these trajectories being trained on different datasets. We analyze this differential equation to compute an ``effective Gram matrix'' that characterizes the generalization gap in terms of the alignment between this Gram matrix and a certain initial ``residual''. Empirical evaluations on image classification datasets indicate that this analysis predicts the test loss accurately. Further, during training, the residual predominantly lies in the subspace of the effective Gram matrix corresponding to its smallest eigenvalues, which indicates that the generalization gap accumulates slowly along the direction of training, characterizing a benign training process. Through the alignment pattern between the ``residual'' and the ``effective Gram matrix'', we provide a novel perspective for explaining the generalization ability of neural networks trained on different datasets and with different architectures.
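As a schematic sketch only (the symbols $G(t)$, $\kappa(t)$, and $\varepsilon(t)$ below are placeholder notation, not taken verbatim from the derivation), a differential equation of the kind described above, with a contraction term and a perturbation term, can be written as a linear inhomogeneous ODE whose solution accumulates the perturbation discounted by the contraction:
% Schematic only: G(t) is the generalization gap at training time t,
% \kappa(t) a contraction factor, \varepsilon(t) a perturbation factor;
% these are placeholder symbols, not the paper's notation.
\begin{equation*}
  \frac{\mathrm{d}G(t)}{\mathrm{d}t} = -\kappa(t)\,G(t) + \varepsilon(t),
  \qquad
  G(t) = e^{-\int_0^t \kappa(u)\,\mathrm{d}u}\,G(0)
       + \int_0^t e^{-\int_s^t \kappa(u)\,\mathrm{d}u}\,\varepsilon(s)\,\mathrm{d}s .
\end{equation*}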