Information-theoretic (IT) generalization bounds have been used to study the generalization of learning algorithms. These bounds are intrinsically data- and algorithm-dependent, so the properties of the data and the algorithm can be exploited to derive tighter bounds. However, we observe that although the flatness bias is crucial for SGD's generalization, existing IT bounds fail to capture the improved generalization under better flatness and are also numerically loose. This is because existing IT bounds do not adequately leverage SGD's flatness bias. This paper derives an IT bound that better leverages flatness for the flatness-favoring SGD. The bound indicates that the learned models generalize better if the large-variance directions of the final weight covariance have small local curvatures in the loss landscape. Experiments on deep neural networks show that our bound not only correctly reflects the better generalization when flatness is improved, but is also numerically much tighter. This is achieved through a flexible technique called the "omniscient trajectory". When applied to Gradient Descent's minimax excess risk on convex-Lipschitz-bounded problems, the technique improves representative IT bounds' $\Omega(1)$ rates to $O(1/\sqrt{n})$. It also implies a way to bypass the memorization-generalization trade-off.
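To make the flatness condition concrete, here is a minimal illustrative sketch (an assumed form for exposition, not the paper's actual bound): let $\Sigma = \sum_i \sigma_i^2 \, v_i v_i^\top$ denote the eigendecomposition of the covariance of the final weights and $H(w^\ast)$ the Hessian of the empirical loss at the learned solution $w^\ast$. A natural quantity measuring how the weight variance aligns with local curvature is
\[
\operatorname{tr}\!\big(\Sigma\, H(w^\ast)\big) \;=\; \sum_{i} \sigma_i^2 \; v_i^\top H(w^\ast)\, v_i ,
\]
which is small precisely when the large-variance eigendirections $v_i$ (large $\sigma_i^2$) lie along directions of small curvature $v_i^\top H(w^\ast) v_i$, matching the abstract's statement of when the learned models generalize better.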