This paper studies the generalization of iterative learning algorithms with bounded updates on non-convex loss functions, using information-theoretic techniques. Our key contribution is a novel generalization error bound for such algorithms, extending beyond previous work that focused only on Stochastic Gradient Descent (SGD). Our approach introduces two main novelties: 1) we reformulate the mutual information as the uncertainty of the updates, providing a new perspective, and 2) instead of using the chain rule of mutual information, we employ a variance decomposition technique to decompose information across iterations, which allows for a simpler surrogate process. We analyze our generalization bound under various settings and demonstrate improved bounds when the model dimension increases at the same rate as the number of training samples. To bridge the gap between theory and practice, we also examine the previously observed scaling behavior of large language models. Ultimately, our work takes a further step toward developing practical generalization theories.
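
For orientation, the standard information-theoretic starting point that bounds of this kind build on can be sketched as follows. This is background rather than the paper's own result, and the notation is an assumption introduced here: $S$ is a training set of $n$ i.i.d. samples, $W_1, \dots, W_T$ are the iterates of the algorithm, $W = W_T$ is its output, and the loss is assumed $\sigma$-sub-Gaussian. The first line is the well-known mutual-information bound on the expected generalization error; the second is the chain-rule decomposition across iterations that the abstract says is replaced by a variance decomposition.

```latex
% Standard mutual-information generalization bound
% (assumes the loss \ell(w, Z) is \sigma-sub-Gaussian for every w):
\[
  \bigl| \mathbb{E}[\mathrm{gen}(S, W)] \bigr|
  \;\le\; \sqrt{\frac{2\sigma^{2}}{n}\, I(S; W)} .
\]

% Chain-rule route commonly used for iterative algorithms:
% data processing (W_T is a function of W_1, ..., W_T),
% followed by the chain rule of mutual information.
\[
  I(S; W_T)
  \;\le\; I(S; W_1, \dots, W_T)
  \;=\; \sum_{t=1}^{T} I\bigl(S; W_t \,\big|\, W_1, \dots, W_{t-1}\bigr) .
\]
```

The exact form of the variance decomposition and of the resulting bound is not stated in the abstract, so neither is reproduced here.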