This paper explores the generalization characteristics of iterative learning algorithms with bounded updates for non-convex loss functions, employing information-theoretic techniques. Our key contribution is a novel generalization error bound for these algorithms, extending beyond the scope of previous works that focused only on Stochastic Gradient Descent (SGD). Our approach introduces two main novelties: 1) we reformulate the mutual information as the uncertainty of updates, providing a new perspective, and 2) instead of using the chain rule of mutual information, we employ a variance decomposition technique to decompose information across iterations, allowing for a simpler surrogate process. We analyze our generalization bound under various settings and demonstrate improved bounds when the model dimension increases at the same rate as the number of training samples. To bridge the gap between theory and practice, we also examine the previously observed scaling behavior in large language models. Ultimately, our work takes a further step toward developing practical generalization theories.