This paper presents a novel information-theoretic perspective on generalization in machine learning, framing the learning problem as lossy compression and applying finite blocklength analysis. In our approach, the sampling of training data formally corresponds to an encoding process, and model construction to a decoding process. Leveraging finite blocklength analysis, we derive lower bounds on sample complexity and generalization error for a fixed randomized learning algorithm paired with its optimal sampling strategy. Our bounds explicitly characterize, as distinct terms, the degree of overfitting of the learning algorithm and the mismatch between its inductive bias and the task; this explicit separation is a key advantage over existing frameworks. Additionally, we decompose the overfitting term to reveal its theoretical connection to quantities appearing in existing information-theoretic bounds and in stability theory, unifying these perspectives under the proposed framework.
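As an illustrative sketch only (not taken from the paper; the symbols $\mathrm{gen}$, $\Delta_{\mathrm{fit}}$, and $\Delta_{\mathrm{bias}}$ are hypothetical placeholders), the claimed two-term separation can be pictured as a lower bound of the schematic form
\[
\mathrm{gen}(P_Z, \mathcal{A}) \;\ge\; \underbrace{\Delta_{\mathrm{fit}}(\mathcal{A})}_{\text{overfitting of the algorithm}} \;+\; \underbrace{\Delta_{\mathrm{bias}}(\mathcal{A}, P_Z)}_{\text{inductive-bias/task mismatch}},
\]
where $P_Z$ denotes the task distribution and $\mathcal{A}$ the fixed randomized learning algorithm; the precise form of the bound and the definitions of the two terms are given in the body of the paper.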