In this paper, we establish novel data-dependent upper bounds on the generalization error through the lens of a "variable-size compressibility" framework that we introduce. In this framework, the generalization error of an algorithm is linked to a variable-size "compression rate" of its input data. This is shown to yield bounds that depend on the empirical measure of the input data at hand, rather than on its unknown distribution. The generalization bounds that we establish are tail bounds, tail bounds on the expectation, and in-expectation bounds. Moreover, we show that our framework also allows one to derive general bounds on any function of the input data and output hypothesis random variables. In particular, these general bounds are shown to subsume, and possibly improve over, several existing PAC-Bayes and data-dependent intrinsic-dimension-based bounds, which are recovered as special cases, thus unveiling the unifying character of our approach. For instance, a new data-dependent intrinsic-dimension-based bound is established, which connects the generalization error to the optimization trajectories and reveals various interesting connections with the rate-distortion dimension of a process, the Rényi information dimension of a process, and the metric mean dimension.