In this paper, we establish novel data-dependent upper bounds on the generalization error through the lens of a "variable-size compressibility" framework that we introduce here. In this framework, the generalization error of an algorithm is linked to a variable-size "compression rate" of its input data. This is shown to yield bounds that depend on the empirical measure of the given input data, rather than on its unknown distribution. The generalization bounds that we establish are tail bounds, tail bounds on the expectation, and in-expectation bounds. Moreover, we show that our framework also allows the derivation of general bounds on any function of the input data and output hypothesis random variables. In particular, these general bounds are shown to subsume, and possibly improve over, several existing PAC-Bayes and data-dependent intrinsic dimension-based bounds, which are recovered as special cases, thus unveiling the unifying character of our approach. For instance, a new data-dependent intrinsic dimension-based bound is established, which connects the generalization error to the optimization trajectories and reveals various interesting connections with the rate-distortion dimension of a process, the Rényi information dimension of a process, and the metric mean dimension.