Minimum Description Length and Generalization Guarantees for Representation Learning

A major challenge in designing efficient statistical supervised learning algorithms is finding representations that perform well not only on the available training samples but also on unseen data. While representation learning has attracted much interest, most existing approaches are heuristic, and very little is known about their theoretical generalization guarantees. In this paper, we establish a compressibility framework that allows us to derive upper bounds on the generalization error of a representation learning algorithm in terms of the "Minimum Description Length" (MDL) of the labels or the latent variables (representations). In the related literature, the mutual information between the encoder's input and the representation is often believed to reflect the algorithm's generalization capability, but it in fact falls short of doing so. Our new bounds instead involve the "multi-letter" relative entropy between the distribution of the representations (or labels) of the training and test sets and a fixed prior. In particular, these new bounds reflect the structure of the encoder and are not vacuous for deterministic algorithms. Our compressibility approach, which is information-theoretic in nature, builds upon that of Blum-Langford for PAC-MDL bounds and introduces two essential ingredients: block-coding and lossy-compression. The latter allows our approach to subsume the so-called geometrical compressibility as a special case. To the best of the authors' knowledge, the established generalization bounds are the first of their kind for Information Bottleneck (IB) type encoders and representation learning. Finally, we partly exploit the theoretical results by introducing a new data-dependent prior. Numerical simulations illustrate the advantages of such well-chosen priors over the classical priors used in IB.
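To make the notion of a relative entropy between representations and a fixed prior concrete, the following minimal sketch computes the closed-form KL divergence in the simplest single-letter setting. It assumes a diagonal Gaussian encoder output and a standard Gaussian prior, a choice common in variational IB implementations. The function names and sample values are hypothetical, and this is neither the multi-letter quantity appearing in the bounds nor the data-dependent prior introduced in the paper.

```python
import math


def gaussian_kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in nats.

    Assumption (not from the paper): diagonal Gaussian encoder output
    and a standard Gaussian prior.
    """
    if len(mu) != len(sigma):
        raise ValueError("mu and sigma must have the same length")
    total = 0.0
    for m, s in zip(mu, sigma):
        if s <= 0:
            raise ValueError("standard deviations must be positive")
        var = s * s
        # Per-dimension term: 0.5 * (sigma^2 + mu^2 - 1 - ln sigma^2)
        total += 0.5 * (var + m * m - 1.0 - math.log(var))
    return total


def average_kl(batch):
    """Average the per-sample KL over a batch of (mu, sigma) pairs."""
    return sum(gaussian_kl_to_standard_normal(mu, s) for mu, s in batch) / len(batch)


if __name__ == "__main__":
    # Hypothetical encoder outputs for two training samples.
    batch = [([0.0, 0.0], [1.0, 1.0]), ([1.0, 0.0], [1.0, 1.0])]
    print(average_kl(batch))
```

The divergence is zero exactly when the encoder output matches the prior and grows as the representation distribution moves away from it, which is the sense in which such a term measures the description length of the representations relative to the prior.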
