Slicing Mutual Information Generalization Bounds for Neural Networks

The ability of machine learning (ML) algorithms to generalize to unseen data has been studied through the lens of information theory, by bounding the generalization error with the input-output mutual information (MI), i.e., the MI between the training data and the learned hypothesis. Yet these bounds have limited practicality for modern ML applications (e.g., deep learning), because MI is difficult to evaluate in high dimensions. Motivated by recent findings on the compressibility of neural networks, we consider algorithms that operate by slicing the parameter space, i.e., that are trained on random lower-dimensional subspaces. We introduce new, tighter information-theoretic generalization bounds tailored to such algorithms, demonstrating that slicing improves generalization. Our bounds offer significant computational and statistical advantages over standard MI bounds, as they rely on scalable alternative measures of dependence, namely disintegrated mutual information and $k$-sliced mutual information. We then extend our analysis to algorithms whose parameters need not lie exactly on random subspaces, by leveraging rate-distortion theory. This strategy yields generalization bounds that incorporate a distortion term measuring model compressibility under slicing, thereby tightening existing bounds without compromising performance or requiring model compression. Building on this, we propose a regularization scheme that enables practitioners to control generalization through compressibility. Finally, we empirically validate our results and compute non-vacuous information-theoretic generalization bounds for neural networks, a task that was previously out of reach.
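
To make the idea of slicing the parameter space concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a linear least-squares model, a random subspace drawn by orthonormalizing a Gaussian matrix, and plain gradient descent on the low-dimensional coordinates. The names `random_subspace`, `train_sliced_linear`, and `slicing_distortion` are hypothetical, and the Euclidean projection distance is used only as an illustrative stand-in for a distortion term; the abstract does not specify the exact measure.

```python
import numpy as np


def random_subspace(D, d, rng):
    """Draw a D x d matrix with orthonormal columns (assumed sampling scheme)."""
    gaussian = rng.standard_normal((D, d))
    theta, _ = np.linalg.qr(gaussian)  # reduced QR: theta has shape (D, d)
    return theta


def train_sliced_linear(X, y, d, steps=500, lr=0.1, seed=0):
    """Fit least squares with weights constrained to w = theta @ w_low."""
    rng = np.random.default_rng(seed)
    n, D = X.shape
    theta = random_subspace(D, d, rng)
    w_low = np.zeros(d)      # only these d coordinates are trained
    Z = X @ theta            # inputs as seen through the random subspace
    for _ in range(steps):
        grad = Z.T @ (Z @ w_low - y) / n  # gradient of 0.5 * mean squared error
        w_low -= lr * grad
    return theta, w_low


def slicing_distortion(w, theta):
    """Euclidean distance from w to its projection onto span(theta).

    Illustrative proxy only; not necessarily the paper's distortion measure.
    """
    return np.linalg.norm(w - theta @ (theta.T @ w))


# Synthetic data, purely for illustration.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 50))
w_true = rng.standard_normal(50)
y = X @ w_true

theta, w_low = train_sliced_linear(X, y, d=5)
w_sliced = theta @ w_low  # full-dimensional weights that lie on the subspace

initial_mse = np.mean(y ** 2)                  # loss at w_low = 0
final_mse = np.mean((X @ w_sliced - y) ** 2)   # loss after training
dist_sliced = slicing_distortion(w_sliced, theta)  # numerically zero
dist_free = slicing_distortion(w_true, theta)      # positive for a generic vector
```

In this sketch the learned hypothesis is fully described by the `d` trained coordinates given `theta`, which is why dependence measures in the sliced setting involve a much lower-dimensional object than the full weight vector. Weights that do not lie on the subspace, such as `w_true` here, incur a nonzero projection distance, which gives an intuition for the kind of quantity a distortion term could capture.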
