The geometry of invariant learning: an information-theoretic analysis of data augmentation and generalization

Data augmentation is one of the most widely used techniques for improving generalization in modern machine learning, often justified by its ability to promote invariance to label-irrelevant transformations. However, its theoretical role remains only partially understood. In this work, we propose an information-theoretic framework that systematically accounts for the effect of augmentation on generalization and invariance learning. Our approach builds upon mutual information-based bounds, which relate the generalization gap to the amount of information a learning algorithm retains about its training data. We extend this framework by modeling the augmented distribution as a composition of the original data distribution with a distribution over transformations, which naturally induces an orbit-averaged loss function. Under mild sub-Gaussian assumptions on the loss function and the augmentation process, we derive a new generalization bound that decomposes the expected generalization gap into three interpretable terms: (1) a distributional divergence between the original and augmented data, (2) a stability term measuring the algorithm's dependence on the training data, and (3) a sensitivity term capturing the effect of augmentation variability. To connect our bounds to the geometry of the augmentation group, we introduce the notion of group diameter, defined as the maximal perturbation that augmentations can induce in the input space. The group diameter provides a unified control parameter that bounds all three terms and highlights an intrinsic trade-off: small diameters preserve data fidelity but offer limited regularization, while large diameters enhance stability at the cost of increased bias and sensitivity. We validate our theoretical bounds with numerical experiments, demonstrating that they reliably track and predict the behavior of the true generalization gap.
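
To make two of the objects named above concrete, the following is a minimal sketch of an orbit-averaged loss and an empirical group diameter. It is not the paper's implementation. It assumes planar rotations as the augmentation family, the Euclidean norm on the input space, a fixed linear predictor, and a squared loss; the names `rotate`, `orbit_averaged_loss`, `empirical_group_diameter`, and `squared_loss` are hypothetical and chosen only for illustration. The diameter here is estimated over a finite sample and a finite grid of angles, so it is only a lower estimate of the supremum in the definition.

```python
import numpy as np

def rotate(x, theta):
    """Rotate 2-D points x (shape (n, 2)) by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    rotation = np.array([[c, -s], [s, c]])
    return x @ rotation.T

def squared_loss(w, x, y):
    """Mean squared error of the linear predictor x @ w (an assumed loss)."""
    return float(np.mean((x @ w - y) ** 2))

def orbit_averaged_loss(loss_fn, w, x, y, thetas):
    """Average of loss_fn over the sampled transformations g: mean_g loss(w, g.x, y)."""
    return float(np.mean([loss_fn(w, rotate(x, t), y) for t in thetas]))

def empirical_group_diameter(x, thetas):
    """Largest displacement ||g.x - x|| over the given points and angles."""
    return float(max(np.linalg.norm(rotate(x, t) - x, axis=1).max() for t in thetas))

# Toy data: labels come from a linear model, which is NOT rotation invariant,
# so wider rotation ranges move the augmented data further from the original.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
w_true = np.array([1.0, -0.5])
y = x @ w_true

for max_angle in (0.0, 0.1, 0.5, 1.0):
    thetas = np.linspace(-max_angle, max_angle, 33)  # symmetric grid of angles
    diameter = empirical_group_diameter(x, thetas)
    avg_loss = orbit_averaged_loss(squared_loss, w_true, x, y, thetas)
    print(f"max_angle={max_angle:.1f}  diameter={diameter:.3f}  orbit-avg loss={avg_loss:.3f}")
```

In this toy setting the angle range plays the role of the control parameter: widening it increases the empirical diameter, and because the labels are not rotation invariant, it also increases the orbit-averaged loss of the original predictor. This illustrates only the fidelity side of the trade-off; it does not evaluate the stability or sensitivity terms of the bound.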
