Machine learning models developed to be invariant to certain types of data transformations have demonstrated superior generalization performance in practice. However, the mechanism by which invariance improves generalization is not well understood, which limits our ability to select appropriate data transformations for a given dataset. This paper studies the generalization benefit of model invariance by introducing the sample cover induced by transformations, i.e., a representative subset of a dataset that can approximately recover the whole dataset using transformations. Based on this notion, we refine the generalization bound for invariant models and characterize the suitability of a set of data transformations by the sample covering number induced by transformations, i.e., the smallest size of its induced sample covers. We show that the generalization bound can be tightened for suitable transformations that have a small sample covering number. Moreover, the proposed sample covering number can be evaluated empirically, providing practical guidance for selecting transformations to develop model invariance for better generalization. We evaluate the sample covering numbers for commonly used transformations on multiple datasets and demonstrate that a smaller sample covering number for a set of transformations indicates a smaller gap between the test and training error for invariant models, thus validating our propositions.
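To make the notion of a sample cover concrete, the following is a minimal sketch of how an upper bound on the sample covering number might be estimated with a greedy set-cover heuristic. It is an illustration under stated assumptions, not necessarily the procedure used in the paper: the Euclidean distance, the resolution `eps`, and the names `coverage_matrix`, `greedy_sample_cover`, and `make_rotation` are all hypothetical choices made for this example.

```python
import numpy as np


def coverage_matrix(X, transforms, eps):
    """Return a boolean matrix where covers[i, j] is True when X[i] itself,
    or some transformed copy of X[i], lies within eps of X[j].

    Assumption: closeness is measured with the Euclidean norm.
    """
    n = len(X)
    covers = np.eye(n, dtype=bool)  # every sample trivially covers itself
    for g in transforms:
        GX = np.stack([g(x) for x in X])  # transformed samples, shape (n, d)
        dists = np.linalg.norm(GX[:, None, :] - X[None, :, :], axis=-1)
        covers |= dists <= eps
    return covers


def greedy_sample_cover(X, transforms, eps):
    """Greedily pick samples until every sample is covered.

    The length of the returned index list is an upper bound on the
    sample covering number at resolution eps (greedy is not optimal
    in general).
    """
    covers = coverage_matrix(X, transforms, eps)
    uncovered = np.ones(len(X), dtype=bool)
    cover = []
    while uncovered.any():
        # Number of still-uncovered samples each candidate would cover.
        gains = (covers & uncovered).sum(axis=1)
        best = int(gains.argmax())
        cover.append(best)
        uncovered &= ~covers[best]
    return cover


def make_rotation(theta):
    """Hypothetical transformation: rotate a 2-D point by theta radians."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta), np.cos(theta)]])
    return lambda x: R @ x


# Toy dataset: four points on the unit circle and two on a circle of radius 2.
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0],
              [2.0, 0.0], [0.0, 2.0]])
rotations = [make_rotation(k * np.pi / 2) for k in (1, 2, 3)]

with_rotations = greedy_sample_cover(X, rotations, eps=1e-6)
without_transforms = greedy_sample_cover(X, [], eps=1e-6)
print(len(with_rotations), len(without_transforms))
```

In this toy setting, rotations by multiples of 90 degrees let one point per circle stand in for the others on that circle, so the greedy cover is much smaller than the dataset, whereas with no transformations every sample must be kept. This mirrors the abstract's intuition that suitable transformations are those that induce a small sample covering number.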