Mixup is a widely adopted data augmentation technique known for enhancing the generalization of machine learning models by interpolating between data points. Despite its success and popularity, limited attention has been given to understanding the statistical properties of the synthetic data it generates. In this paper, we delve into the theoretical underpinnings of mixup, specifically its effects on the statistical structure of synthesized data. We demonstrate that while mixup improves model performance, it can distort key statistical properties such as variance, potentially leading to unintended consequences in data synthesis. To address this, we propose a novel mixup method that incorporates a generalized and flexible weighting scheme, better preserving the original data's structure. Through theoretical developments, we provide conditions under which our proposed method maintains the (co)variance and distributional properties of the original dataset. Numerical experiments confirm that the new approach not only preserves the statistical characteristics of the original data but also sustains model performance across repeated synthesis, alleviating concerns of model collapse identified in previous research.
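The variance distortion described above can be illustrated with a minimal sketch of standard mixup. This is a generic NumPy illustration, not the paper's proposed method: with interpolation weights λ ~ Beta(α, α), each synthetic point λx_i + (1−λ)x_j has variance roughly E[λ² + (1−λ)²] times the original variance, which is 2/3 when α = 1 (uniform weights).

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(X, alpha=1.0, rng=rng):
    """Standard mixup: convex combinations of random pairs
    with per-pair weights drawn from Beta(alpha, alpha)."""
    n = X.shape[0]
    lam = rng.beta(alpha, alpha, size=(n, 1))
    idx = rng.permutation(n)  # random pairing of points
    return lam * X + (1 - lam) * X[idx]

# Hypothetical one-dimensional data for illustration.
X = rng.normal(loc=2.0, scale=3.0, size=(100_000, 1))
X_mix = mixup(X, alpha=1.0)

# The mean is preserved, but the variance shrinks by ~2/3:
print("original variance:", X.var())
print("mixed variance:   ", X_mix.var())
```

Repeating this synthesis step and re-mixing the output compounds the shrinkage geometrically, which is one route to the model-collapse concern the abstract mentions.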