Among all data augmentation techniques proposed so far, linear interpolation of training samples, also called Mixup, has been found to be effective for a wide range of applications. Along with improved performance, Mixup is also a good technique for improving calibration and predictive uncertainty. However, mixing data carelessly can lead to manifold intrusion, i.e., conflicts between the synthetic labels assigned and the true label distributions, which can deteriorate calibration. In this work, we argue that the likelihood of manifold intrusion increases with the distance between the data to mix. To this end, we propose to dynamically change the underlying distributions of interpolation coefficients depending on the similarity between the samples to mix, and define a flexible framework to do so without losing diversity. We provide extensive experiments for classification and regression tasks, showing that our proposed method improves the performance and calibration of models, while being much more efficient. The code for our work is available at https://github.com/qbouniot/sim_kernel_mixup.
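The idea of similarity-dependent interpolation can be sketched as follows. This is a minimal illustrative example, not the authors' exact formulation: it assumes a Gaussian kernel on the Euclidean distance between samples (with a hypothetical bandwidth parameter `tau`) and uses the resulting similarity as the concentration parameter of a symmetric Beta distribution, so that dissimilar pairs draw interpolation coefficients close to 0 or 1 (weak mixing, less risk of manifold intrusion) while similar pairs mix more strongly.

```python
import numpy as np

def similarity_kernel_mixup(x1, x2, y1, y2, tau=1.0, rng=None):
    """Mix two samples with a similarity-dependent interpolation coefficient.

    Illustrative sketch only: the kernel choice and the mapping from
    similarity to the Beta concentration are assumptions, not the
    paper's exact method.
    """
    rng = rng if rng is not None else np.random.default_rng()
    # Similarity in (0, 1] via a Gaussian kernel on the sample distance.
    dist = np.linalg.norm(np.asarray(x1) - np.asarray(x2))
    sim = np.exp(-dist**2 / (2.0 * tau**2))
    # Symmetric Beta(alpha, alpha): large alpha concentrates lam near 0.5
    # (strong mixing for similar pairs); small alpha pushes lam toward
    # 0 or 1 (weak mixing for dissimilar pairs).
    alpha = max(sim, 1e-3)
    lam = rng.beta(alpha, alpha)
    x = lam * np.asarray(x1) + (1.0 - lam) * np.asarray(x2)
    y = lam * y1 + (1.0 - lam) * y2
    return x, y
```

Here the same coefficient interpolates both inputs and labels, as in standard Mixup; only the distribution it is drawn from adapts to the pair being mixed.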