Among all data augmentation techniques proposed so far, linear interpolation of training samples, also called Mixup, has been found effective for a wide range of applications. Beyond improved predictive performance, Mixup is also an effective technique for improving calibration. However, mixing data carelessly can lead to manifold mismatch, i.e., synthetic data lying outside the original class manifolds, which can degrade calibration. In this work, we show that the likelihood of assigning a wrong label with Mixup increases with the distance between the data points being mixed. Based on this finding, we propose to dynamically change the underlying distributions of the interpolation coefficients depending on the similarity between the samples to mix, and we define a flexible framework to do so without sacrificing diversity. Extensive experiments on classification and regression tasks show that our proposed method improves both the predictive performance and the calibration of models, while being substantially more computationally efficient.
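A minimal sketch of this idea in PyTorch, under illustrative assumptions: the function name `similarity_aware_mixup`, the Gaussian kernel on Euclidean input distance, and the parameters `alpha0` and `sigma` are hypothetical choices for exposition, not the paper's exact formulation. The sketch draws one interpolation coefficient per pair from a Beta distribution whose concentration shrinks as the pair's distance grows.

```python
import torch

def similarity_aware_mixup(x, y, alpha0=1.0, sigma=1.0):
    """Mix a batch with a random permutation of itself, drawing the
    interpolation coefficient per pair from a Beta distribution whose
    concentration depends on how far apart the two samples are.

    With a small concentration a, Beta(a, a) puts most of its mass near
    0 and 1, so dissimilar pairs are barely mixed; similar pairs get a
    broad distribution and are mixed freely. (Illustrative sketch only.)
    """
    idx = torch.randperm(x.size(0))
    x2, y2 = x[idx], y[idx]

    # Distance between each sample and its mixing partner, normalized
    # by the batch mean so the kernel bandwidth is scale-invariant.
    d = (x - x2).flatten(1).norm(dim=1)
    d = d / (d.mean() + 1e-8)

    # Gaussian kernel on distance: similar pairs -> a close to alpha0,
    # distant pairs -> a close to 0 (coefficients pushed toward 0 or 1).
    a = alpha0 * torch.exp(-d ** 2 / (2 * sigma ** 2)) + 1e-4

    lam = torch.distributions.Beta(a, a).sample()        # one lambda per pair
    lam_x = lam.view(-1, *([1] * (x.dim() - 1)))         # broadcast over inputs

    x_mix = lam_x * x + (1 - lam_x) * x2
    y_mix = lam.unsqueeze(1) * y + (1 - lam.unsqueeze(1)) * y2  # y is one-hot
    return x_mix, y_mix
```

In this sketch, distant pairs receive coefficients near 0 or 1, keeping synthetic points close to one of the original samples and so reducing the risk of manifold mismatch, while similar pairs are interpolated across the full range to preserve diversity.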