We investigate how pair-wise data augmentation techniques like Mixup affect the sample complexity of finding optimal decision boundaries in a binary linear classification problem. For a family of data distributions with a separability constant $\kappa$, we analyze how well the classifier that minimizes the training loss aligns with the classifier that is optimal in test accuracy (i.e., the Bayes optimal classifier). For vanilla training without augmentation, we uncover an interesting phenomenon that we call the curse of separability. As we increase $\kappa$ to make the data distribution more separable, the sample complexity of vanilla training increases exponentially in $\kappa$; perhaps surprisingly, the task of finding optimal decision boundaries becomes harder for more separable distributions. For Mixup training, we show that Mixup mitigates this problem by significantly reducing the sample complexity. To this end, we develop new concentration results applicable to $n^2$ pair-wise augmented data points constructed from $n$ independent data points, by carefully dealing with dependencies between overlapping pairs. Lastly, we study other masking-based Mixup-style techniques and show that they can distort the training loss and make its minimizer converge to a classifier that is suboptimal in test accuracy.
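For concreteness, the pair-wise construction can be sketched as follows. This uses the standard Mixup formulation rather than a definition stated in the abstract, so the notation ($\tilde{x}_{i,j}$, $\tilde{y}_{i,j}$, $\lambda$) is an assumption. Given training samples $(x_1, y_1), \dots, (x_n, y_n)$ and a mixing weight $\lambda \in [0, 1]$, each ordered pair $(i, j)$ yields one augmented point:

$$
\tilde{x}_{i,j} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y}_{i,j} = \lambda y_i + (1 - \lambda) y_j, \qquad i, j \in \{1, \dots, n\}.
$$

Ranging over all ordered pairs gives the $n^2$ augmented points. Two points such as $\tilde{x}_{i,j}$ and $\tilde{x}_{i,k}$ share the sample $x_i$, so they are not independent even though the original $n$ samples are. This is the overlap between pairs that the concentration results must account for.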