Data augmentation techniques, such as simple image transformations and combinations, are highly effective at improving the generalization of computer vision models, especially when training data is limited. However, such techniques are fundamentally incompatible with differentially private learning approaches, because differentially private learning assumes that each training image's contribution to the learned model is bounded. In this paper, we investigate why naive applications of multi-sample data augmentation techniques, such as mixup, fail to achieve good performance, and we propose two novel data augmentation techniques specifically designed for the constraints of differentially private learning. Our first technique, DP-Mix_Self, achieves state-of-the-art (SoTA) classification performance across a range of datasets and settings by performing mixup on self-augmented data. Our second technique, DP-Mix_Diff, further improves performance by incorporating synthetic data from a pre-trained diffusion model into the mixup process. We open-source the code at https://github.com/wenxuan-Bao/DP-Mix.
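To make the idea of mixup on self-augmented data concrete, the following is a minimal NumPy sketch, not the released DP-Mix implementation. It assumes a hypothetical flip-and-shift augmentation, a Beta-distributed mixing weight, and a stand-in linear model with squared loss; the function names and parameters are illustrative only. Every view and every mix is derived from a single training image, so the averaged gradient remains one per-example contribution whose norm can be clipped. The Gaussian noise that a differentially private optimizer would add after clipping is omitted.

```python
import numpy as np

def random_flip_and_shift(image, rng, max_shift=2):
    """Hypothetical augmentation: random horizontal flip plus a small circular shift."""
    out = image[:, ::-1] if rng.random() < 0.5 else image
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(out, shift=(int(dy), int(dx)), axis=(0, 1))

def self_mixup(image, num_aug, num_mix, rng, alpha=1.0):
    """Return num_aug views of ONE image plus num_mix convex mixes of those views."""
    views = np.stack([random_flip_and_shift(image, rng) for _ in range(num_aug)])
    if num_mix == 0:
        return views
    mixes = []
    for _ in range(num_mix):
        i, j = rng.choice(num_aug, size=2, replace=False)
        lam = rng.beta(alpha, alpha)  # assumed mixing-weight distribution
        mixes.append(lam * views[i] + (1.0 - lam) * views[j])
    return np.concatenate([views, np.stack(mixes)], axis=0)

def per_example_gradient(weights, views, label):
    """Mean gradient of 0.5 * (w.x - y)^2 over all views of a single example.

    All views share the example's label, so no label mixing is needed.
    """
    flat = views.reshape(len(views), -1)
    residual = flat @ weights - label
    return (residual[:, None] * flat).mean(axis=0)

def clip_to_norm(vector, max_norm):
    """Scale a vector so that its L2 norm is at most max_norm."""
    norm = np.linalg.norm(vector)
    return vector * min(1.0, max_norm / (norm + 1e-12))

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))            # one stand-in training image (H, W, C)
batch = self_mixup(image, num_aug=4, num_mix=3, rng=rng)

weights = rng.normal(size=image.size)    # stand-in linear model
grad = per_example_gradient(weights, batch, label=1.0)
clipped = clip_to_norm(grad, max_norm=1.0)
```

By contrast, mixing two different training images would let a single image influence the gradients attributed to more than one example, which is the kind of unbounded contribution that the abstract identifies as problematic for naive mixup.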