CUDA: Convolution-based Unlearnable Datasets

Large-scale training of modern deep learning models relies heavily on publicly available data on the web. This potentially unauthorized use of online data raises concerns about data privacy. To address this issue, recent works aim to make data unlearnable for deep learning models by adding small, specially designed noise. However, these methods are vulnerable to adversarial training (AT) and/or are computationally heavy. In this work, we propose a novel, model-free, Convolution-based Unlearnable DAtaset (CUDA) generation technique. CUDA is generated using controlled class-wise convolutions with filters that are randomly generated via a private key. CUDA encourages the network to learn the relation between filters and labels rather than the informative features needed to classify clean data. We provide a theoretical analysis demonstrating that CUDA can successfully poison Gaussian mixture data by reducing the clean data performance of the optimal Bayes classifier. We also empirically demonstrate the effectiveness of CUDA on various datasets (CIFAR-10, CIFAR-100, ImageNet-100, and Tiny-ImageNet) and architectures (ResNet-18, VGG-16, Wide ResNet-34-10, DenseNet-121, DeiT, EfficientNetV2-S, and MobileNetV2). Our experiments show that CUDA is robust to various data augmentations and training approaches such as smoothing, AT with different budgets, transfer learning, and fine-tuning. For instance, training a ResNet-18 on ImageNet-100 CUDA achieves only 8.96$\%$, 40.08$\%$, and 20.58$\%$ clean test accuracy with empirical risk minimization (ERM), $L_{\infty}$ AT, and $L_{2}$ AT, respectively. By comparison, ERM on the clean training data achieves a clean test accuracy of 80.66$\%$. CUDA exhibits an unlearnability effect with ERM even when only a fraction of the training dataset is perturbed. Furthermore, we show that CUDA is robust to adaptive defenses designed specifically to break it.
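To make the idea of class-wise convolutions with key-derived filters concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not specify the filter distribution, kernel size, or normalization, so the uniform off-center weights, the `blur` parameter, the fixed center weight of 1, the per-image rescaling, and the function names `make_class_filters`, `convolve2d_same`, and `apply_classwise_filters` are all assumptions introduced here for illustration.

```python
import numpy as np


def make_class_filters(private_key, num_classes, kernel_size=3, blur=0.3):
    """Draw one random 2-D filter per class from a generator seeded by the key.

    Assumption: off-center weights are uniform in [0, blur) and the center
    weight is fixed to 1. The source does not state the actual distribution.
    """
    rng = np.random.default_rng(private_key)
    filters = rng.uniform(0.0, blur, size=(num_classes, kernel_size, kernel_size))
    center = kernel_size // 2
    filters[:, center, center] = 1.0  # keep the original pixel dominant
    return filters


def convolve2d_same(channel, kernel):
    """'Same'-size 2-D cross-correlation with zero padding (odd kernel size)."""
    k = kernel.shape[0]
    pad = k // 2
    height, width = channel.shape
    padded = np.pad(channel, pad, mode="constant")
    out = np.zeros((height, width), dtype=np.float64)
    # Accumulate one shifted, weighted copy of the image per kernel entry.
    for i in range(k):
        for j in range(k):
            out += kernel[i, j] * padded[i:i + height, j:j + width]
    return out


def apply_classwise_filters(images, labels, filters):
    """Filter every image with the filter assigned to its class label.

    images: (N, H, W, C) non-negative floats; labels: (N,) integer class ids.
    """
    out = np.empty(images.shape, dtype=np.float64)
    for n, (img, y) in enumerate(zip(images, labels)):
        for c in range(img.shape[-1]):
            out[n, ..., c] = convolve2d_same(img[..., c], filters[y])
        # Assumed normalization: rescale each image so its maximum is 1.
        out[n] /= max(out[n].max(), 1e-12)
    return out
```

Because the generator is seeded with the key, the same key always reproduces the same set of filters, while every image of a given class is transformed by the same filter. This gives a network the filter-to-label shortcut that the abstract describes.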
