Did You Train on My Dataset? Towards Public Dataset Protection with Clean-Label Backdoor Watermarking

The vast amount of training data available on the Internet has been a key factor in the success of deep learning models. However, this abundance of publicly available data also raises concerns about unauthorized exploitation of datasets for commercial purposes, which dataset licenses forbid. In this paper, we propose a backdoor-based watermarking approach that serves as a general framework for safeguarding publicly available data. By inserting a small number of watermarking samples into the dataset, our approach causes a model trained on it to implicitly learn a secret function set by the defender. This hidden function can then be used as a watermark to track down third-party models that use the dataset illegally. Unfortunately, existing backdoor insertion methods often add arbitrary, mislabeled data to the training set, which leads to a significant drop in performance and makes the samples easy to detect with anomaly detection algorithms. To overcome this challenge, we introduce a clean-label backdoor watermarking framework that uses imperceptible perturbations in place of mislabeled samples. As a result, the watermarking samples remain consistent with their original labels, making them difficult to detect. Our experiments on text, image, and audio datasets demonstrate that the proposed framework effectively safeguards datasets with minimal impact on original task performance. We also show that adding just 1% of watermarking samples is enough to inject a traceable watermarking function, and that our watermarking samples are stealthy and look benign upon visual inspection.
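The clean-label idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `watermark_dataset`, the `trigger` dictionary, the `epsilon` bound, and the toy data are all assumptions made for illustration, and bounded uniform noise is used as a simple stand-in for whatever imperceptible perturbation the authors actually compute. What the sketch does show is the defining property: only samples that already carry the target label are modified, about 1% of the dataset is touched, and no label is changed.

```python
import random


def watermark_dataset(samples, labels, target_label, trigger,
                      rate=0.01, epsilon=0.03, seed=0):
    """Embed a clean-label watermark into a copy of the dataset.

    samples: list of equal-length lists of floats in [0, 1]
    trigger: dict mapping feature index -> value (hypothetical trigger pattern)
    Returns (watermarked samples, unchanged labels, indices that were modified).
    """
    rng = random.Random(seed)
    # Only samples that already have the target label are eligible,
    # so every watermarked sample keeps its original, correct label.
    candidates = [i for i, y in enumerate(labels) if y == target_label]
    budget = min(len(candidates), max(1, round(rate * len(samples))))
    chosen = set(rng.sample(candidates, budget))

    watermarked = []
    for i, x in enumerate(samples):
        if i not in chosen:
            watermarked.append(list(x))
            continue
        # Assumption: uniform noise within +/- epsilon stands in for the
        # imperceptible perturbation; values are clipped back to [0, 1].
        x_w = [min(1.0, max(0.0, v + rng.uniform(-epsilon, epsilon))) for v in x]
        # Stamp the secret trigger pattern onto the perturbed sample.
        for idx, val in trigger.items():
            x_w[idx] = val
        watermarked.append(x_w)
    return watermarked, list(labels), sorted(chosen)


# Toy usage: 200 samples with 16 features each, 4 classes.
data_rng = random.Random(1)
samples = [[data_rng.random() for _ in range(16)] for _ in range(200)]
labels = [i % 4 for i in range(200)]
trigger = {14: 1.0, 15: 1.0}  # hypothetical trigger: last two features set to 1

wm_samples, wm_labels, chosen = watermark_dataset(
    samples, labels, target_label=2, trigger=trigger
)
```

A defender would later query a suspect model with trigger-stamped inputs and check whether it favors the target label more than an unwatermarked model would; that verification step is not shown here because the abstract does not specify it.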
