Black-box Dataset Ownership Verification via Backdoor Watermarking

From arXiv. This paper has been accepted by IEEE TIFS. 15 pages. A preliminary short version of this paper was posted on arXiv (arXiv:2010.05821) and presented at a non-archival NeurIPS workshop (2020).

Deep learning, especially deep neural networks (DNNs), has been widely and successfully adopted in many critical applications for its high effectiveness and efficiency. The rapid development of DNNs has benefited from the existence of high-quality datasets (e.g., ImageNet), which allow researchers and developers to easily verify the performance of their methods. Currently, almost all released datasets require that they be used only for academic or educational purposes, not for commercial purposes without permission. However, there is still no good way to enforce this requirement. In this paper, we formulate the protection of released datasets as verifying whether they were used to train a (suspicious) third-party model, where defenders can only query the model and have no information about its parameters or training details. Based on this formulation, we propose to embed external patterns into a dataset via backdoor watermarking so that its ownership can later be verified. Our method contains two main parts: dataset watermarking and dataset verification. Specifically, we exploit poison-only backdoor attacks (e.g., BadNets) for dataset watermarking and design a hypothesis-test-guided method for dataset verification. We also provide some theoretical analyses of our method. Experiments on multiple benchmark datasets covering different tasks verify the effectiveness of our method. The code for reproducing the main experiments is available at https://github.com/THUYimingLi/DVBW.
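The two parts described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that); it assumes a BadNets-style square patch in the image corner, image arrays of shape (N, H, W), and a one-sided paired test with a margin `tau` on the target-class probabilities returned by the suspicious model. The function names, the default `rate`, `tau`, and `alpha`, and the normal approximation used for the p-value are all assumptions made for illustration.

```python
import math
import numpy as np

def watermark_dataset(images, labels, target_label, rate=0.1,
                      patch_value=1.0, patch_size=3, seed=0):
    """Poison-only watermark (BadNets-style sketch): stamp a patch on a
    random fraction of samples and relabel them to target_label."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n = len(images)
    idx = rng.choice(n, size=int(round(rate * n)), replace=False)
    images[idx, -patch_size:, -patch_size:] = patch_value  # bottom-right patch
    labels[idx] = target_label
    return images, labels, idx

def add_trigger(images, patch_value=1.0, patch_size=3):
    """Stamp the same patch on query images used at verification time."""
    out = images.copy()
    out[:, -patch_size:, -patch_size:] = patch_value
    return out

def verify_ownership(p_benign, p_marked, tau=0.2, alpha=0.05):
    """One-sided paired test (assumed form) of
    H0: mean(p_marked - p_benign) <= tau  vs  H1: the mean exceeds tau.
    p_benign / p_marked are the suspicious model's target-class
    probabilities on benign queries and on their triggered versions."""
    d = np.asarray(p_marked, float) - np.asarray(p_benign, float) - tau
    n = d.size
    # Paired t statistic; undefined if all differences are identical.
    t_stat = d.mean() / (d.std(ddof=1) / math.sqrt(n))
    # Upper-tail p-value via a normal approximation to the t distribution,
    # which is only reasonable when n is not small.
    p_value = 0.5 * math.erfc(t_stat / math.sqrt(2))
    return t_stat, p_value, p_value < alpha
```

In use, the dataset owner would release the output of `watermark_dataset`, later query the suspicious model on benign images and on `add_trigger(...)` versions of them, and pass the two sets of target-class probabilities to `verify_ownership`. Rejecting the null hypothesis suggests the model learned the trigger, that is, it was probably trained on the watermarked data. For an exact p-value, a paired t-test routine from a statistics library could replace the normal approximation.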

