Deep learning, especially deep neural networks (DNNs), has been widely and successfully adopted in many critical applications owing to its effectiveness and efficiency. The rapid development of DNNs has benefited from high-quality datasets ($e.g.$, ImageNet), which allow researchers and developers to easily verify the performance of their methods. Almost all released datasets currently stipulate that, without permission, they may be used only for academic or educational purposes and not for commercial ones. However, there is still no good way to enforce this restriction. In this paper, we formulate the protection of released datasets as verifying whether they were used to train a (suspicious) third-party model, where defenders can only query the model and have no information about its parameters or training details. Based on this formulation, we propose to embed external patterns into a dataset via backdoor watermarking so that its ownership can be verified. Our method consists of two main parts: dataset watermarking and dataset verification. Specifically, we exploit poison-only backdoor attacks ($e.g.$, BadNets) for dataset watermarking and design a hypothesis-test-guided method for dataset verification. We also provide theoretical analyses of our method. Experiments on multiple benchmark datasets across different tasks verify its effectiveness. The code for reproducing the main experiments is available at \url{https://github.com/THUYimingLi/DVBW}.
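To make the two parts described above concrete, the following is a minimal sketch and not the released implementation. It assumes a BadNets-style patch trigger for watermarking and a one-sided paired test on the target-class probabilities returned by the queried model. All function names, the `margin` and `alpha` parameters, and the use of a normal approximation in place of a Student's t distribution are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def stamp_trigger(image, patch_size=2, value=1.0):
    """Return a copy of a 2-D image (list of rows) with a bottom-right patch set to `value`."""
    stamped = [row[:] for row in image]
    for r in range(-patch_size, 0):
        for c in range(-patch_size, 0):
            stamped[r][c] = value
    return stamped


def watermark_dataset(images, labels, target_label, rate=0.1, seed=0):
    """Poison-only watermarking (assumed BadNets-style): stamp a fraction
    `rate` of the samples and relabel them as `target_label`."""
    rng = random.Random(seed)
    n_marked = int(round(rate * len(images)))
    marked = set(rng.sample(range(len(images)), n_marked))
    new_images = [stamp_trigger(x) if i in marked else x for i, x in enumerate(images)]
    new_labels = [target_label if i in marked else y for i, y in enumerate(labels)]
    return new_images, new_labels, marked


def verify(p_benign, p_marked, margin=0.2, alpha=0.05):
    """One-sided paired test of H0: mean(p_marked - p_benign) <= margin.

    p_benign[i] / p_marked[i] are the target-class probabilities the suspicious
    model returns for a benign image and for its trigger-stamped version.
    The p-value uses a normal approximation (an assumption of this sketch).
    Returns (p_value, reject_h0); rejecting H0 suggests the dataset was used.
    """
    diffs = [w - b - margin for b, w in zip(p_benign, p_marked)]
    avg = mean(diffs)
    std_err = stdev(diffs) / math.sqrt(len(diffs))
    if std_err == 0:
        # Degenerate case: all differences are identical.
        p_value = 0.0 if avg > 0 else 1.0
    else:
        p_value = 1.0 - NormalDist().cdf(avg / std_err)
    return p_value, p_value < alpha
```

In use, the dataset owner would release the output of `watermark_dataset`, later query a suspicious model with paired benign and trigger-stamped images, and pass the resulting target-class probabilities to `verify`. The margin keeps small, incidental shifts in the target-class probability from counting as evidence of dataset use.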