Untargeted Backdoor Watermark: Towards Harmless and Stealthy Dataset Copyright Protection

From arXiv. This work was accepted at NeurIPS 2022 (selected as an Oral paper, top 2%). The first two authors contributed equally to this work. 25 pages. Some typos in the previous version have been fixed.

Deep neural networks (DNNs) have demonstrated their superiority in practice. Arguably, the rapid development of DNNs has benefited largely from high-quality (open-sourced) datasets, on which researchers and developers can easily evaluate and improve their learning methods. Since data collection is usually time-consuming or even expensive, protecting the copyright of these datasets is of great significance and worth further exploration. In this paper, we revisit dataset ownership verification. We find that existing verification methods introduce new security risks in DNNs trained on the protected dataset, due to the targeted nature of poison-only backdoor watermarks. To alleviate this problem, we explore the untargeted backdoor watermarking scheme, where the abnormal model behaviors are not deterministic. Specifically, we introduce two dispersibilities and prove their correlation, based on which we design the untargeted backdoor watermark under both poisoned-label and clean-label settings. We also discuss how to use the proposed untargeted backdoor watermark for dataset ownership verification. Experiments on benchmark datasets verify the effectiveness of our methods and their resistance to existing backdoor defenses. Our code is available at https://github.com/THUYimingLi/Untargeted_Backdoor_Watermark.
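To make the poisoned-label idea concrete, here is a minimal sketch and not the authors' implementation. It assumes that one simple way to obtain non-deterministic behavior is to stamp a fixed trigger patch on a small fraction of samples and resample their labels uniformly at random, so that no single target class exists. The helper names (`stamp_trigger`, `watermark_dataset`, `label_entropy`), the bottom-right patch, and the watermarking rate are all hypothetical. The abstract does not define the paper's two dispersibilities, so the normalized label entropy below is only an illustrative stand-in for "how dispersed are the predictions on watermarked samples".

```python
import math
import random
from collections import Counter


def stamp_trigger(image, patch_size=2, value=1.0):
    """Return a copy of a 2-D image (list of rows) with a bottom-right patch set to `value`."""
    stamped = [row[:] for row in image]
    for r in range(len(stamped) - patch_size, len(stamped)):
        for c in range(len(stamped[r]) - patch_size, len(stamped[r])):
            stamped[r][c] = value
    return stamped


def watermark_dataset(images, labels, num_classes, rate=0.1, seed=0):
    """Stamp the trigger on a random `rate` fraction of samples and
    resample their labels uniformly (assumption: no fixed target class)."""
    rng = random.Random(seed)
    n_marked = int(round(rate * len(images)))
    marked = set(rng.sample(range(len(images)), n_marked))
    new_images, new_labels = [], []
    for i, (img, lab) in enumerate(zip(images, labels)):
        if i in marked:
            new_images.append(stamp_trigger(img))
            new_labels.append(rng.randrange(num_classes))
        else:
            new_images.append([row[:] for row in img])
            new_labels.append(lab)
    return new_images, new_labels, sorted(marked)


def label_entropy(predictions, num_classes):
    """Normalized Shannon entropy of predicted labels in [0, 1].
    Illustrative stand-in only; not the paper's dispersibility definition."""
    if num_classes < 2 or not predictions:
        raise ValueError("need at least two classes and one prediction")
    total = len(predictions)
    counts = Counter(predictions)
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(num_classes)
```

Under these assumptions, a model trained on the watermarked data would be expected to produce scattered predictions on trigger-stamped inputs (entropy near 1), whereas a targeted backdoor would collapse them onto one class (entropy near 0). The clean-label setting mentioned in the abstract is not covered by this sketch.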

