Domain Watermark: Effective and Harmless Dataset Copyright Protection is Closed at Hand

The prosperity of deep neural networks (DNNs) owes much to open-source datasets, on which users can evaluate and improve their methods. In this paper, we revisit backdoor-based dataset ownership verification (DOV), which is currently the only feasible approach to protecting the copyright of open-source datasets. We reveal that these methods are fundamentally harmful, since adversaries could exploit them to induce malicious misclassification behaviors in watermarked DNNs. We instead design DOV from another perspective, making watermarked models (those trained on the protected dataset) correctly classify some `hard' samples that the benign model misclassifies. Our method is inspired by the generalization property of DNNs: we find a \emph{hardly-generalized domain} for the original dataset (as its \emph{domain watermark}), which can be easily learned from the protected dataset containing modified samples. Specifically, we formulate domain generation as a bi-level optimization problem. To ensure watermark stealthiness, we propose optimizing a set of visually indistinguishable, clean-label modified data whose effect is similar to that of domain-watermarked samples from the hardly-generalized domain. We also design hypothesis-test-guided ownership verification based on our domain watermark and provide a theoretical analysis of our method. Extensive experiments on three benchmark datasets verify the effectiveness of our method and its resistance to potential adaptive methods. The code for reproducing the main experiments is available at \url{https://github.com/JunfengGo/Domain-Watermark}.
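
The abstract does not state which test statistic the verification uses, so the following is only a minimal sketch of the general idea, assuming a one-sided exact binomial test. The function names `binomial_upper_tail` and `verify_ownership`, and the parameters `baseline_acc` (the accuracy a benign model is assumed to reach on the hard samples) and `alpha` (the significance level), are hypothetical and are not taken from the paper or its repository. The null hypothesis is that the suspect model classifies hard domain-watermarked samples no better than a benign model; rejecting it suggests the model was trained on the protected dataset.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Exact P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def verify_ownership(correct, baseline_acc, alpha=0.01):
    """One-sided test on a suspect model's results for hard samples.

    correct      -- list of bools, True where the suspect model classified
                    a domain-watermarked sample correctly
    baseline_acc -- assumed accuracy of a benign model on those samples
    alpha        -- significance level (assumed value, not from the paper)

    Returns (p_value, rejected). rejected=True means the suspect model is
    significantly more accurate than the assumed benign baseline.
    """
    n = len(correct)
    k = sum(correct)
    p_value = binomial_upper_tail(k, n, baseline_acc)
    return p_value, p_value < alpha


if __name__ == "__main__":
    # Hypothetical outcome: 18 of 20 hard samples classified correctly.
    suspect_results = [True] * 18 + [False] * 2
    p_value, rejected = verify_ownership(suspect_results, baseline_acc=0.2)
    print(f"p-value = {p_value:.3e}, claim ownership = {rejected}")
```

Note the direction of the test: unlike backdoor-based DOV, evidence of ownership here is unusually high accuracy on the hard samples rather than a triggered misclassification.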

