With the increasing adoption of deep learning in speaker verification, large-scale speech datasets have become valuable intellectual property. To audit and prevent unauthorized use of these valuable released datasets, especially in commercial or open-source scenarios, we propose a novel dataset ownership verification method. Our approach introduces a clustering-based backdoor watermark (CBW), enabling dataset owners to determine whether a suspicious third-party model has been trained on a protected dataset under a black-box setting. The CBW method consists of two key stages: dataset watermarking and ownership verification. During watermarking, we implant multiple trigger patterns into the dataset, assigning similar samples (measured by their feature similarity) to the same trigger and dissimilar samples to different triggers. This ensures that any model trained on the watermarked dataset exhibits specific misclassification behaviors when exposed to trigger-embedded inputs. To verify dataset ownership, we design a hypothesis-test-based framework that statistically evaluates whether a suspicious model exhibits the expected backdoor behavior. We conduct extensive experiments on benchmark datasets, verifying the effectiveness and robustness of our method against potential adaptive attacks. The code for reproducing the main experiments is available at https://github.com/Radiant0726/CBW.
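The two stages described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the clustering step is approximated here with plain k-means (farthest-point initialization), and the hypothesis test with a one-sided proportion test under a normal approximation; all function names, the chance rate `p0`, and the critical value are illustrative assumptions.

```python
import numpy as np

def assign_triggers(features, n_triggers, n_iter=20):
    """Stage 1 (watermarking, sketched): cluster samples by feature
    similarity and return one trigger index per sample, so similar
    samples share a trigger and dissimilar ones get different triggers.
    Simple k-means with farthest-point initialization (an assumption,
    standing in for the paper's clustering procedure)."""
    centers = [features[0].astype(float)]
    for _ in range(n_triggers - 1):
        # Pick the point farthest from all current centers as the next center.
        dists = np.min([((features - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(features[np.argmax(dists)].astype(float))
    centers = np.stack(centers)
    for _ in range(n_iter):
        # Assign each sample to its nearest center, then update centers.
        labels = np.argmin(((features[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(n_triggers):
            members = features[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return labels  # labels[i] = trigger index embedded into sample i

def verify_ownership(n_misclassified, n_trials, p0=0.1, z_crit=1.645):
    """Stage 2 (verification, sketched): one-sided proportion test.
    Null hypothesis: trigger-embedded inputs are misclassified only at
    the chance rate p0. Rejecting it (observed rate significantly
    higher) is evidence the protected dataset was used for training."""
    p_hat = n_misclassified / n_trials
    z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n_trials)
    return z > z_crit  # True -> flag the suspicious model
```

For example, `assign_triggers(features, 4)` would partition a set of feature vectors into four trigger groups, and `verify_ownership(90, 100)` flags a model that misclassifies 90 of 100 trigger-embedded probes.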