Protecting the use of audio datasets is a major concern for data owners, particularly with the recent rise of audio deep learning models. While watermarks can be used to protect the data itself, they do not make it possible to identify a deep learning model trained on a protected dataset. In this paper, we adapt the recently introduced data taggants approach to audio data. Data taggants is a method to verify whether a neural network was trained on a protected image dataset, using only top-$k$ prediction access to the model. It relies on a targeted data poisoning scheme: a small fraction (1%) of the dataset is discreetly altered so as to induce a harmless behavior on out-of-distribution data called keys. We evaluate our method on the Speechcommands and ESC50 datasets with state-of-the-art transformer models, and show that we can detect the use of the dataset with high confidence and without loss of performance. We also show that our method is robust against common data augmentation techniques, making it a practical way to protect audio datasets.