Data Isotopes for Data Provenance in DNNs

Today, creators of data-hungry deep neural networks (DNNs) scour the Internet for training fodder, leaving users with little control over or knowledge of when their data is appropriated for model training. To empower users to counteract unwanted data use, we design, implement, and evaluate a practical system that enables users to detect whether their data was used to train a DNN model. We show how users can create special data points we call isotopes, which introduce "spurious features" into DNNs during training. With only query access to a trained model, and with no knowledge of the model training process or control over the data labels, a user can apply statistical hypothesis testing to detect whether a model has learned the spurious features associated with their isotopes by training on the user's data. This effectively turns DNNs' vulnerability to memorization and spurious correlations into a tool for data provenance. Our results confirm efficacy in multiple settings, detecting and distinguishing between hundreds of isotopes with high accuracy. We further show that our system works on public ML-as-a-service platforms and on larger models such as those trained on ImageNet, can use physical objects instead of digital marks, and remains generally robust against several adaptive countermeasures.
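The query-only detection step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract says only that "statistical hypothesis testing" is used, so the paired sign-flip permutation test below is a stand-in chosen because it needs nothing beyond the standard library. The names `add_mark`, `isotope_p_value`, `clean_model`, and `isotope_model` are hypothetical, and the two toy "models" are hand-written functions standing in for a real DNN's prediction API.

```python
import random
from typing import Callable, List, Sequence

# A "model" here is anything that maps an input to class probabilities.
Model = Callable[[Sequence[float]], List[float]]


def add_mark(x: Sequence[float]) -> List[float]:
    """Hypothetical isotope mark: overwrite the last feature with 1.0."""
    return list(x[:-1]) + [1.0]


def isotope_p_value(model: Model, inputs: Sequence[Sequence[float]],
                    target: int, n_perm: int = 2000, seed: int = 0) -> float:
    """One-sided p-value for H0: the mark does not raise P(target).

    Uses a paired sign-flip permutation test (an assumption, not
    necessarily the test used in the paper) on the per-input shift in
    the target-class probability caused by adding the mark.
    """
    diffs = [model(add_mark(x))[target] - model(x)[target] for x in inputs]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # Under H0 each paired difference is equally likely to be +d or -d.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if flipped >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def clean_model(x: Sequence[float]) -> List[float]:
    """Toy model that never saw isotopes: it ignores the mark feature."""
    p1 = 0.2 + 0.1 * x[0]
    return [1.0 - p1, p1]


def isotope_model(x: Sequence[float]) -> List[float]:
    """Toy model that learned the spurious feature: the mark boosts class 1."""
    p1 = 0.2 + 0.1 * x[0] + (0.5 if x[-1] >= 0.99 else 0.0)
    return [1.0 - p1, p1]


# Twenty unmarked probe inputs (three random features plus an empty mark slot).
data_rng = random.Random(1)
probes = [[data_rng.random() for _ in range(3)] + [0.0] for _ in range(20)]

p_marked = isotope_p_value(isotope_model, probes, target=1)
p_clean = isotope_p_value(clean_model, probes, target=1)
print(f"model trained on isotopes: p = {p_marked:.4f}")
print(f"clean model:               p = {p_clean:.4f}")
```

A small p-value for a given model suggests that the mark systematically shifts its predictions toward the target class, which is the signature the abstract attributes to training on isotope data. In a real setting the probe inputs, the mark, and the significance threshold would all need to be chosen with care, and the model would be queried through its public prediction interface rather than called as a local function.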

