The rise of deep learning (DL) has created surging demand for training data, which incentivizes the creators of DL models to trawl the Internet for training material. Meanwhile, users have limited control over whether their data (e.g., facial images) are used to train DL models without their consent, which raises pressing concerns. This work proposes MembershipTracker, a practical data provenance tool that empowers ordinary users to detect the unauthorized use of their data in training DL models. We view data provenance tracing through the lens of membership inference (MI). MembershipTracker consists of (1) a lightweight data marking component that applies small, targeted changes to the target data, which can be strongly memorized by a model trained on them, and (2) a specialized MI-based verification process that audits whether a model exhibits strong memorization of the target samples. MembershipTracker requires users to mark only a small fraction of the data (0.005% to 0.1% of the training set), and it enables them to reliably detect unauthorized use of their data (on average, a 0% false positive rate (FPR) at a 100% true positive rate (TPR)). We show that MembershipTracker is highly effective across various settings, including industry-scale training on the full-size ImageNet-1k dataset. Finally, we evaluate MembershipTracker under multiple classes of countermeasures.
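To make the reported metric concrete, the following minimal sketch shows how an FPR at 100% TPR can be computed from membership scores. It is an illustration only, not MembershipTracker's verification procedure: the function name `fpr_at_full_tpr` and the example scores are hypothetical, and the sketch assumes that a higher score means stronger evidence that the data were used in training.

```python
def fpr_at_full_tpr(member_scores, non_member_scores):
    """Return the false positive rate at the threshold that yields 100% TPR.

    Assumption (not from the paper): each score is a scalar membership
    signal, and higher values indicate stronger evidence of membership.
    """
    if not member_scores or not non_member_scores:
        raise ValueError("both score lists must be non-empty")

    # The loosest threshold that still flags every true member
    # is the lowest score among the members.
    threshold = min(member_scores)

    # Non-members at or above that threshold are false positives.
    false_positives = sum(1 for s in non_member_scores if s >= threshold)
    return false_positives / len(non_member_scores)


# Hypothetical scores: the members are cleanly separated from the non-members.
separated = fpr_at_full_tpr([0.91, 0.97, 0.88], [0.12, 0.40, 0.35, 0.05])

# Hypothetical scores: one of four non-members outscores the weakest member.
overlapping = fpr_at_full_tpr([0.5, 0.9], [0.1, 0.6, 0.2, 0.3])
```

Under these assumptions, a 0% FPR at 100% TPR means that every non-member scores strictly below the lowest-scoring member, so a single threshold flags all marked users without falsely accusing any model that was not trained on their data.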