We consider the problem of a training data proof, where a data creator or owner wants to demonstrate to a third party that some machine learning model was trained on their data. Training data proofs play a key role in recent lawsuits against foundation models trained on web-scale data. Many prior works suggest instantiating training data proofs with membership inference attacks. We argue that this approach is fundamentally unsound: to provide convincing evidence, the data creator needs to demonstrate that their attack has a low false positive rate, i.e., that the attack's output is unlikely under the null hypothesis that the model was not trained on the target data. Yet, sampling from this null hypothesis is impossible, as we do not know the exact contents of the training set, nor can we (efficiently) retrain a large foundation model. We conclude by offering two paths forward: we show that data extraction attacks and membership inference on special canary data can be used to construct sound training data proofs.
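To make the canary-based path concrete, here is a minimal sketch of how a sound test could be constructed. The idea is that if the data owner plants random canaries in their data before publication, then under the null hypothesis (the model never trained on the data) the planted canaries are exchangeable with fresh held-out canaries, so a permutation test yields a valid p-value, i.e., a provable bound on the false positive rate. The function name and the specific permutation-test instantiation below are illustrative assumptions, not the exact procedure from the text.

```python
import random

def canary_pvalue(inserted_scores, heldout_scores, trials=20_000, seed=0):
    """One-sided permutation test for a canary-based training data proof.

    inserted_scores: model scores (e.g., losses; lower = more memorized)
    on canaries the data owner planted in their published data.
    heldout_scores: scores on fresh canaries drawn from the same
    distribution but never published.

    Under the null hypothesis that the model was not trained on the
    owner's data, the two groups are exchangeable, so the permutation
    p-value is valid by construction -- no retraining or knowledge of
    the training set is needed.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility (illustrative choice)
    k = len(inserted_scores)
    observed = sum(inserted_scores) / k
    pooled = list(inserted_scores) + list(heldout_scores)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        # Under the null, a random k-subset should score as low as
        # the planted canaries about as often as chance allows.
        if sum(pooled[:k]) / k <= observed:
            hits += 1
    # Add-one correction keeps the estimate a valid p-value.
    return (hits + 1) / (trials + 1)
```

If the model memorized the planted canaries, their losses will be systematically lower than the held-out canaries' and the p-value will be small; a small p-value then directly certifies a low false positive rate for the proof.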