It is important that consumers and regulators can verify the provenance of large neural models to evaluate their capabilities and risks. We introduce the concept of a "Proof-of-Training-Data": any protocol that allows a model trainer to convince a Verifier of the training data that produced a set of model weights. Such protocols could verify the amount and kind of data and compute used to train the model, including whether it was trained on specific harmful or beneficial data sources. We explore efficient verification strategies for Proof-of-Training-Data that are compatible with most current large-model training procedures. These include a method for the model trainer to verifiably pre-commit to a random seed used in training, and a method that exploits models' tendency to temporarily overfit to training data in order to detect whether a given data point was included in training. We show experimentally that our verification procedures can catch a wide variety of attacks, including all known attacks from the Proof-of-Learning literature.
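To make the idea of pre-committing to a random seed concrete, the following is a minimal sketch of a generic hash-based commit-and-reveal scheme. It is an illustration under stated assumptions, not the construction used in this work: the function names `commit_seed` and `verify_seed`, the 64-bit seed encoding, and the choice of SHA-256 with a random 32-byte nonce are all hypothetical choices made for the example.

```python
import hashlib
import hmac
import secrets
from typing import Optional, Tuple


def commit_seed(seed: int, nonce: Optional[bytes] = None) -> Tuple[str, bytes]:
    """Commit to a training seed before training starts.

    Hypothetical helper: returns a hex SHA-256 commitment and the nonce.
    The trainer publishes the commitment and keeps seed and nonce private
    until verification. The seed must fit in an unsigned 64-bit integer.
    """
    if nonce is None:
        # A random nonce keeps a small seed space from being brute-forced.
        nonce = secrets.token_bytes(32)
    payload = nonce + seed.to_bytes(8, "big")
    return hashlib.sha256(payload).hexdigest(), nonce


def verify_seed(commitment: str, seed: int, nonce: bytes) -> bool:
    """Check that a revealed (seed, nonce) pair matches the commitment."""
    payload = nonce + seed.to_bytes(8, "big")
    expected = hashlib.sha256(payload).hexdigest()
    # Constant-time comparison of the two hex digests.
    return hmac.compare_digest(expected, commitment)


if __name__ == "__main__":
    commitment, nonce = commit_seed(1234)
    print(verify_seed(commitment, 1234, nonce))
    print(verify_seed(commitment, 1235, nonce))
```

In this sketch the Verifier only needs the published commitment ahead of time. Once the trainer reveals the seed and nonce, any seed other than the committed one fails the check, so the seed cannot be chosen after the fact to suit a particular training outcome.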