Unlearning Traces the Influential Training Data of Language Models

Identifying the training datasets that influence a language model's outputs is essential for minimizing the generation of harmful content and enhancing its performance. Ideally, we can measure the influence of each dataset by removing it from training; however, it is prohibitively expensive to retrain a model multiple times. This paper presents UnTrac: unlearning traces the influence of a training dataset on the model's performance. UnTrac is extremely simple; each training dataset is unlearned by gradient ascent, and we evaluate how much the model's predictions change after unlearning. Furthermore, we propose a more scalable approach, UnTrac-Inv, which unlearns a test dataset and evaluates the unlearned model on training datasets. UnTrac-Inv resembles UnTrac, while being efficient for massive training datasets. In the experiments, we examine if our methods can assess the influence of pretraining datasets on generating toxic, biased, and untruthful content. Our methods estimate their influence much more accurately than existing methods while requiring neither excessive memory space nor multiple checkpoints.
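The two procedures described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: a two-feature logistic classifier stands in for the language model, influence is measured as the change in mean loss on the evaluation set, and the names (`loss`, `grad`, `step`, `untrac`, `untrac_inv`), the learning rates, and the step counts are hypothetical.

```python
import math

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, data):
    """Mean logistic loss of linear classifier w on (x, y) pairs, y in {0, 1}."""
    total = 0.0
    for x, y in data:
        p = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        total -= y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12)
    return total / len(data)

def grad(w, data):
    """Gradient of the mean logistic loss with respect to w."""
    g = [0.0] * len(w)
    for x, y in data:
        p = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for i, xi in enumerate(x):
            g[i] += (p - y) * xi
    return [gi / len(data) for gi in g]

def step(w, data, lr, steps, sign):
    """sign=-1 runs gradient descent (learning); sign=+1 runs gradient ascent (unlearning)."""
    w = list(w)
    for _ in range(steps):
        g = grad(w, data)
        w = [wi + sign * lr * gi for wi, gi in zip(w, g)]
    return w

def untrac(w, train_set, test_set, lr=0.1, steps=5):
    """Unlearn one training dataset, then measure the change in test loss."""
    unlearned = step(w, train_set, lr, steps, sign=+1)
    return loss(unlearned, test_set) - loss(w, test_set)

def untrac_inv(w, train_set, test_set, lr=0.1, steps=5):
    """Unlearn the test dataset once, then measure the change in loss on a training dataset."""
    unlearned = step(w, test_set, lr, steps, sign=+1)
    return loss(unlearned, train_set) - loss(w, train_set)

# Toy data (assumed): dataset A uses feature 0, dataset B uses feature 1,
# and the test set resembles A.
data_a = [([1.0, 0.0], 1), ([-1.0, 0.0], 0)]
data_b = [([0.0, 1.0], 1), ([0.0, -1.0], 0)]
test = [([2.0, 0.0], 1), ([-2.0, 0.0], 0)]

# "Pretrain" on the union of both datasets with plain gradient descent.
w = step([0.0, 0.0], data_a + data_b, lr=0.5, steps=200, sign=-1)

infl_a, infl_b = untrac(w, data_a, test), untrac(w, data_b, test)
inv_a, inv_b = untrac_inv(w, data_a, test), untrac_inv(w, data_b, test)
print("UnTrac     A, B:", infl_a, infl_b)
print("UnTrac-Inv A, B:", inv_a, inv_b)
```

In this toy setting, dataset A should receive the larger score under both variants, because only A shares a feature with the test set. Note the cost difference the abstract points to: the first variant runs one unlearning pass per training dataset, while the inverse variant runs a single unlearning pass on the test set and then only needs forward evaluations on each training dataset.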

