Identifying the training datasets that influence a language model's outputs is essential for minimizing the generation of harmful content and improving the model's performance. Ideally, we could measure the influence of each dataset by removing it from training and retraining the model; however, retraining a model once per dataset is prohibitively expensive. This paper presents UnTrac: unlearning traces the influence of a training dataset on the model's performance. UnTrac is simple: each training dataset is unlearned by gradient ascent, and we evaluate how much the model's predictions change after unlearning. Furthermore, we propose a more scalable approach, UnTrac-Inv, which unlearns a test dataset and evaluates the unlearned model on the training datasets. UnTrac-Inv behaves similarly to UnTrac while remaining efficient for massive training datasets. In the experiments, we examine whether our methods can assess the influence of pretraining datasets on the generation of toxic, biased, and untruthful content. Our methods estimate this influence much more accurately than existing methods while requiring neither excessive memory nor multiple checkpoints.
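The two procedures described in the abstract can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a one-parameter linear model with squared-error loss in place of a language model, and the dataset names (`helpful`, `conflicting`), the learning rates, the step counts, and the scoring convention (loss after unlearning minus loss before) are all hypothetical choices made for illustration.

```python
# Toy sketch of UnTrac and UnTrac-Inv on a one-parameter model y = w * x.
# Everything here (model, data, hyperparameters) is an illustrative assumption.

def loss(w, data):
    """Mean squared error of the model y = w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def grad(w, data):
    """Derivative of the mean squared error with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)

def descend(w, data, lr, steps):
    """Ordinary training: gradient descent on the loss."""
    for _ in range(steps):
        w -= lr * grad(w, data)
    return w

def ascend(w, data, lr, steps):
    """Unlearning: gradient ascent on the loss."""
    for _ in range(steps):
        w += lr * grad(w, data)
    return w

def untrac(w, train_sets, test_set, lr=0.01, steps=3):
    """Unlearn each training dataset separately, then measure the change in test loss."""
    base = loss(w, test_set)
    return {
        name: loss(ascend(w, data, lr, steps), test_set) - base
        for name, data in train_sets.items()
    }

def untrac_inv(w, train_sets, test_set, lr=0.01, steps=3):
    """Unlearn the test dataset once, then measure the loss change on each training dataset."""
    unlearned = ascend(w, test_set, lr, steps)
    return {
        name: loss(unlearned, data) - loss(w, data)
        for name, data in train_sets.items()
    }

# Hypothetical data: "helpful" matches the test distribution (y = 2x),
# while "conflicting" pulls the weight the other way (y = -x).
train_sets = {
    "helpful": [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, 4.0)],
    "conflicting": [(x, -1.0 * x) for x in (1.0, 2.0)],
}
test_set = [(1.5, 3.0), (2.5, 5.0)]

# Train once on the pooled data, as a stand-in for pretraining.
pooled = [pair for data in train_sets.values() for pair in data]
w = descend(0.0, pooled, lr=0.05, steps=100)

untrac_scores = untrac(w, train_sets, test_set)
untrac_inv_scores = untrac_inv(w, train_sets, test_set)
print(untrac_scores)
print(untrac_inv_scores)
```

In this toy setting, a positive score means that unlearning raised the loss, so the dataset is estimated to have helped on the test data; a negative score means it hurt. Note the cost difference: `untrac` runs one unlearning pass per training dataset, whereas `untrac_inv` runs a single unlearning pass on the test dataset regardless of how many training datasets there are.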