As artificial neural networks, and specifically large language models, have rapidly improved in capability and quality, they have increasingly been deployed in real-world applications, from customer service to Google search, even though they frequently make factually incorrect or undesirable statements. This trend has inspired practical and academic interest in model editing: adjusting a model's weights to modify its likely outputs for queries relating to a specific fact or set of facts. This may be done either to amend a fact or set of facts, for instance, to fix a frequent error in the training data, or to suppress a fact or set of facts entirely, for instance, in the case of dangerous knowledge. Multiple methods have been proposed to perform such edits. At the same time, however, it has been shown that model editing can be brittle and incomplete. Moreover, the effectiveness of any model editing method necessarily depends on the data on which the model is trained, and, therefore, a good understanding of how the training data distribution interacts with the way facts are stored in the network is essential for reliable model editing. However, working with large language models trained on real-world data does not allow us to isolate this relationship or fully measure the effects of model editing. We therefore propose Behemoth, a fully synthetic data generation framework. To demonstrate the practical insights the framework affords, we explore model editing in the context of simple tabular data, and report surprising findings that in some cases echo real-world results, for instance, that restricting the update rank can yield a more effective edit. The code is available at https://github.com/IST-DASLab/behemoth.git.
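To make the notion of a rank-restricted weight update concrete, the following is a minimal, generic sketch (not the paper's actual method): a proposed weight update is projected onto its top singular directions via truncated SVD before being applied, so that the applied edit has at most the chosen rank. The function name and shapes are illustrative assumptions.

```python
import numpy as np

def low_rank_update(W, delta, rank):
    """Apply a rank-restricted version of the proposed update `delta`
    to weight matrix W by keeping only its top-`rank` singular directions.
    Illustrative sketch; not the method from the paper."""
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    delta_r = U[:, :rank] @ np.diag(S[:rank]) @ Vt[:rank, :]
    return W + delta_r

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))       # current weights
delta = rng.standard_normal((8, 8))   # full-rank proposed edit
W_new = low_rank_update(W, delta, rank=2)

# The applied change W_new - W has rank at most 2.
print(np.linalg.matrix_rank(W_new - W))
```

Restricting the rank in this way constrains the edit to a small subspace of the weight matrix, which, as the abstract notes, can in some cases make the update more effective than applying the full-rank change.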