Large language models (LLMs) encode a wealth of knowledge in their parameters. However, this knowledge may become outdated or unsuitable over time. As a result, there has been growing interest in knowledge editing for LLMs and in evaluating its effectiveness. Existing studies primarily perform knowledge editing with factual triplets, which are costly to collect and struggle to express complex facts. Furthermore, these studies are often limited in their evaluation perspectives. In this paper, we propose Eva-KELLM, a new benchmark for evaluating knowledge editing of LLMs. The benchmark includes an evaluation framework and a corresponding dataset. Under our framework, we first ask the LLM to perform knowledge editing using raw documents, which is a more convenient and universal approach than using factual triplets. We then evaluate the updated LLM from multiple perspectives. In addition to assessing the effectiveness of knowledge editing and the retention of unrelated knowledge, as in conventional studies, we further test the LLM's ability in two aspects: 1) reasoning with the altered knowledge, which checks whether the LLM has genuinely learned the altered knowledge rather than simply memorized it; and 2) cross-lingual knowledge transfer, where an LLM updated with raw documents in one language should be able to handle queries in another language. To facilitate further research, we construct and release the corresponding dataset. Using this benchmark, we investigate the effectiveness of several commonly used knowledge editing methods. Experimental results indicate that current methods for knowledge editing using raw documents do not yield satisfactory results, particularly for reasoning with altered knowledge and cross-lingual knowledge transfer.