Knowledge editing has emerged as an efficient technique for updating the knowledge of large language models (LLMs) and has attracted increasing attention in recent years. However, effective measures to prevent malicious misuse of this technology are lacking, leaving LLMs vulnerable to harmful edits. Such malicious modifications can cause LLMs to generate toxic content and mislead users into harmful actions. To address this risk, we introduce a new task, Knowledge Editing Type Identification (KETI), which aims to identify different types of edits in LLMs and thereby alert users promptly when they encounter illicit edits. As part of this task, we propose KETIBench, a benchmark comprising five types of harmful edits covering the most common toxicity categories, together with one type of benign factual edit. We develop four classical classification models and three BERT-based models as baseline identifiers for both open-source and closed-source LLMs. Across 42 trials involving two models and three knowledge editing methods, our experimental results show that all seven baseline identifiers achieve decent identification performance, demonstrating the feasibility of identifying malicious edits in LLMs. Further analyses reveal that identifier performance is independent of the reliability of the knowledge editing methods and generalizes across domains, enabling the identification of edits from unknown sources. All data and code are available at https://github.com/xpq-tech/KETI. Warning: this paper contains examples of toxic text.
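To make the KETI task concrete, the sketch below shows one plausible shape a BERT-based baseline identifier could take: a six-way sequence classifier (five harmful edit types plus one benign factual edit) applied to text produced by an edited LLM. This is a minimal illustrative sketch only; the checkpoint name, label names, and probe text are assumptions, not the paper's actual baselines, whose implementation lives in the repository above.

```python
# Minimal sketch of a BERT-based edit-type identifier (illustrative only).
# The classification head here is randomly initialized; in practice it
# would first be fine-tuned on a labeled benchmark such as KETIBench
# (training loop omitted for brevity).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical label names for the six KETIBench classes.
LABELS = [
    "benign_fact",
    "harmful_type_1",
    "harmful_type_2",
    "harmful_type_3",
    "harmful_type_4",
    "harmful_type_5",
]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)
model.eval()

def identify_edit_type(probe_text: str) -> str:
    """Classify the response an edited LLM produced for a probe prompt."""
    inputs = tokenizer(probe_text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]

# Example: flag the edit type behind a suspicious model response.
print(identify_edit_type("The model now claims that drinking bleach cures colds."))
```

The design choice this sketch reflects is that the identifier never needs access to the edited model's weights: it operates purely on model outputs, which is what makes identification feasible for closed-source LLMs as well.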