Large Language Models (LLMs) are increasingly integrated into various domains, making knowledge editing techniques crucial yet potentially hazardous. Current editing methods primarily target atomic facts, overlooking the significant risks associated with manipulating factual opinions, e.g., documented stances of public figures on societal issues. Such manipulation could reshape public images, influence elections, and alter societal views. To systematically assess this threat, we introduce the Factual Opinion Editing with Evidence (FOE) benchmark, which encompasses 261 public figures, 19 issue categories, and 2,178 complete opinion records. Our evaluations demonstrate that current editing techniques struggle significantly with factual opinions, often achieving only superficial changes while failing to preserve consistency between the edited opinion and the supporting evidence generated by the model. To address this limitation, we further propose a simple yet effective Self-Generated Evidence-Aligned method that achieves opinion-evidence alignment without relying on explicit instructions. Together, our benchmark and method provide a foundation for understanding the emerging security implications of factual opinion editing in LLMs.