Large language models (LLMs) have gained increasing prominence in artificial intelligence, with a profound impact on society and on industries such as business and science. However, false information on the internet and in text corpora poses a significant risk to the reliability and safety of LLMs, underscoring the urgent need to understand how false information influences LLM behavior. In this paper, we investigate how false information spreads in LLMs and affects related responses. Specifically, in a series of experiments, we examine factors that can influence the spread of information in LLMs by comparing three degrees of information relevance (direct, indirect, and peripheral), four information source styles (Twitter, web blogs, news reports, and research papers), and two common knowledge injection paradigms (in-context injection and learning-based injection). The experimental results show that (1) false information spreads and contaminates related memories in LLMs via a semantic diffusion process, i.e., false information has global detrimental effects beyond its direct impact; (2) current LLMs are susceptible to authority bias, i.e., they are more likely to follow false information presented in trustworthy styles such as news reports and research papers, which usually causes deeper and wider pollution of information; and (3) current LLMs are more sensitive to false information through in-context injection than through learning-based injection, which severely challenges the reliability and safety of LLMs even when all training data are trustworthy and correct. These findings raise the need for new false information defense algorithms that address the global impact of false information, and for new alignment algorithms that lead LLMs, without bias, to follow essential human values rather than superficial patterns.
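The three factors compared in the experiments can be laid out as a small grid of conditions. The sketch below is illustrative only: it assumes the factors are fully crossed, which the abstract does not state, and the identifiers (`RELEVANCE`, `STYLES`, `INJECTION`, `build_conditions`) are hypothetical names rather than anything taken from the paper.

```python
from itertools import product

# Factor levels as named in the abstract; the identifiers are hypothetical.
RELEVANCE = ("direct", "indirect", "peripheral")
STYLES = ("twitter", "web_blog", "news_report", "research_paper")
INJECTION = ("in_context", "learning_based")


def build_conditions():
    """Return every (relevance, style, injection) combination as a dict.

    Assumes a fully crossed design; the paper may evaluate only a subset.
    """
    return [
        {"relevance": r, "style": s, "injection": i}
        for r, s, i in product(RELEVANCE, STYLES, INJECTION)
    ]


conditions = build_conditions()
print(len(conditions))  # 3 * 4 * 2 = 24
```

Under that assumption, each of the 24 conditions pairs one relevance level with one source style and one injection paradigm, so a measured effect can be attributed to a single factor by holding the other two fixed.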