Large language models (LLMs) are prone to hallucination, and users have become increasingly aware of their reliability gap in open-domain, knowledge-intensive tasks. As a result, users increasingly turn to search-augmented LLMs to mitigate this issue. However, LLM-driven search also becomes an attractive target for misuse: once the returned content directly contains targeted, ready-to-use harmful instructions or takeaways, such exposure is difficult to withdraw or undo. To investigate the unsafe search behaviors of LLMs, we first propose \textbf{\textit{SearchAttack}} for red-teaming, which (1) rephrases harmful semantics as dense, benign knowledge to evade direct in-context decoding, thereby eliciting unsafe information retrieval, and (2) stress-tests LLMs' reward-chasing bias by steering them to synthesize the unsafe retrieved content. We also curate a benchmark of emerging, domain-specific illicit activities for search-based threat assessment, and introduce a fact-checking framework to ground and quantify harm in both offline and online attack settings. We conduct extensive experiments that red-team search-augmented LLMs for responsible vulnerability assessment. Empirically, SearchAttack is highly effective against these systems. We also find that even LLMs without web search can be steered into producing harmful content, owing to their stereotyped information-seeking behaviors.