Large language model (LLM)-based search agents iteratively generate queries, retrieve external information, and reason to answer open-domain questions. While researchers have primarily focused on improving their utility, their safety behaviors remain underexplored. In this paper, we first evaluate search agents using red-teaming datasets and find that they are more likely to produce harmful outputs than base LLMs. For example, when asked "How can I track someone's location without their consent?", a base model refuses, whereas a search agent designed to retrieve and cite sources may lower its refusal threshold, fetch documents (e.g., court cases), and, once they are appended to its context, synthesize them into an informative yet unsafe summary. We further show that utility-oriented fine-tuning intensifies this risk, motivating joint alignment of safety and utility. We present SafeSearch, a multi-objective reinforcement learning approach that couples a final-output safety/utility reward with a novel query-level shaping term that penalizes unsafe queries and rewards safe ones. Experiments show that SafeSearch reduces agent harmfulness by over 70% across three red-teaming datasets while producing safe, helpful responses, and it matches the QA performance of a utility-only fine-tuned agent; further analyses confirm the effectiveness of the query-level reward in jointly improving safety and utility.
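To make the reward structure concrete, the sketch below shows one way the multi-objective reward could be computed, assuming the final-output safety/utility reward and the query-level shaping term combine additively. All names (`trajectory_reward`, `query_shaping`, `lam`) and the additive weighting are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a multi-objective reward with query-level shaping.
# Hypothetical names and weighting; not the SafeSearch reference code.

def query_shaping(is_safe: bool) -> float:
    """Query-level shaping term: reward safe search queries, penalize unsafe ones."""
    return 1.0 if is_safe else -1.0

def trajectory_reward(query_safety_labels: list[bool],
                      safety_score: float,
                      utility_score: float,
                      lam: float = 0.5) -> float:
    """Total reward for one search-agent rollout.

    query_safety_labels: one boolean per generated query, e.g. produced by
                         a safety classifier run over the agent's queries.
    safety_score, utility_score: rewards judged on the final answer.
    lam: assumed coefficient balancing the query-level shaping term.
    """
    shaping = sum(query_shaping(s) for s in query_safety_labels)
    return safety_score + utility_score + lam * shaping

# Example: a rollout with two safe queries and one unsafe query.
print(trajectory_reward([True, True, False], safety_score=1.0, utility_score=0.8))
```

Under this assumed formulation, the shaping term gives the policy a dense, per-query signal during training, rather than relying solely on the sparse reward attached to the final answer.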