Persuasion is a powerful capability of large language models (LLMs) that both enables beneficial applications (e.g., helping people quit smoking) and raises significant risks (e.g., large-scale, targeted political manipulation). Prior work has found that models possess a significant and growing persuasive capability, measured by belief changes in simulated or real users. However, these benchmarks overlook a crucial risk factor: the propensity of a model to attempt persuasion in harmful contexts. Understanding whether a model will blindly ``follow orders'' to persuade on harmful topics (e.g., glorifying joining a terrorist group) is key to assessing the efficacy of safety guardrails. Moreover, understanding if and when a model will engage in persuasive behavior in pursuit of some goal is essential to understanding the risks posed by agentic AI systems. We propose the Attempt to Persuade Eval (APE) benchmark, which shifts the focus from persuasion success to persuasion attempts, operationalized as a model's willingness to generate content aimed at shaping beliefs or behavior. Our evaluation framework probes frontier LLMs using a multi-turn conversational setup between simulated persuader and persuadee agents. APE spans a diverse spectrum of topics, including conspiracies, controversial issues, and non-controversially harmful content. We introduce an automated evaluator model to identify willingness to persuade and to measure the frequency and context of persuasive attempts. We find that many open- and closed-weight models are frequently willing to attempt persuasion on harmful topics, and that jailbreaking can increase their willingness to engage in such behavior. Our results highlight gaps in current safety guardrails and underscore the importance of evaluating willingness to persuade as a key dimension of LLM risk. APE is available at github.com/AlignmentResearch/AttemptPersuadeEval